RE: Deserializing map column via JDBC (HIVE-1378)

2010-09-02 Thread Steven Wong
> The simplest thing to do is to:
> 1. Rename "useJSONforLazy" to "useDelimitedJSON";
> 2. Use "DelimitedJSONSerDe" when useDelimitedJSON = true;

So, DelimitedJSONSerDe will need the same deserialization capability as 
LazySimpleSerDe?


-Original Message-
From: Zheng Shao [mailto:zs...@facebook.com] 
Sent: Thursday, September 02, 2010 7:19 PM
To: Steven Wong; hive-dev@hadoop.apache.org
Subject: RE: Deserializing map column via JDBC (HIVE-1378)

Earlier there was no multi-level delimited format - the only way was first-level 
delimited, and then JSON.
Some legacy scripts/apps have been written to work with that.

Later we introduced the multi-level delimited format, and made the hack to put them 
together.
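
For concreteness, a made-up row with columns (id INT, tags MAP<STRING,INT>) might look 
like this in the two formats; the values, and the choice of tab as the first-level 
delimiter, are only for illustration:

// Illustration only: the legacy "first-level delimited + JSON" rendering versus a
// multi-level delimited rendering of the same row. \u0002 and \u0003 are Hive's default
// collection-item and map-key delimiters.
public class RowFormatIllustration {
  public static void main(String[] args) {
    String firstLevelPlusJson  = "7\t{\"a\":1,\"b\":2}";      // complex column rendered as JSON
    String multiLevelDelimited = "7\ta\u00031\u0002b\u00032"; // nested delimiters instead of JSON
    System.out.println(firstLevelPlusJson);
    System.out.println(multiLevelDelimited);
  }
}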

Zheng
-Original Message-
From: Steven Wong [mailto:sw...@netflix.com] 
Sent: Friday, September 03, 2010 10:17 AM
To: Zheng Shao; hive-dev@hadoop.apache.org
Subject: RE: Deserializing map column via JDBC (HIVE-1378)

Why was/is useJSONforLazy needed? What's the historical background?


-Original Message-
From: Zheng Shao [mailto:zs...@facebook.com] 
Sent: Thursday, September 02, 2010 7:11 PM
To: Steven Wong; hive-dev@hadoop.apache.org
Subject: RE: Deserializing map column via JDBC (HIVE-1378)

The simplest thing to do is to:
1. Rename "useJSONforLazy" to "useDelimitedJSON";
2. Use "DelimitedJSONSerDe" when useDelimitedJSON = true;

Zheng
-Original Message-
From: Steven Wong [mailto:sw...@netflix.com] 
Sent: Friday, September 03, 2010 10:05 AM
To: Zheng Shao; hive-dev@hadoop.apache.org
Subject: RE: Deserializing map column via JDBC (HIVE-1378)

Zheng,

In LazySimpleSerDe.initSerdeParams:

String useJsonSerialize = tbl
    .getProperty(Constants.SERIALIZATION_USE_JSON_OBJECTS);
serdeParams.jsonSerialize = (useJsonSerialize != null && useJsonSerialize
    .equalsIgnoreCase("true"));

SERIALIZATION_USE_JSON_OBJECTS is set to true in PlanUtils.getTableDesc:

// It is not a very clean way, and should be modified later - due to
// compatibility reasons,
// user sees the results as json for custom scripts and has no way for
// specifying that.
// Right now, it is hard-coded in the code
if (useJSONForLazy) {
  properties.setProperty(Constants.SERIALIZATION_USE_JSON_OBJECTS, "true");
}

useJSONForLazy is true in the following 2 calls to PlanUtils.getTableDesc:

SemanticAnalyzer.genScriptPlan -> PlanUtils.getTableDesc
SemanticAnalyzer.genScriptPlan -> SemanticAnalyzer.getTableDescFromSerDe -> 
PlanUtils.getTableDesc

What is it all about and how should we untangle it (ideally get rid of 
SERIALIZATION_USE_JSON_OBJECTS)?

Thanks.
Steven


-Original Message-
From: Zheng Shao [mailto:zs...@facebook.com] 
Sent: Wednesday, September 01, 2010 6:45 PM
To: Steven Wong; hive-dev@hadoop.apache.org; John Sichi
Cc: Jerome Boulon
Subject: RE: Deserializing map column via JDBC (HIVE-1378)

Hi Steven,

As far as I remember, the only use case of JSON logic in LazySimpleSerDe is the 
FetchTask.   Even if there are other cases, we should be able to catch it in 
unit tests.

The potential risk is small enough, and the benefit of cleaning it up is pretty 
big - it makes the code much easier to understand.

Thanks for getting to it Steven!  I am very happy to see that this finally gets 
cleaned up!

Zheng
-Original Message-
From: Steven Wong [mailto:sw...@netflix.com] 
Sent: Thursday, September 02, 2010 7:45 AM
To: Zheng Shao; hive-dev@hadoop.apache.org; John Sichi
Cc: Jerome Boulon
Subject: RE: Deserializing map column via JDBC (HIVE-1378)

Your suggestion is in line with my earlier proposal of fixing FetchTask. The 
only major difference is the moving of the JSON-related logic from 
LazySimpleSerDe to a new serde called DelimitedJSONSerDe.

Is it safe to get rid of the JSON-related logic in LazySimpleSerDe? Sounds like 
you're implying that it is safe, but I'd like to confirm with you. I don't 
really know whether there are components other than FetchTask that rely on 
LazySimpleSerDe and its JSON capability (the useJSONSerialize flag doesn't have 
to be true for LazySimpleSerDe to use JSON).

If it is safe, I am totally fine with introducing DelimitedJSONSerDe.

Combining your suggestion and my proposal would look like this (a rough sketch of the 
step 3 dispatch follows the list):

0. Move JSON serialization logic from LazySimpleSerDe to a new serde called 
DelimitedJSONSerDe.
1. By default, hive.fetch.output.serde = DelimitedJSONSerDe.
2. When JDBC driver connects to Hive server, execute "set 
hive.fetch.output.serde = LazySimpleSerDe".
3. In Hive server:
  (a) If hive.fetch.output.serde == DelimitedJSONSerDe, FetchTask uses 
DelimitedJSONSerDe to maintain today's serialization behavior (tab for field 
delimiter, "NULL" for null sequence, JSON for non-primitives).
  (b) If hive.fetch.output.serde == LazySimpleSerDe, FetchTask uses 
LazySimpleSerDe with a schema to ctrl-delimit everything.
4. JDBC driver deserializes with LazySimpleSerDe instead of DynamicSerDe.
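
A rough sketch of the step 3 dispatch (hive.fetch.output.serde and DelimitedJSONSerDe 
are proposed names that do not exist yet; this only illustrates the intended behavior, 
not actual FetchTask code):

import java.util.HashMap;
import java.util.Map;

// Self-contained sketch: prints which output format FetchTask would use for each
// setting of the proposed hive.fetch.output.serde property.
public class FetchSerDeDispatchSketch {
  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<String, String>();
    // Step 2: the JDBC driver would issue "set hive.fetch.output.serde = LazySimpleSerDe".
    conf.put("hive.fetch.output.serde", "LazySimpleSerDe");

    String requested = conf.containsKey("hive.fetch.output.serde")
        ? conf.get("hive.fetch.output.serde") : "DelimitedJSONSerDe";
    if ("LazySimpleSerDe".equals(requested)) {
      System.out.println("FetchTask output: ctrl-A delimited rows for the JDBC driver");
    } else {
      System.out.println("FetchTask output: tab-delimited rows, JSON for non-primitives");
    }
  }
}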

Steven


-Original Message-
From:

RE: Deserializing map column via JDBC (HIVE-1378)

2010-09-02 Thread Steven Wong
Why was/is useJSONforLazy needed? What's the historical background?


-Original Message-
From: Zheng Shao [mailto:zs...@facebook.com] 
Sent: Thursday, September 02, 2010 7:11 PM
To: Steven Wong; hive-dev@hadoop.apache.org
Subject: RE: Deserializing map column via JDBC (HIVE-1378)

The simplest thing to do is to:
1. Rename "useJSONforLazy" to "useDelimitedJSON";
2. Use "DelimitedJSONSerDe" when useDelimitedJSON = true;

Zheng
-Original Message-
From: Steven Wong [mailto:sw...@netflix.com] 
Sent: Friday, September 03, 2010 10:05 AM
To: Zheng Shao; hive-dev@hadoop.apache.org
Subject: RE: Deserializing map column via JDBC (HIVE-1378)

Zheng,

In LazySimpleSerDe.initSerdeParams:

String useJsonSerialize = tbl
    .getProperty(Constants.SERIALIZATION_USE_JSON_OBJECTS);
serdeParams.jsonSerialize = (useJsonSerialize != null && useJsonSerialize
    .equalsIgnoreCase("true"));

SERIALIZATION_USE_JSON_OBJECTS is set to true in PlanUtils.getTableDesc:

// It is not a very clean way, and should be modified later - due to
// compatibility reasons,
// user sees the results as json for custom scripts and has no way for
// specifying that.
// Right now, it is hard-coded in the code
if (useJSONForLazy) {
  properties.setProperty(Constants.SERIALIZATION_USE_JSON_OBJECTS, "true");
}

useJSONForLazy is true in the following 2 calls to PlanUtils.getTableDesc:

SemanticAnalyzer.genScriptPlan -> PlanUtils.getTableDesc
SemanticAnalyzer.genScriptPlan -> SemanticAnalyzer.getTableDescFromSerDe -> 
PlanUtils.getTableDesc

What is it all about and how should we untangle it (ideally get rid of 
SERIALIZATION_USE_JSON_OBJECTS)?

Thanks.
Steven


-Original Message-
From: Zheng Shao [mailto:zs...@facebook.com] 
Sent: Wednesday, September 01, 2010 6:45 PM
To: Steven Wong; hive-dev@hadoop.apache.org; John Sichi
Cc: Jerome Boulon
Subject: RE: Deserializing map column via JDBC (HIVE-1378)

Hi Steven,

As far as I remember, the only use case of JSON logic in LazySimpleSerDe is the 
FetchTask.   Even if there are other cases, we should be able to catch it in 
unit tests.

The potential risk is small enough, and the benefit of cleaning it up is pretty 
big - it makes the code much easier to understand.

Thanks for getting to it Steven!  I am very happy to see that this finally gets 
cleaned up!

Zheng
-Original Message-
From: Steven Wong [mailto:sw...@netflix.com] 
Sent: Thursday, September 02, 2010 7:45 AM
To: Zheng Shao; hive-dev@hadoop.apache.org; John Sichi
Cc: Jerome Boulon
Subject: RE: Deserializing map column via JDBC (HIVE-1378)

Your suggestion is in line with my earlier proposal of fixing FetchTask. The 
only major difference is the moving of the JSON-related logic from 
LazySimpleSerDe to a new serde called DelimitedJSONSerDe.

Is it safe to get rid of the JSON-related logic in LazySimpleSerDe? Sounds like 
you're implying that it is safe, but I'd like to confirm with you. I don't 
really know whether there are components other than FetchTask that rely on 
LazySimpleSerDe and its JSON capability (the useJSONSerialize flag doesn't have 
to be true for LazySimpleSerDe to use JSON).

If it is safe, I am totally fine with introducing DelimitedJSONSerDe.

Combining your suggestion and my proposal would look like:

0. Move JSON serialization logic from LazySimpleSerDe to a new serde called 
DelimitedJSONSerDe.
1. By default, hive.fetch.output.serde = DelimitedJSONSerDe.
2. When JDBC driver connects to Hive server, execute "set 
hive.fetch.output.serde = LazySimpleSerDe".
3. In Hive server:
  (a) If hive.fetch.output.serde == DelimitedJSONSerDe, FetchTask uses 
DelimitedJSONSerDe to maintain today's serialization behavior (tab for field 
delimiter, "NULL" for null sequence, JSON for non-primitives).
  (b) If hive.fetch.output.serde == LazySimpleSerDe, FetchTask uses 
LazySimpleSerDe with a schema to ctrl-delimit everything.
4. JDBC driver deserializes with LazySimpleSerDe instead of DynamicSerDe.

Steven


-Original Message-
From: Zheng Shao [mailto:zs...@facebook.com] 
Sent: Wednesday, September 01, 2010 3:22 AM
To: Steven Wong; hive-dev@hadoop.apache.org; John Sichi
Cc: Jerome Boulon
Subject: RE: Deserializing map column via JDBC (HIVE-1378)

Hi Steven,

Sorry for the late reply. The email slipped past me...


This issue was brought up multiple times.  In my opinion, using JSON in 
LazySimpleSerDe (inherited from ColumnsetSerDe, MetadataColumnsetSerDe, 
DynamicSerDe) was a long-time legacy problem that never got fixed.   
LazySimpleSerDe was supposed to do delimited format only.


The cleanest way to do that is to:
1. Get rid of the JSON-related logic in LazySimpleSerDe;
2. Introduce another "DelimitedJSONSerDe" (without deserialization capability) 
that does JSON serialization for complex fields. (We have never had, or needed, 
JSON deserialization yet.)
3. Configure the FetchTask to use the new SerDe by default, and LazySimpleSerDe 
in case it's JDBC.  This 

RE: Deserializing map column via JDBC (HIVE-1378)

2010-09-02 Thread Steven Wong
Zheng,

In LazySimpleSerDe.initSerdeParams:

String useJsonSerialize = tbl
    .getProperty(Constants.SERIALIZATION_USE_JSON_OBJECTS);
serdeParams.jsonSerialize = (useJsonSerialize != null && useJsonSerialize
    .equalsIgnoreCase("true"));

SERIALIZATION_USE_JSON_OBJECTS is set to true in PlanUtils.getTableDesc:

// It is not a very clean way, and should be modified later - due to
// compatibility reasons,
// user sees the results as json for custom scripts and has no way for
// specifying that.
// Right now, it is hard-coded in the code
if (useJSONForLazy) {
  properties.setProperty(Constants.SERIALIZATION_USE_JSON_OBJECTS, "true");
}

useJSONForLazy is true in the following 2 calls to PlanUtils.getTableDesc:

SemanticAnalyzer.genScriptPlan -> PlanUtils.getTableDesc
SemanticAnalyzer.genScriptPlan -> SemanticAnalyzer.getTableDescFromSerDe -> 
PlanUtils.getTableDesc

What is it all about and how should we untangle it (ideally get rid of 
SERIALIZATION_USE_JSON_OBJECTS)?

Thanks.
Steven


-Original Message-
From: Zheng Shao [mailto:zs...@facebook.com] 
Sent: Wednesday, September 01, 2010 6:45 PM
To: Steven Wong; hive-dev@hadoop.apache.org; John Sichi
Cc: Jerome Boulon
Subject: RE: Deserializing map column via JDBC (HIVE-1378)

Hi Steven,

As far as I remember, the only use case of JSON logic in LazySimpleSerDe is the 
FetchTask.   Even if there are other cases, we should be able to catch it in 
unit tests.

The potential risk is small enough, and the benefit of cleaning it up is pretty 
big - it makes the code much easier to understand.

Thanks for getting to it Steven!  I am very happy to see that this finally gets 
cleaned up!

Zheng
-Original Message-
From: Steven Wong [mailto:sw...@netflix.com] 
Sent: Thursday, September 02, 2010 7:45 AM
To: Zheng Shao; hive-dev@hadoop.apache.org; John Sichi
Cc: Jerome Boulon
Subject: RE: Deserializing map column via JDBC (HIVE-1378)

Your suggestion is in line with my earlier proposal of fixing FetchTask. The 
only major difference is the moving of the JSON-related logic from 
LazySimpleSerDe to a new serde called DelimitedJSONSerDe.

Is it safe to get rid of the JSON-related logic in LazySimpleSerDe? Sounds like 
you're implying that it is safe, but I'd like to confirm with you. I don't 
really know whether there are components other than FetchTask that rely on 
LazySimpleSerDe and its JSON capability (the useJSONSerialize flag doesn't have 
to be true for LazySimpleSerDe to use JSON).

If it is safe, I am totally fine with introducing DelimitedJSONSerDe.

Combining your suggestion and my proposal would look like:

0. Move JSON serialization logic from LazySimpleSerDe to a new serde called 
DelimitedJSONSerDe.
1. By default, hive.fetch.output.serde = DelimitedJSONSerDe.
2. When JDBC driver connects to Hive server, execute "set 
hive.fetch.output.serde = LazySimpleSerDe".
3. In Hive server:
  (a) If hive.fetch.output.serde == DelimitedJSONSerDe, FetchTask uses 
DelimitedJSONSerDe to maintain today's serialization behavior (tab for field 
delimiter, "NULL" for null sequence, JSON for non-primitives).
  (b) If hive.fetch.output.serde == LazySimpleSerDe, FetchTask uses 
LazySimpleSerDe with a schema to ctrl-delimit everything.
4. JDBC driver deserializes with LazySimpleSerDe instead of DynamicSerDe.

Steven


-Original Message-
From: Zheng Shao [mailto:zs...@facebook.com] 
Sent: Wednesday, September 01, 2010 3:22 AM
To: Steven Wong; hive-dev@hadoop.apache.org; John Sichi
Cc: Jerome Boulon
Subject: RE: Deserializing map column via JDBC (HIVE-1378)

Hi Steven,

Sorry for the late reply. The email slipped past me...


This issue was brought up multiple times.  In my opinion, using JSON in 
LazySimpleSerDe (inherited from ColumnsetSerDe, MetadataColumnsetSerDe, 
DynamicSerDe) was a long-time legacy problem that never got fixed.   
LazySimpleSerDe was supposed to do delimited format only.


The cleanest way to do that is to:
1. Get rid of the JSON-related logic in LazySimpleSerDe;
2. Introduce another "DelimitedJSONSerDe" (without deserialization capability) 
that does JSON serialization for complex fields. (We have never had, or needed, 
JSON deserialization yet.)
3. Configure the FetchTask to use the new SerDe by default, and LazySimpleSerDe 
in case it's JDBC. This is for serialization only. We might need to have 2 
SerDe fields in FetchTask - one for deserializing the data from the file, one for 
serializing the data to stdout/JDBC, etc. (see the sketch below).
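
A rough sketch of the two-SerDe idea; the field and method names are made up and this 
is not FetchTask's real structure:

import org.apache.hadoop.hive.serde2.Deserializer;
import org.apache.hadoop.hive.serde2.Serializer;

// Sketch only: one serde reads rows back from the fetch files, the other renders each
// row for stdout or the JDBC client.
class FetchTaskSerDesSketch {
  private Deserializer inputSerDe;  // deserializes the data read from the result files
  private Serializer outputSerDe;   // serializes each fetched row for stdout/JDBC

  void setSerDes(Deserializer in, Serializer out) {
    this.inputSerDe = in;
    this.outputSerDe = out;
  }
}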


I can help review the code (please ping me) if you decide to go down this route.

Zheng
-Original Message-
From: Steven Wong [mailto:sw...@netflix.com] 
Sent: Monday, August 30, 2010 3:46 PM
To: hive-dev@hadoop.apache.org; John Sichi
Cc: Zheng Shao; Jerome Boulon
Subject: RE: Deserializing map column via JDBC (HIVE-1378)

Any guidance on how I/we should proceed on HIVE-1378 and HIVE-1606?


-Original Message-
From: Steven Wong 
Sent: Friday, August 27, 201

[jira] Commented: (HIVE-1609) Support partition filtering in metastore

2010-09-02 Thread Carl Steinbach (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905769#action_12905769
 ] 

Carl Steinbach commented on HIVE-1609:
--

DynamicSerDe is the component that has a JavaCC dependency. I think 
DynamicSerDe (and TCTLSeparatedProtocol) were deprecated a long time ago. 
Should we try to remove this code?

> Support partition filtering in metastore
> 
>
> Key: HIVE-1609
> URL: https://issues.apache.org/jira/browse/HIVE-1609
> Project: Hadoop Hive
>  Issue Type: New Feature
>  Components: Metastore
>Reporter: Ajay Kidave
> Fix For: 0.7.0
>
> Attachments: hive_1609.patch, hive_1609_2.patch
>
>
> The metastore needs to have support for returning a list of partitions based 
> on user specified filter conditions. This will be useful for tools which need 
> to do partition pruning. Howl is one such use case. The way partition pruning 
> is done during hive query execution need not be changed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-1609) Support partition filtering in metastore

2010-09-02 Thread Namit Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905768#action_12905768
 ] 

Namit Jain commented on HIVE-1609:
--

I think we should stick to ANTLR only - let us not check in JavaCC

> Support partition filtering in metastore
> 
>
> Key: HIVE-1609
> URL: https://issues.apache.org/jira/browse/HIVE-1609
> Project: Hadoop Hive
>  Issue Type: New Feature
>  Components: Metastore
>Reporter: Ajay Kidave
> Fix For: 0.7.0
>
> Attachments: hive_1609.patch, hive_1609_2.patch
>
>
> The metastore needs to have support for returning a list of partitions based 
> on user specified filter conditions. This will be useful for tools which need 
> to do partition pruning. Howl is one such use case. The way partition pruning 
> is done during hive query execution need not be changed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-1546) Ability to plug custom Semantic Analyzers for Hive Grammar

2010-09-02 Thread Namit Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905767#action_12905767
 ] 

Namit Jain commented on HIVE-1546:
--

Sorry for jumping in on this late.

I quickly reviewed http://wiki.apache.org/pig/Howl/HowlCliFuncSpec, and it 
seems like most of the functionality is already
present in Hive. So, we need a way to restrict other types of statements - is 
that a fair statement?

If there is a slight change needed in Hive (for some Howl behavior), we can add 
it to Hive? 
Why do we need a brand new client?


> Ability to plug custom Semantic Analyzers for Hive Grammar
> --
>
> Key: HIVE-1546
> URL: https://issues.apache.org/jira/browse/HIVE-1546
> Project: Hadoop Hive
>  Issue Type: Improvement
>  Components: Metastore
>Affects Versions: 0.7.0
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Fix For: 0.7.0
>
> Attachments: hive-1546-3.patch, hive-1546-4.patch, hive-1546.patch, 
> hive-1546_2.patch
>
>
> It will be useful if Semantic Analysis phase is made pluggable such that 
> other projects can do custom analysis of hive queries before doing metastore 
> operations on them. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-1580) cleanup ExecDriver.progress

2010-09-02 Thread Namit Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905766#action_12905766
 ] 

Namit Jain commented on HIVE-1580:
--

+1

> cleanup ExecDriver.progress
> ---
>
> Key: HIVE-1580
> URL: https://issues.apache.org/jira/browse/HIVE-1580
> Project: Hadoop Hive
>  Issue Type: Bug
>  Components: Query Processor
>Reporter: Joydeep Sen Sarma
>Assignee: Joydeep Sen Sarma
> Attachments: hive-1580.1.patch
>
>
> a few problems:
> - if a job is retired - then counters cannot be obtained and a stack trace is 
> printed out (from history code). this confuses users
> - too many calls to getCounters. after a job has been detected to be finished 
> - there are quite a few more calls to get the job status and the counters. we 
> need to figure out a way to curtail this - in busy clusters the gap between 
> the job getting finished and the hive client noticing is very perceptible and 
> impacts user experience.
> calls to getCounters are very expensive in 0.20 as they grab a jobtracker 
> global lock (something we have fixed internally at FB)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HIVE-1580) cleanup ExecDriver.progress

2010-09-02 Thread Joydeep Sen Sarma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joydeep Sen Sarma updated HIVE-1580:


Attachment: hive-1580.1.patch

Clean up multiple calls to getCounters (which turns out to be a really expensive 
call in the JT) and don't print non-fatal stack traces to the console.

> cleanup ExecDriver.progress
> ---
>
> Key: HIVE-1580
> URL: https://issues.apache.org/jira/browse/HIVE-1580
> Project: Hadoop Hive
>  Issue Type: Bug
>  Components: Query Processor
>Reporter: Joydeep Sen Sarma
>Assignee: Joydeep Sen Sarma
> Attachments: hive-1580.1.patch
>
>
> a few problems:
> - if a job is retired - then counters cannot be obtained and a stack trace is 
> printed out (from history code). this confuses users
> - too many calls to getCounters. after a job has been detected to be finished 
> - there are quite a few more calls to get the job status and the counters. we 
> need to figure out a way to curtail this - in busy clusters the gap between 
> the job getting finished and the hive client noticing is very perceptible and 
> impacts user experience.
> calls to getCounters are very expensive in 0.20 as they grab a jobtracker 
> global lock (something we have fixed internally at FB)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-1546) Ability to plug custom Semantic Analyzers for Hive Grammar

2010-09-02 Thread Carl Steinbach (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905757#action_12905757
 ] 

Carl Steinbach commented on HIVE-1546:
--

I gather from Ashutosh's latest patch that you want to do the following:

* Provide your own implementation of HiveSemanticAnalyzerFactory.
* Subclass SemanticAnalyzer
* Subclass DDLSemanticAnalzyer

I looked at the public and protected members in these classes and think
that at a minimum we would have to mark the following classes as limited
private and evolving:

* HiveSemanticAnalyzerFactory
* BaseSemanticAnalyzer
* SemanticAnalyzer
* DDLSemanticAnalyzer
* ASTNode
* HiveParser (i.e. Hive's ANTLR grammar)
* SemanticAnalyzer Context (org.apache.hadoop.hive.ql.Context)
* Task and FetchTask
* QB
* QBParseInfo
* QBMetaData
* QBJoinTree
* CreateTableDesc

So anytime we touch one of these classes we would need to coordinate with the 
Howl folks to make sure we aren't breaking one of their plugins? I don't think 
this is a good tradeoff if the main benefit we can expect is a simpler build 
and release process for Howl.

> Ability to plug custom Semantic Analyzers for Hive Grammar
> --
>
> Key: HIVE-1546
> URL: https://issues.apache.org/jira/browse/HIVE-1546
> Project: Hadoop Hive
>  Issue Type: Improvement
>  Components: Metastore
>Affects Versions: 0.7.0
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Fix For: 0.7.0
>
> Attachments: hive-1546-3.patch, hive-1546-4.patch, hive-1546.patch, 
> hive-1546_2.patch
>
>
> It will be useful if Semantic Analysis phase is made pluggable such that 
> other projects can do custom analysis of hive queries before doing metastore 
> operations on them. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-1546) Ability to plug custom Semantic Analyzers for Hive Grammar

2010-09-02 Thread Namit Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905752#action_12905752
 ] 

Namit Jain commented on HIVE-1546:
--

Would it be possible to do it via a hook?

Do you want to allow a subset of operations? The hook is not very advanced 
right now, and you cannot change the query plan, etc.
But it might be good enough for disallowing a class of statements. We can add 
more parameters to the hook if need be.

That way, the change will be completely outside Hive, and we will be able to 
use the existing client, but with limited functionality.

> Ability to plug custom Semantic Analyzers for Hive Grammar
> --
>
> Key: HIVE-1546
> URL: https://issues.apache.org/jira/browse/HIVE-1546
> Project: Hadoop Hive
>  Issue Type: Improvement
>  Components: Metastore
>Affects Versions: 0.7.0
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Fix For: 0.7.0
>
> Attachments: hive-1546-3.patch, hive-1546-4.patch, hive-1546.patch, 
> hive-1546_2.patch
>
>
> It will be useful if Semantic Analysis phase is made pluggable such that 
> other projects can do custom analysis of hive queries before doing metastore 
> operations on them. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-1610) Using CombinedHiveInputFormat causes partToPartitionInfo IOException

2010-09-02 Thread He Yongqiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905751#action_12905751
 ] 

He Yongqiang commented on HIVE-1610:


Sammy, the only change in TestHiveFileFormatUtils is to remove the URI scheme 
checks (a 1-line change). 
You actually added some lines of code that had been removed by HIVE-1510, and this 
is the reason the testcase fails. 

> Using CombinedHiveInputFormat causes partToPartitionInfo IOException  
> --
>
> Key: HIVE-1610
> URL: https://issues.apache.org/jira/browse/HIVE-1610
> Project: Hadoop Hive
>  Issue Type: Bug
> Environment: Hadoop 0.20.2
>Reporter: Sammy Yu
> Attachments: 
> 0002-HIVE-1610.-Added-additional-schema-check-to-doGetPar.patch, 
> 0003-HIVE-1610.patch
>
>
> I have a relatively complicated hive query using CombinedHiveInputFormat:
> set hive.exec.dynamic.partition.mode=nonstrict;
> set hive.exec.dynamic.partition=true; 
> set hive.exec.max.dynamic.partitions=1000;
> set hive.exec.max.dynamic.partitions.pernode=300;
> set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
> INSERT OVERWRITE TABLE keyword_serp_results_no_dups PARTITION(week) select 
> distinct keywords.keyword, keywords.domain, keywords.url, keywords.rank, 
> keywords.universal_rank, keywords.serp_type, keywords.date_indexed, 
> keywords.search_engine_type, keywords.week from keyword_serp_results keywords 
> JOIN (select domain, keyword, search_engine_type, week, max_date_indexed, 
> min(rank) as best_rank from (select keywords1.domain, keywords1.keyword, 
> keywords1.search_engine_type,  keywords1.week, keywords1.rank, 
> dupkeywords1.max_date_indexed from keyword_serp_results keywords1 JOIN 
> (select domain, keyword, search_engine_type, week, max(date_indexed) as 
> max_date_indexed from keyword_serp_results group by 
> domain,keyword,search_engine_type,week) dupkeywords1 on keywords1.keyword = 
> dupkeywords1.keyword AND  keywords1.domain = dupkeywords1.domain AND 
> keywords1.search_engine_type = dupkeywords1.search_engine_type AND 
> keywords1.week = dupkeywords1.week AND keywords1.date_indexed = 
> dupkeywords1.max_date_indexed) dupkeywords2 group by 
> domain,keyword,search_engine_type,week,max_date_indexed ) dupkeywords3 on 
> keywords.keyword = dupkeywords3.keyword AND  keywords.domain = 
> dupkeywords3.domain AND keywords.search_engine_type = 
> dupkeywords3.search_engine_type AND keywords.week = dupkeywords3.week AND 
> keywords.date_indexed = dupkeywords3.max_date_indexed AND keywords.rank = 
> dupkeywords3.best_rank;
>  
> This query used to work fine until I updated to r991183 on trunk and started 
> getting this error:
> java.io.IOException: cannot find dir = 
> hdfs://ec2-75-101-174-245.compute-1.amazonaws.com/tmp/hive-root/hive_2010-09-01_10-57-41_396_1409145025949924904/-mr-10002/00_0
>  in 
> partToPartitionInfo: 
> [hdfs://ec2-75-101-174-245.compute-1.amazonaws.com:8020/tmp/hive-root/hive_2010-09-01_10-57-41_396_1409145025949924904/-mr-10002,
> hdfs://ec2-75-101-174-245.compute-1.amazonaws.com/user/root/domain_keywords/account=417/week=201035/day=20100829,
> hdfs://ec2-75-101-174-245.compute-1.amazonaws.com/user/root/domain_keywords/account=418/week=201035/day=20100829,
> hdfs://ec2-75-101-174-245.compute-1.amazonaws.com/user/root/domain_keywords/account=419/week=201035/day=20100829,
> hdfs://ec2-75-101-174-245.compute-1.amazonaws.com/user/root/domain_keywords/account=422/week=201035/day=20100829,
> hdfs://ec2-75-101-174-245.compute-1.amazonaws.com/user/root/domain_keywords/account=422/week=201035/day=20100831]
> at 
> org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getPartitionDescFromPathRecursively(HiveFileFormatUtils.java:277)
> at 
> org.apache.hadoop.hive.ql.io.CombineHiveInputFormat$CombineHiveInputSplit.(CombineHiveInputFormat.java:100)
> at 
> org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getSplits(CombineHiveInputFormat.java:312)
> at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
> at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
> at org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:610)
> at org.apache.hadoop.hive.ql.exec.MapRedTask.execute(MapRedTask.java:120)
> at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:108)
> This query works if I don't change the hive.input.format.
> set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
> I've narrowed down this issue to the commit for HIVE-1510.  If I take out the 
> changeset from r987746, everything works as before.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-1609) Support partition filtering in metastore

2010-09-02 Thread Ajay Kidave (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905743#action_12905743
 ] 

Ajay Kidave commented on HIVE-1609:
---

The parser was written in JavaCC since it is derived from similar functionality 
in Owl. It was decided to reuse the existing parser when the filter 
representation was discussed. If generated code is the issue, I can change the 
build to pull JavaCC through Ivy and not have the generated code checked in (it 
is checked in currently because that is how it is done in serde). Another 
possibility is that we can open another JIRA to change the parser implementation 
to ANTLR. Do let me know what would work.

> Support partition filtering in metastore
> 
>
> Key: HIVE-1609
> URL: https://issues.apache.org/jira/browse/HIVE-1609
> Project: Hadoop Hive
>  Issue Type: New Feature
>  Components: Metastore
>Reporter: Ajay Kidave
> Fix For: 0.7.0
>
> Attachments: hive_1609.patch, hive_1609_2.patch
>
>
> The metastore needs to have support for returning a list of partitions based 
> on user specified filter conditions. This will be useful for tools which need 
> to do partition pruning. Howl is one such use case. The way partition pruning 
> is done during hive query execution need not be changed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (HIVE-849) .. not supported

2010-09-02 Thread Namit Jain (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Namit Jain resolved HIVE-849.
-

Hadoop Flags: [Reviewed]
  Resolution: Duplicate

OK - sounds good

> .. not supported
> 
>
> Key: HIVE-849
> URL: https://issues.apache.org/jira/browse/HIVE-849
> Project: Hadoop Hive
>  Issue Type: New Feature
>Reporter: Namit Jain
>Assignee: Carl Steinbach
>


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-849) .. not supported

2010-09-02 Thread Carl Steinbach (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905719#action_12905719
 ] 

Carl Steinbach commented on HIVE-849:
-

@Namit: Correct, but this issue is also covered by HIVE-1517, and the comments 
in that ticket provide more details, so I decided to resolve this ticket as a 
duplicate of HIVE-1517.

> .. not supported
> 
>
> Key: HIVE-849
> URL: https://issues.apache.org/jira/browse/HIVE-849
> Project: Hadoop Hive
>  Issue Type: New Feature
>Reporter: Namit Jain
>Assignee: Carl Steinbach
>


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-1546) Ability to plug custom Semantic Analyzers for Hive Grammar

2010-09-02 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905711#action_12905711
 ] 

Alan Gates commented on HIVE-1546:
--

Using the definitions given in HADOOP-5073, can we call this interface limited 
private and evolving?  We (the Howl team) know it will continue to change, and 
we understand Hive's desire not to make this a public API.  But checking Howl 
code into Hive just muddles things and makes our build and release process 
harder.  

> Ability to plug custom Semantic Analyzers for Hive Grammar
> --
>
> Key: HIVE-1546
> URL: https://issues.apache.org/jira/browse/HIVE-1546
> Project: Hadoop Hive
>  Issue Type: Improvement
>  Components: Metastore
>Affects Versions: 0.7.0
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Fix For: 0.7.0
>
> Attachments: hive-1546-3.patch, hive-1546-4.patch, hive-1546.patch, 
> hive-1546_2.patch
>
>
> It will be useful if Semantic Analysis phase is made pluggable such that 
> other projects can do custom analysis of hive queries before doing metastore 
> operations on them. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Reopened: (HIVE-849) .. not supported

2010-09-02 Thread Namit Jain (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Namit Jain reopened HIVE-849:
-

  Assignee: Carl Steinbach  (was: He Yongqiang)

@Carl, I think this referred to the ability to select a table from database1 
while using database2.

> .. not supported
> 
>
> Key: HIVE-849
> URL: https://issues.apache.org/jira/browse/HIVE-849
> Project: Hadoop Hive
>  Issue Type: New Feature
>Reporter: Namit Jain
>Assignee: Carl Steinbach
>


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-1546) Ability to plug custom Semantic Analyzers for Hive Grammar

2010-09-02 Thread Carl Steinbach (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905707#action_12905707
 ] 

Carl Steinbach commented on HIVE-1546:
--

I'm +1 on the approach outlined by John.

> Ability to plug custom Semantic Analyzers for Hive Grammar
> --
>
> Key: HIVE-1546
> URL: https://issues.apache.org/jira/browse/HIVE-1546
> Project: Hadoop Hive
>  Issue Type: Improvement
>  Components: Metastore
>Affects Versions: 0.7.0
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Fix For: 0.7.0
>
> Attachments: hive-1546-3.patch, hive-1546-4.patch, hive-1546.patch, 
> hive-1546_2.patch
>
>
> It will be useful if Semantic Analysis phase is made pluggable such that 
> other projects can do custom analysis of hive queries before doing metastore 
> operations on them. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-1546) Ability to plug custom Semantic Analyzers for Hive Grammar

2010-09-02 Thread John Sichi (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905706#action_12905706
 ] 

John Sichi commented on HIVE-1546:
--

That's fine with me if it doesn't drag in unrelated dependencies.  I would vote 
for contrib, with the plugin mechanism remaining the same as Ashutosh has 
defined it, but with the config parameter explicitly defining it as intended 
for internal use only for now.

Ashutosh, could you run this proposal by the Howl team and see if that is 
acceptable?


> Ability to plug custom Semantic Analyzers for Hive Grammar
> --
>
> Key: HIVE-1546
> URL: https://issues.apache.org/jira/browse/HIVE-1546
> Project: Hadoop Hive
>  Issue Type: Improvement
>  Components: Metastore
>Affects Versions: 0.7.0
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Fix For: 0.7.0
>
> Attachments: hive-1546-3.patch, hive-1546-4.patch, hive-1546.patch, 
> hive-1546_2.patch
>
>
> It will be useful if Semantic Analysis phase is made pluggable such that 
> other projects can do custom analysis of hive queries before doing metastore 
> operations on them. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-1467) dynamic partitioning should cluster by partitions

2010-09-02 Thread Ning Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905700#action_12905700
 ] 

Ning Zhang commented on HIVE-1467:
--

As discussed with Joydeep and Ashish, it seems we should use the "distribute 
by" mechanism rather than "cluster by" to avoid sorting on the reducer side. 
The difference between them is that "distribute by" only sets the MapReduce 
partition columns to the dynamic partition columns, whereas "cluster by" will 
additionally set the "key columns" (sort columns) to the dynamic partition columns as well.

So I think we can use 2 modes of reducer-side DP with tradeoffs:
  -- distribute by mode: no sorting, but reducers have to keep all files open 
during the DP insert. A good choice when there is a large amount of data passed from 
mappers to reducers.
  -- cluster by mode: sorting by the DP columns, but we can close a DP file 
once FileSinkOperator sees a different DP column value (a small sketch of this 
follows). A good choice when the total data size is not that large but a large 
number of DPs is generated.
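
To make the cluster-by mode concrete, here is a tiny self-contained sketch (not Hive 
code; the key format and values are made up) of how sorted input lets a reducer close 
each DP file as soon as the key changes:

import java.util.Arrays;
import java.util.List;

// Rows arrive sorted by their dynamic partition key, so only one writer needs to be
// open at a time.
public class ClusterByDpSketch {
  public static void main(String[] args) {
    List<String[]> rows = Arrays.asList(        // {dpKey, value}, already sorted by dpKey
        new String[]{"week=201034", "a"},
        new String[]{"week=201034", "b"},
        new String[]{"week=201035", "c"});
    String open = null;
    for (String[] r : rows) {
      if (!r[0].equals(open)) {
        if (open != null) System.out.println("close writer for " + open);
        System.out.println("open writer for " + r[0]);
        open = r[0];
      }
      System.out.println("  write " + r[1] + " to " + r[0]);
    }
    if (open != null) System.out.println("close writer for " + open);
  }
}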

> dynamic partitioning should cluster by partitions
> -
>
> Key: HIVE-1467
> URL: https://issues.apache.org/jira/browse/HIVE-1467
> Project: Hadoop Hive
>  Issue Type: Improvement
>Reporter: Joydeep Sen Sarma
>Assignee: Namit Jain
>
> (based on internal discussion with Ning). Dynamic partitioning should offer a 
> mode where it clusters data by partition before writing out to each 
> partition. This will reduce number of files. Details:
> 1. always use reducer stage
> 2. mapper sends to reducer based on partitioning column. ie. reducer = 
> f(partition-cols)
> 3. f() can be made somewhat smart to:
>a. spread large partitions across multiple reducers - each mapper can 
> maintain row count seen per partition - and then apply (whenever it sees a 
> new row for a partition): 
>* reducer = (row count / 64k) % numReducers 
>Small partitions always go to one reducer. the larger the partition, 
> the more the reducers. this prevents one reducer becoming bottleneck writing 
> out one partition
>b. this still leaves the issue of very large number of splits. (64K rows 
> from 10K mappers is pretty large). for this one can apply one slight 
> modification:
>* reducer = (mapper-id/1024 + row-count/64k) % numReducers
>ie. - the first 1000 mappers always send the first 64K rows for one 
> partition to the same reducer. the next 1000 send it to the next one. and so 
> on.
> the constants 1024 and 64k are used just as an example. i don't know what the 
> right numbers are. it's also clear that this is a case where we need hadoop 
> to do only partitioning (and no sorting). this will be a useful feature to 
> have in hadoop. that will reduce the overhead due to reducers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-1546) Ability to plug custom Semantic Analyzers for Hive Grammar

2010-09-02 Thread Carl Steinbach (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905692#action_12905692
 ] 

Carl Steinbach commented on HIVE-1546:
--

What do you think of this option: we check the Howl SemanticAnalyzer into the 
Hive source tree and provide a config option that optionally enables it? This 
gives Howl the features they need without making the SemanticAnalyzer API 
public.


> Ability to plug custom Semantic Analyzers for Hive Grammar
> --
>
> Key: HIVE-1546
> URL: https://issues.apache.org/jira/browse/HIVE-1546
> Project: Hadoop Hive
>  Issue Type: Improvement
>  Components: Metastore
>Affects Versions: 0.7.0
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Fix For: 0.7.0
>
> Attachments: hive-1546-3.patch, hive-1546-4.patch, hive-1546.patch, 
> hive-1546_2.patch
>
>
> It will be useful if Semantic Analysis phase is made pluggable such that 
> other projects can do custom analysis of hive queries before doing metastore 
> operations on them. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HIVE-1130) Create argmin and argmax

2010-09-02 Thread John Sichi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Sichi updated HIVE-1130:
-

Status: Open  (was: Patch Available)

> Create argmin and argmax
> 
>
> Key: HIVE-1130
> URL: https://issues.apache.org/jira/browse/HIVE-1130
> Project: Hadoop Hive
>  Issue Type: Improvement
>Affects Versions: 0.7.0
>Reporter: Zheng Shao
>Assignee: Pierre Huyn
> Fix For: 0.7.0
>
> Attachments: HIVE-1130.1.patch, HIVE-1130.2.patch
>
>
> With HIVE-1128, users can already do what argmax and argmin does.
> But it will be helpful if we provide these functions explicitly so people 
> from maths/stats background can use it more easily.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-1546) Ability to plug custom Semantic Analyzers for Hive Grammar

2010-09-02 Thread John Sichi (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905687#action_12905687
 ] 

John Sichi commented on HIVE-1546:
--

It's the usual tradeoffs on copy-and-paste vs factoring.  There's a significant 
amount of DDL processing code which can be shared, and that will continue to 
grow as we add new features (e.g. GRANT/REVOKE) which are applicable to both.


> Ability to plug custom Semantic Analyzers for Hive Grammar
> --
>
> Key: HIVE-1546
> URL: https://issues.apache.org/jira/browse/HIVE-1546
> Project: Hadoop Hive
>  Issue Type: Improvement
>  Components: Metastore
>Affects Versions: 0.7.0
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Fix For: 0.7.0
>
> Attachments: hive-1546-3.patch, hive-1546-4.patch, hive-1546.patch, 
> hive-1546_2.patch
>
>
> It will be useful if Semantic Analysis phase is made pluggable such that 
> other projects can do custom analysis of hive queries before doing metastore 
> operations on them. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-1546) Ability to plug custom Semantic Analyzers for Hive Grammar

2010-09-02 Thread Carl Steinbach (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905686#action_12905686
 ] 

Carl Steinbach commented on HIVE-1546:
--

bq. we've agreed at the high level on the approach of creating Howl as a 
wrapper around Hive

I thought Howl was supposed to be a wrapper around (and replacement for) the 
Hive metastore, not all of Hive.

I think there are clear advantages to Hive and Howl sharing the same metastore 
code as long as they access this facility through the public API, but I can't say 
the same for the two projects using the same CLI code if it means allowing 
external projects to depend on a loosely defined set of internal APIs. What 
benefits are we hoping to achieve by having Howl and Hive share the same CLI 
code, especially if Howl is only interested in a small part of it? What are the 
drawbacks of instead encouraging the Howl project to copy the CLI code and 
maintain their own version?


> Ability to plug custom Semantic Analyzers for Hive Grammar
> --
>
> Key: HIVE-1546
> URL: https://issues.apache.org/jira/browse/HIVE-1546
> Project: Hadoop Hive
>  Issue Type: Improvement
>  Components: Metastore
>Affects Versions: 0.7.0
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Fix For: 0.7.0
>
> Attachments: hive-1546-3.patch, hive-1546-4.patch, hive-1546.patch, 
> hive-1546_2.patch
>
>
> It will be useful if Semantic Analysis phase is made pluggable such that 
> other projects can do custom analysis of hive queries before doing metastore 
> operations on them. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-1609) Support partition filtering in metastore

2010-09-02 Thread John Sichi (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905679#action_12905679
 ] 

John Sichi commented on HIVE-1609:
--

I agree with Carl regarding the parser:  let's move it to ANTLR.  We have too 
much generated code checked into Hive already, and we're trying to move away 
from that.


> Support partition filtering in metastore
> 
>
> Key: HIVE-1609
> URL: https://issues.apache.org/jira/browse/HIVE-1609
> Project: Hadoop Hive
>  Issue Type: New Feature
>  Components: Metastore
>Reporter: Ajay Kidave
> Fix For: 0.7.0
>
> Attachments: hive_1609.patch, hive_1609_2.patch
>
>
> The metastore needs to have support for returning a list of partitions based 
> on user specified filter conditions. This will be useful for tools which need 
> to do partition pruning. Howl is one such use case. The way partition pruning 
> is done during hive query execution need not be changed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Build Crashing on Hive 0.5 Release

2010-09-02 Thread Edward Capriolo
On Thu, Sep 2, 2010 at 5:12 PM, Stephen Watt  wrote:
> Hi Folks
>
> I'm a Hadoop contributor and am presently working to get both Hadoop and
> Hive running on alternate JREs such as Apache Harmony and IBM Java.
>
> I noticed when building and running the functional tests ("clean test
> tar") for the Hive 0.5 release (i.e. not nightly build) , the build
> crashes right after running
> org.apache.hadoop.hive.ql.tool.TestLineageInfo. In addition, the
> TestCLIDriver Test Case fails as well. This is all using SUN JDK 1.60_14.
> I'm running on a SLES 10 system.
>
> This is a little odd, given that this is a release and not a nightly
> build. Although, it's not uncommon for me to see Hudson pass tests that
> fail when running locally. Can someone confirm the build works for them?
>
> This is my build script:
>
> #!/bin/sh
>
> # Set Build Dependencies
> set PATH=$PATH:/home/hive/Java-Versions/jdk1.6.0_14/bin/
> export ANT_HOME=/home/hive/Test-Dependencies/apache-ant-1.7.1
> export JAVA_HOME=/home/hive/Java-Versions/jdk1.6.0_14
> export BUILD_DIR=/home/hive/hive-0.5.0-build
> export HIVE_BUILD=$BUILD_DIR/build
> export HIVE_INSTALL=$BUILD_DIR/hive-0.5.0-dev/
> export HIVE_SRC=$HIVE_INSTALL/src
> export PATH=$PATH:$ANT_HOME/bin
>
> # Define Hadoop Version to Use
> HADOOP_VER=0.20.2
>
> # Run Build and Unit Test
> cd $HIVE_SRC
> ant -Dtarget.dir=$HIVE_BUILD -Dhadoop.version=$HADOOP_VER clean test tar >
> $BUILD_DIR/hiveSUN32Build.out
>
>
> Regards
> Steve Watt

I seem to remember there were some older bugs when specifying the
minor versions of the 0.20 branch.
Can you try:

HADOOP_VER=0.20.0

Rather than:

HADOOP_VER=0.20.2


[jira] Commented: (HIVE-1546) Ability to plug custom Semantic Analyzers for Hive Grammar

2010-09-02 Thread John Sichi (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905675#action_12905675
 ] 

John Sichi commented on HIVE-1546:
--

New dependencies:   we don't prevent anyone from using it, but we can Javadoc 
it as unstable.  We can work out the language now in an updated patch since 
there's currently no Javadoc on the factory interface.

Dependencies on AST/ANTLR:  it does make such changes more expensive in terms 
of impact analysis and migration, but it doesn't really prevent us in any way, 
does it?

Given that we've agreed at the high level on the approach of creating Howl as a 
wrapper around Hive (reusing as much as possible of what's already there), can 
you suggest an alternative mechanism that addresses the requirements while 
minimizing the injection of Howl behavior directly into Hive itself?  If it 
were something generic like a bitmask of allowed operations, I could kind of 
see it, but the validation logic is more involved than that (and may become 
even more so over time).  I wasn't able to come up with anything clean on that 
front myself, which is why I suggested the factoring approach to Pradeep 
originally.  Apologies for not getting stuff aired out sooner.


> Ability to plug custom Semantic Analyzers for Hive Grammar
> --
>
> Key: HIVE-1546
> URL: https://issues.apache.org/jira/browse/HIVE-1546
> Project: Hadoop Hive
>  Issue Type: Improvement
>  Components: Metastore
>Affects Versions: 0.7.0
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Fix For: 0.7.0
>
> Attachments: hive-1546-3.patch, hive-1546-4.patch, hive-1546.patch, 
> hive-1546_2.patch
>
>
> It will be useful if Semantic Analysis phase is made pluggable such that 
> other projects can do custom analysis of hive queries before doing metastore 
> operations on them. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Build Crashing on Hive 0.5 Release

2010-09-02 Thread Stephen Watt
Hi Folks

I'm a Hadoop contributor and am presently working to get both Hadoop and 
Hive running on alternate JREs such as Apache Harmony and IBM Java. 

I noticed when building and running the functional tests ("clean test 
tar") for the Hive 0.5 release (i.e. not nightly build) , the build 
crashes right after running 
org.apache.hadoop.hive.ql.tool.TestLineageInfo. In addition, the 
TestCLIDriver Test Case fails as well. This is all using SUN JDK 1.60_14. 
I'm running on a SLES 10 system.

This is a little odd, given that this is a release and not a nightly 
build. Although, it's not uncommon for me to see Hudson pass tests that 
fail when running locally. Can someone confirm the build works for them?

This is my build script:

#!/bin/sh

# Set Build Dependencies
set PATH=$PATH:/home/hive/Java-Versions/jdk1.6.0_14/bin/
export ANT_HOME=/home/hive/Test-Dependencies/apache-ant-1.7.1
export JAVA_HOME=/home/hive/Java-Versions/jdk1.6.0_14
export BUILD_DIR=/home/hive/hive-0.5.0-build
export HIVE_BUILD=$BUILD_DIR/build
export HIVE_INSTALL=$BUILD_DIR/hive-0.5.0-dev/
export HIVE_SRC=$HIVE_INSTALL/src
export PATH=$PATH:$ANT_HOME/bin

# Define Hadoop Version to Use
HADOOP_VER=0.20.2

# Run Build and Unit Test
cd $HIVE_SRC
ant -Dtarget.dir=$HIVE_BUILD -Dhadoop.version=$HADOOP_VER clean test tar > 
$BUILD_DIR/hiveSUN32Build.out


Regards
Steve Watt 

[jira] Updated: (HIVE-1609) Support partition filtering in metastore

2010-09-02 Thread Ajay Kidave (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ajay Kidave updated HIVE-1609:
--

  Status: Patch Available  (was: Open)
Release Note: Added support for a new listPartitionsByFilter API in 
HiveMetaStoreClient. This returns the list of partitions matching a specified 
partition filter. The filter supports "=", "!=", ">", "<", ">=", "<=" and 
"LIKE" operations on partition keys of type string. "AND" and "OR" logical 
operations are supported in the filter. So for example, for a table having 
partition keys country and state, the filter can be 'country = "USA" AND (state 
= "CA"  OR state = "AZ")'

> Support partition filtering in metastore
> 
>
> Key: HIVE-1609
> URL: https://issues.apache.org/jira/browse/HIVE-1609
> Project: Hadoop Hive
>  Issue Type: New Feature
>  Components: Metastore
>Reporter: Ajay Kidave
> Fix For: 0.7.0
>
> Attachments: hive_1609.patch, hive_1609_2.patch
>
>
> The metastore needs to have support for returning a list of partitions based 
> on user specified filter conditions. This will be useful for tools which need 
> to do partition pruning. Howl is one such use case. The way partition pruning 
> is done during hive query execution need not be changed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-1546) Ability to plug custom Semantic Analyzers for Hive Grammar

2010-09-02 Thread Carl Steinbach (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905665#action_12905665
 ] 

Carl Steinbach commented on HIVE-1546:
--

bq. Can we get agreement from the Howl team that even though we're introducing 
this dependency now, we will not let its existence hinder future semantic 
analyzer refactoring within Hive?

What about other projects that use this feature? How do we get them to agree to 
this, or how do we prevent them from using it? The new configuration property 
is documented in hive-default.xml, which implies that it's open to everyone.

bq. one possible refinement would be to limit the public interface to just 
validation (as opposed to full semantic analysis). In that case, we would have 
HiveStmtValidatorFactory producing HiveStmtValidator with just a single method 
validate().

This reduces the scope of the dependency, but doesn't eliminate it. Plugins 
would presumably depend on the structure of the AST that they are trying to 
validate, which in turn would limit our ability to refactor the grammar or to 
replace ANTLR with another parser generator.

> Ability to plug custom Semantic Analyzers for Hive Grammar
> --
>
> Key: HIVE-1546
> URL: https://issues.apache.org/jira/browse/HIVE-1546
> Project: Hadoop Hive
>  Issue Type: Improvement
>  Components: Metastore
>Affects Versions: 0.7.0
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Fix For: 0.7.0
>
> Attachments: hive-1546-3.patch, hive-1546-4.patch, hive-1546.patch, 
> hive-1546_2.patch
>
>
> It will be useful if Semantic Analysis phase is made pluggable such that 
> other projects can do custom analysis of hive queries before doing metastore 
> operations on them. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-1546) Ability to plug custom Semantic Analyzers for Hive Grammar

2010-09-02 Thread John Sichi (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905656#action_12905656
 ] 

John Sichi commented on HIVE-1546:
--

For the last sentence, I meant "If Howl's CLI customized behavior is going to 
need to influence more than just validation"

> Ability to plug custom Semantic Analyzers for Hive Grammar
> --
>
> Key: HIVE-1546
> URL: https://issues.apache.org/jira/browse/HIVE-1546
> Project: Hadoop Hive
>  Issue Type: Improvement
>  Components: Metastore
>Affects Versions: 0.7.0
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Fix For: 0.7.0
>
> Attachments: hive-1546-3.patch, hive-1546-4.patch, hive-1546.patch, 
> hive-1546_2.patch
>
>
> It will be useful if Semantic Analysis phase is made pluggable such that 
> other projects can do custom analysis of hive queries before doing metastore 
> operations on them. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-1546) Ability to plug custom Semantic Analyzers for Hive Grammar

2010-09-02 Thread John Sichi (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905655#action_12905655
 ] 

John Sichi commented on HIVE-1546:
--

@Carl:  I understand your concern, but this seemed like the least intrusive 
approach as opposed to continually patching Hive to refine what Howl's CLI 
wants to support at a given point in time (which really has nothing to do with 
Hive).  The override approach allows that behavior to be factored completely 
out into Howl.  A number of our existing extensibility interfaces (e.g. 
StorageHandler) already have similar issues regarding impact from continual 
refactoring, so I expect an across-the-board SPI stabilization effort to be 
required in the future (with corresponding migrations from old to new).  This 
will need to be part of that effort.

@Ashutosh:  I hit the hang you mentioned, so I can retry tests with your latest 
patch.  But let's resolve the approach with Carl first.  In particular, can we 
get agreement from the Howl team that even though we're introducing this 
dependency now, we will not let its existence hinder future semantic analyzer 
refactoring within Hive?  As long as we all stay in frequent communication, we 
can make that work.

@Both:  one possible refinement would be to limit the public interface to just 
validation (as opposed to full semantic analysis).  In that case, we would have 
HiveStmtValidatorFactory producing HiveStmtValidator with just a single method 
validate().  This would also remove the unpleasantness of having a factory 
returning a base class rather than an interface.  However, if CLI is going to 
need to do more than just validation, then this isn't good enough.
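
(For concreteness, a minimal sketch of the validation-only shape floated above. Only the 
names HiveStmtValidatorFactory, HiveStmtValidator, and validate() come from this thread; 
the parameter and exception types below are illustrative assumptions, not part of any 
committed patch.)

{noformat}
// Hypothetical sketch of the validation-only plug-in surface; parameter and
// exception types are assumed for illustration.
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.ql.parse.ASTNode;
import org.apache.hadoop.hive.ql.parse.SemanticException;

// (In practice these would be two separate source files.)
public interface HiveStmtValidator {
  void validate(ASTNode ast) throws SemanticException;
}

interface HiveStmtValidatorFactory {
  HiveStmtValidator createValidator(HiveConf conf);
}
{noformat}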


> Ability to plug custom Semantic Analyzers for Hive Grammar
> --
>
> Key: HIVE-1546
> URL: https://issues.apache.org/jira/browse/HIVE-1546
> Project: Hadoop Hive
>  Issue Type: Improvement
>  Components: Metastore
>Affects Versions: 0.7.0
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Fix For: 0.7.0
>
> Attachments: hive-1546-3.patch, hive-1546-4.patch, hive-1546.patch, 
> hive-1546_2.patch
>
>
> It will be useful if Semantic Analysis phase is made pluggable such that 
> other projects can do custom analysis of hive queries before doing metastore 
> operations on them. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HIVE-1609) Support partition filtering in metastore

2010-09-02 Thread Ajay Kidave (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ajay Kidave updated HIVE-1609:
--

Attachment: hive_1609_2.patch

Thanks for the review Carl. Javacc is already used in the Hive serde code, so 
it is not a completely new dependency for Hive. Javacc has issues with 
generating proper errors for multi-line inputs, but since we are using it only 
for a small filter string, this issue should not be seen. The build approach is 
the same as the one taken in serde, i.e. the code is regenerated only if 
javacc.home is defined.

Regarding throwing Unknown[DB|Table]Exception, it would require an extra 
database call to first check whether the database is valid. So I have changed 
it to throw a NoSuchObjectException saying db.table does not exist if the 
getMTable operation fails.

I have attached a patch which addresses the other issues.
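
For the filter strings themselves, here is a hedged usage sketch. It assumes the 
client-side call ends up looking roughly like listPartitionsByFilter(dbName, tableName, 
filter, maxParts); the exact method name and signature in the committed patch may differ, 
and the table and partition key names below are made up.

{noformat}
// Hedged sketch; method name/signature assumed, table and key names invented.
import java.util.List;

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.Partition;

public class PartitionFilterExample {
  public static void main(String[] args) throws Exception {
    HiveMetaStoreClient client = new HiveMetaStoreClient(new HiveConf());
    // The filter is a short, single-line expression over partition keys, which
    // is why the javacc multi-line error-reporting limitation mentioned above
    // should not matter here.
    List<Partition> parts = client.listPartitionsByFilter(
        "default", "page_views", "ds = \"2010-09-02\" and hr > \"10\"", (short) -1);
    System.out.println("Matched partitions: " + parts.size());
  }
}
{noformat}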

> Support partition filtering in metastore
> 
>
> Key: HIVE-1609
> URL: https://issues.apache.org/jira/browse/HIVE-1609
> Project: Hadoop Hive
>  Issue Type: New Feature
>  Components: Metastore
>Reporter: Ajay Kidave
> Fix For: 0.7.0
>
> Attachments: hive_1609.patch, hive_1609_2.patch
>
>
> The metastore needs to have support for returning a list of partitions based 
> on user specified filter conditions. This will be useful for tools which need 
> to do partition pruning. Howl is one such use case. The way partition pruning 
> is done during hive query execution need not be changed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-1546) Ability to plug custom Semantic Analyzers for Hive Grammar

2010-09-02 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905640#action_12905640
 ] 

Ashutosh Chauhan commented on HIVE-1546:


@Carl,

* Ya, the main motivating use case is to provide an alternate DDL CLI tool 
(hopefully not a crippled one *smiles*). The reason for that is to enforce 
certain use-cases on DDL commands in the Howl CLI. More details on that are 
here: http://wiki.apache.org/pig/Howl/HowlCliFuncSpec If you have questions 
about why we are making such decisions in Howl, I would encourage you to post 
them on the howl-dev list and we can discuss them there. howl...@yahoogroups.com
* I don't understand what you mean by making "SemanticAnalyzer a public API". 
This patch just lets other tools do some semantic analysis of the query and 
then use Hive to do further processing (if the tool chooses to do so). The 
important point here is *other tools*. This in no way enforces any changes to 
Hive behavior. Hive can continue to have its own semantic analyzer and do any 
sort of semantic analysis of the query. Hive is making no guarantees to any 
tool.
* Hive doesn't care about INPUTDRIVER and OUTPUTDRIVER, and this patch isn't 
asking it to. I don't see any way that it provides a mechanism for defining 
tables in the MetaStore that Hive can't read or write to.

@John,
Do you want me to make any further changes, or are we good to go?

> Ability to plug custom Semantic Analyzers for Hive Grammar
> --
>
> Key: HIVE-1546
> URL: https://issues.apache.org/jira/browse/HIVE-1546
> Project: Hadoop Hive
>  Issue Type: Improvement
>  Components: Metastore
>Affects Versions: 0.7.0
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Fix For: 0.7.0
>
> Attachments: hive-1546-3.patch, hive-1546-4.patch, hive-1546.patch, 
> hive-1546_2.patch
>
>
> It will be useful if Semantic Analysis phase is made pluggable such that 
> other projects can do custom analysis of hive queries before doing metastore 
> operations on them. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HIVE-1610) Using CombinedHiveInputFormat causes partToPartitionInfo IOException

2010-09-02 Thread Sammy Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sammy Yu updated HIVE-1610:
---

Attachment: 0003-HIVE-1610.patch

> Using CombinedHiveInputFormat causes partToPartitionInfo IOException  
> --
>
> Key: HIVE-1610
> URL: https://issues.apache.org/jira/browse/HIVE-1610
> Project: Hadoop Hive
>  Issue Type: Bug
> Environment: Hadoop 0.20.2
>Reporter: Sammy Yu
> Attachments: 
> 0002-HIVE-1610.-Added-additional-schema-check-to-doGetPar.patch, 
> 0003-HIVE-1610.patch
>
>
> I have a relatively complicated hive query using CombinedHiveInputFormat:
> set hive.exec.dynamic.partition.mode=nonstrict;
> set hive.exec.dynamic.partition=true; 
> set hive.exec.max.dynamic.partitions=1000;
> set hive.exec.max.dynamic.partitions.pernode=300;
> set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
> INSERT OVERWRITE TABLE keyword_serp_results_no_dups PARTITION(week) select 
> distinct keywords.keyword, keywords.domain, keywords.url, keywords.rank, 
> keywords.universal_rank, keywords.serp_type, keywords.date_indexed, 
> keywords.search_engine_type, keywords.week from keyword_serp_results keywords 
> JOIN (select domain, keyword, search_engine_type, week, max_date_indexed, 
> min(rank) as best_rank from (select keywords1.domain, keywords1.keyword, 
> keywords1.search_engine_type,  keywords1.week, keywords1.rank, 
> dupkeywords1.max_date_indexed from keyword_serp_results keywords1 JOIN 
> (select domain, keyword, search_engine_type, week, max(date_indexed) as 
> max_date_indexed from keyword_serp_results group by 
> domain,keyword,search_engine_type,week) dupkeywords1 on keywords1.keyword = 
> dupkeywords1.keyword AND  keywords1.domain = dupkeywords1.domain AND 
> keywords1.search_engine_type = dupkeywords1.search_engine_type AND 
> keywords1.week = dupkeywords1.week AND keywords1.date_indexed = 
> dupkeywords1.max_date_indexed) dupkeywords2 group by 
> domain,keyword,search_engine_type,week,max_date_indexed ) dupkeywords3 on 
> keywords.keyword = dupkeywords3.keyword AND  keywords.domain = 
> dupkeywords3.domain AND keywords.search_engine_type = 
> dupkeywords3.search_engine_type AND keywords.week = dupkeywords3.week AND 
> keywords.date_indexed = dupkeywords3.max_date_indexed AND keywords.rank = 
> dupkeywords3.best_rank;
>  
> This query used to work fine until I updated to r991183 on trunk and started 
> getting this error:
> java.io.IOException: cannot find dir = 
> hdfs://ec2-75-101-174-245.compute-1.amazonaws.com/tmp/hive-root/hive_2010-09-01_10-57-41_396_1409145025949924904/-mr-10002/00_0
>  in 
> partToPartitionInfo: 
> [hdfs://ec2-75-101-174-245.compute-1.amazonaws.com:8020/tmp/hive-root/hive_2010-09-01_10-57-41_396_1409145025949924904/-mr-10002,
> hdfs://ec2-75-101-174-245.compute-1.amazonaws.com/user/root/domain_keywords/account=417/week=201035/day=20100829,
> hdfs://ec2-75-101-174-245.compute-1.amazonaws.com/user/root/domain_keywords/account=418/week=201035/day=20100829,
> hdfs://ec2-75-101-174-245.compute-1.amazonaws.com/user/root/domain_keywords/account=419/week=201035/day=20100829,
> hdfs://ec2-75-101-174-245.compute-1.amazonaws.com/user/root/domain_keywords/account=422/week=201035/day=20100829,
> hdfs://ec2-75-101-174-245.compute-1.amazonaws.com/user/root/domain_keywords/account=422/week=201035/day=20100831]
> at 
> org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getPartitionDescFromPathRecursively(HiveFileFormatUtils.java:277)
> at 
> org.apache.hadoop.hive.ql.io.CombineHiveInputFormat$CombineHiveInputSplit.(CombineHiveInputFormat.java:100)
> at 
> org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getSplits(CombineHiveInputFormat.java:312)
> at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
> at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
> at org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:610)
> at org.apache.hadoop.hive.ql.exec.MapRedTask.execute(MapRedTask.java:120)
> at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:108)
> This query works if I don't change the hive.input.format.
> set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
> I've narrowed down this issue to the commit for HIVE-1510.  If I take out the 
> changeset from r987746, everything works as before.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-1610) Using CombinedHiveInputFormat causes partToPartitionInfo IOException

2010-09-02 Thread Sammy Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905635#action_12905635
 ] 

Sammy Yu commented on HIVE-1610:


Yongqiang, thanks for taking a look at this.

If I take out the URI scheme checks, the original 
TestHiveFileFormatUtils.testGetPartitionDescFromPathRecursively test case fails:

[junit] Running org.apache.hadoop.hive.ql.io.TestHiveFileFormatUtils
[junit] junit.framework.TestListener: tests to run: 2
[junit] junit.framework.TestListener: 
startTest(testGetPartitionDescFromPathRecursively)
[junit] junit.framework.TestListener: 
addFailure(testGetPartitionDescFromPathRecursively, 
hdfs:///tbl/par1/part2/part3 should return null expected: but was:)
[junit] junit.framework.TestListener: 
endTest(testGetPartitionDescFromPathRecursively)
[junit] junit.framework.TestListener: 
startTest(testGetPartitionDescFromPathWithPort)
[junit] junit.framework.TestListener: 
endTest(testGetPartitionDescFromPathWithPort)
[junit] Tests run: 2, Failures: 1, Errors: 0, Time elapsed: 0.091 sec
[junit] Test org.apache.hadoop.hive.ql.io.TestHiveFileFormatUtils FAILED

hdfs:///tbl/par1/part2/part3 should not match any PartitionDesc since the path 
in the map is file:///tbl/par1/part2/part3.  I will attach the svn version of 
the patch shortly.
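
(A small illustration of the distinction that test protects; the helper below is 
hypothetical and is not code from HiveFileFormatUtils or from the attached patch.)

{noformat}
// Illustration only: two paths that both carry a scheme must not be treated as
// the same location when the schemes differ, even if the rest of the path is
// identical.
import org.apache.hadoop.fs.Path;

public class SchemeMatchDemo {
  static boolean schemesConflict(Path a, Path b) {
    String sa = a.toUri().getScheme();
    String sb = b.toUri().getScheme();
    return sa != null && sb != null && !sa.equalsIgnoreCase(sb);
  }

  public static void main(String[] args) {
    Path dir = new Path("hdfs:///tbl/par1/part2/part3");
    Path fileKey = new Path("file:///tbl/par1/part2/part3");
    Path hdfsKey = new Path("hdfs://host:8020/tbl/par1/part2/part3");
    System.out.println(schemesConflict(dir, fileKey)); // true: hdfs vs file, so no match
    System.out.println(schemesConflict(dir, hdfsKey)); // false: both hdfs
  }
}
{noformat}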





> Using CombinedHiveInputFormat causes partToPartitionInfo IOException  
> --
>
> Key: HIVE-1610
> URL: https://issues.apache.org/jira/browse/HIVE-1610
> Project: Hadoop Hive
>  Issue Type: Bug
> Environment: Hadoop 0.20.2
>Reporter: Sammy Yu
> Attachments: 
> 0002-HIVE-1610.-Added-additional-schema-check-to-doGetPar.patch
>
>
> I have a relatively complicated hive query using CombinedHiveInputFormat:
> set hive.exec.dynamic.partition.mode=nonstrict;
> set hive.exec.dynamic.partition=true; 
> set hive.exec.max.dynamic.partitions=1000;
> set hive.exec.max.dynamic.partitions.pernode=300;
> set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
> INSERT OVERWRITE TABLE keyword_serp_results_no_dups PARTITION(week) select 
> distinct keywords.keyword, keywords.domain, keywords.url, keywords.rank, 
> keywords.universal_rank, keywords.serp_type, keywords.date_indexed, 
> keywords.search_engine_type, keywords.week from keyword_serp_results keywords 
> JOIN (select domain, keyword, search_engine_type, week, max_date_indexed, 
> min(rank) as best_rank from (select keywords1.domain, keywords1.keyword, 
> keywords1.search_engine_type,  keywords1.week, keywords1.rank, 
> dupkeywords1.max_date_indexed from keyword_serp_results keywords1 JOIN 
> (select domain, keyword, search_engine_type, week, max(date_indexed) as 
> max_date_indexed from keyword_serp_results group by 
> domain,keyword,search_engine_type,week) dupkeywords1 on keywords1.keyword = 
> dupkeywords1.keyword AND  keywords1.domain = dupkeywords1.domain AND 
> keywords1.search_engine_type = dupkeywords1.search_engine_type AND 
> keywords1.week = dupkeywords1.week AND keywords1.date_indexed = 
> dupkeywords1.max_date_indexed) dupkeywords2 group by 
> domain,keyword,search_engine_type,week,max_date_indexed ) dupkeywords3 on 
> keywords.keyword = dupkeywords3.keyword AND  keywords.domain = 
> dupkeywords3.domain AND keywords.search_engine_type = 
> dupkeywords3.search_engine_type AND keywords.week = dupkeywords3.week AND 
> keywords.date_indexed = dupkeywords3.max_date_indexed AND keywords.rank = 
> dupkeywords3.best_rank;
>  
> This query used to work fine until I updated to r991183 on trunk and started 
> getting this error:
> java.io.IOException: cannot find dir = 
> hdfs://ec2-75-101-174-245.compute-1.amazonaws.com/tmp/hive-root/hive_2010-09-01_10-57-41_396_1409145025949924904/-mr-10002/00_0
>  in 
> partToPartitionInfo: 
> [hdfs://ec2-75-101-174-245.compute-1.amazonaws.com:8020/tmp/hive-root/hive_2010-09-01_10-57-41_396_1409145025949924904/-mr-10002,
> hdfs://ec2-75-101-174-245.compute-1.amazonaws.com/user/root/domain_keywords/account=417/week=201035/day=20100829,
> hdfs://ec2-75-101-174-245.compute-1.amazonaws.com/user/root/domain_keywords/account=418/week=201035/day=20100829,
> hdfs://ec2-75-101-174-245.compute-1.amazonaws.com/user/root/domain_keywords/account=419/week=201035/day=20100829,
> hdfs://ec2-75-101-174-245.compute-1.amazonaws.com/user/root/domain_keywords/account=422/week=201035/day=20100829,
> hdfs://ec2-75-101-174-245.compute-1.amazonaws.com/user/root/domain_keywords/account=422/week=201035/day=20100831]
> at 
> org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getPartitionDescFromPathRecursively(HiveFileFormatUtils.java:277)
> at 
> org.apache.hadoop.hive.ql.io.CombineHiveInputFormat$CombineHiveInputSplit.(CombineHiveInputFormat.java:100)
> at 
> org.apache.hadoop.hive.ql.io.CombineHiveIn

[jira] Commented: (HIVE-1476) Hive's metastore when run as a thrift service creates directories as the service user instead of the real user issuing create table/alter table etc.

2010-09-02 Thread Carl Steinbach (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905622#action_12905622
 ] 

Carl Steinbach commented on HIVE-1476:
--

@Venkatesh: THRIFT-814 covers adding SPNEGO support to Thrift.

> Hive's metastore when run as a thrift service creates directories as the 
> service user instead of the real user issuing create table/alter table etc.
> 
>
> Key: HIVE-1476
> URL: https://issues.apache.org/jira/browse/HIVE-1476
> Project: Hadoop Hive
>  Issue Type: Bug
>Affects Versions: 0.6.0, 0.7.0
>Reporter: Pradeep Kamath
> Attachments: HIVE-1476.patch, HIVE-1476.patch.2
>
>
> If the thrift metastore service is running as the user "hive" then all table 
> directories as a result of create table are created as that user rather than 
> the user who actually issued the create table command. This is different 
> semantically from non-thrift mode (i.e. local mode) when clients directly 
> connect to the metastore. In the latter case, directories are created as the 
> real user. The thrift mode should do the same.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-1476) Hive's metastore when run as a thrift service creates directories as the service user instead of the real user issuing create table/alter table etc.

2010-09-02 Thread Carl Steinbach (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905623#action_12905623
 ] 

Carl Steinbach commented on HIVE-1476:
--

Edit: I mean THRIFT-889.

> Hive's metastore when run as a thrift service creates directories as the 
> service user instead of the real user issuing create table/alter table etc.
> 
>
> Key: HIVE-1476
> URL: https://issues.apache.org/jira/browse/HIVE-1476
> Project: Hadoop Hive
>  Issue Type: Bug
>Affects Versions: 0.6.0, 0.7.0
>Reporter: Pradeep Kamath
> Attachments: HIVE-1476.patch, HIVE-1476.patch.2
>
>
> If the thrift metastore service is running as the user "hive" then all table 
> directories as a result of create table are created as that user rather than 
> the user who actually issued the create table command. This is different 
> semantically from non-thrift mode (i.e. local mode) when clients directly 
> connect to the metastore. In the latter case, directories are created as the 
> real user. The thrift mode should do the same.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (HIVE-1611) Add alternative search-provider to Hive site

2010-09-02 Thread John Sichi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Sichi reassigned HIVE-1611:


Assignee: Alex Baranau

> Add alternative search-provider to Hive site
> 
>
> Key: HIVE-1611
> URL: https://issues.apache.org/jira/browse/HIVE-1611
> Project: Hadoop Hive
>  Issue Type: Improvement
>Reporter: Alex Baranau
>Assignee: Alex Baranau
>Priority: Minor
> Attachments: HIVE-1611.patch
>
>
> Use search-hadoop.com service to make available search in Hive sources, MLs, 
> wiki, etc.
> This was initially proposed on user mailing list. The search service was 
> already added in site's skin (common for all Hadoop related projects) before 
> so this issue is about enabling it for Hive. The ultimate goal is to use it 
> at all Hadoop's sub-projects' sites.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-1610) Using CombinedHiveInputFormat causes partToPartitionInfo IOException

2010-09-02 Thread He Yongqiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905594#action_12905594
 ] 

He Yongqiang commented on HIVE-1610:


1. Just removing
{noformat}
&& (dir.toUri().getScheme() == null || dir.toUri().getScheme().trim()
.equals(""))
{noformat}
will make things work.

2. You need to use svn (not git) to generate the patch.

> Using CombinedHiveInputFormat causes partToPartitionInfo IOException  
> --
>
> Key: HIVE-1610
> URL: https://issues.apache.org/jira/browse/HIVE-1610
> Project: Hadoop Hive
>  Issue Type: Bug
> Environment: Hadoop 0.20.2
>Reporter: Sammy Yu
> Attachments: 
> 0002-HIVE-1610.-Added-additional-schema-check-to-doGetPar.patch
>
>
> I have a relatively complicated hive query using CombinedHiveInputFormat:
> set hive.exec.dynamic.partition.mode=nonstrict;
> set hive.exec.dynamic.partition=true; 
> set hive.exec.max.dynamic.partitions=1000;
> set hive.exec.max.dynamic.partitions.pernode=300;
> set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
> INSERT OVERWRITE TABLE keyword_serp_results_no_dups PARTITION(week) select 
> distinct keywords.keyword, keywords.domain, keywords.url, keywords.rank, 
> keywords.universal_rank, keywords.serp_type, keywords.date_indexed, 
> keywords.search_engine_type, keywords.week from keyword_serp_results keywords 
> JOIN (select domain, keyword, search_engine_type, week, max_date_indexed, 
> min(rank) as best_rank from (select keywords1.domain, keywords1.keyword, 
> keywords1.search_engine_type,  keywords1.week, keywords1.rank, 
> dupkeywords1.max_date_indexed from keyword_serp_results keywords1 JOIN 
> (select domain, keyword, search_engine_type, week, max(date_indexed) as 
> max_date_indexed from keyword_serp_results group by 
> domain,keyword,search_engine_type,week) dupkeywords1 on keywords1.keyword = 
> dupkeywords1.keyword AND  keywords1.domain = dupkeywords1.domain AND 
> keywords1.search_engine_type = dupkeywords1.search_engine_type AND 
> keywords1.week = dupkeywords1.week AND keywords1.date_indexed = 
> dupkeywords1.max_date_indexed) dupkeywords2 group by 
> domain,keyword,search_engine_type,week,max_date_indexed ) dupkeywords3 on 
> keywords.keyword = dupkeywords3.keyword AND  keywords.domain = 
> dupkeywords3.domain AND keywords.search_engine_type = 
> dupkeywords3.search_engine_type AND keywords.week = dupkeywords3.week AND 
> keywords.date_indexed = dupkeywords3.max_date_indexed AND keywords.rank = 
> dupkeywords3.best_rank;
>  
> This query used to work fine until I updated to r991183 on trunk and started 
> getting this error:
> java.io.IOException: cannot find dir = 
> hdfs://ec2-75-101-174-245.compute-1.amazonaws.com/tmp/hive-root/hive_2010-09-01_10-57-41_396_1409145025949924904/-mr-10002/00_0
>  in 
> partToPartitionInfo: 
> [hdfs://ec2-75-101-174-245.compute-1.amazonaws.com:8020/tmp/hive-root/hive_2010-09-01_10-57-41_396_1409145025949924904/-mr-10002,
> hdfs://ec2-75-101-174-245.compute-1.amazonaws.com/user/root/domain_keywords/account=417/week=201035/day=20100829,
> hdfs://ec2-75-101-174-245.compute-1.amazonaws.com/user/root/domain_keywords/account=418/week=201035/day=20100829,
> hdfs://ec2-75-101-174-245.compute-1.amazonaws.com/user/root/domain_keywords/account=419/week=201035/day=20100829,
> hdfs://ec2-75-101-174-245.compute-1.amazonaws.com/user/root/domain_keywords/account=422/week=201035/day=20100829,
> hdfs://ec2-75-101-174-245.compute-1.amazonaws.com/user/root/domain_keywords/account=422/week=201035/day=20100831]
> at 
> org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getPartitionDescFromPathRecursively(HiveFileFormatUtils.java:277)
> at 
> org.apache.hadoop.hive.ql.io.CombineHiveInputFormat$CombineHiveInputSplit.(CombineHiveInputFormat.java:100)
> at 
> org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getSplits(CombineHiveInputFormat.java:312)
> at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
> at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
> at org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:610)
> at org.apache.hadoop.hive.ql.exec.MapRedTask.execute(MapRedTask.java:120)
> at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:108)
> This query works if I don't change the hive.input.format.
> set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
> I've narrowed down this issue to the commit for HIVE-1510.  If I take out the 
> changeset from r987746, everything works as before.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HIVE-1611) Add alternative search-provider to Hive site

2010-09-02 Thread Alex Baranau (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Baranau updated HIVE-1611:
---

Status: Patch Available  (was: Open)

> Add alternative search-provider to Hive site
> 
>
> Key: HIVE-1611
> URL: https://issues.apache.org/jira/browse/HIVE-1611
> Project: Hadoop Hive
>  Issue Type: Improvement
>Reporter: Alex Baranau
>Priority: Minor
> Attachments: HIVE-1611.patch
>
>
> Use search-hadoop.com service to make available search in Hive sources, MLs, 
> wiki, etc.
> This was initially proposed on user mailing list. The search service was 
> already added in site's skin (common for all Hadoop related projects) before 
> so this issue is about enabling it for Hive. The ultimate goal is to use it 
> at all Hadoop's sub-projects' sites.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HIVE-1611) Add alternative search-provider to Hive site

2010-09-02 Thread Alex Baranau (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Baranau updated HIVE-1611:
---

Attachment: HIVE-1611.patch

Attached a patch which enables the search-hadoop search service for the site

> Add alternative search-provider to Hive site
> 
>
> Key: HIVE-1611
> URL: https://issues.apache.org/jira/browse/HIVE-1611
> Project: Hadoop Hive
>  Issue Type: Improvement
>Reporter: Alex Baranau
>Priority: Minor
> Attachments: HIVE-1611.patch
>
>
> Use search-hadoop.com service to make available search in Hive sources, MLs, 
> wiki, etc.
> This was initially proposed on user mailing list. The search service was 
> already added in site's skin (common for all Hadoop related projects) before 
> so this issue is about enabling it for Hive. The ultimate goal is to use it 
> at all Hadoop's sub-projects' sites.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (HIVE-1611) Add alternative search-provider to Hive site

2010-09-02 Thread Alex Baranau (JIRA)
Add alternative search-provider to Hive site


 Key: HIVE-1611
 URL: https://issues.apache.org/jira/browse/HIVE-1611
 Project: Hadoop Hive
  Issue Type: Improvement
Reporter: Alex Baranau
Priority: Minor


Use search-hadoop.com service to make available search in Hive sources, MLs, 
wiki, etc.
This was initially proposed on user mailing list. The search service was 
already added in site's skin (common for all Hadoop related projects) before so 
this issue is about enabling it for Hive. The ultimate goal is to use it at all 
Hadoop's sub-projects' sites.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-1476) Hive's metastore when run as a thrift service creates directories as the service user instead of the real user issuing create table/alter table etc.

2010-09-02 Thread Venkatesh S (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905470#action_12905470
 ] 

Venkatesh S commented on HIVE-1476:
---

@Todd, Thrift over HTTP transport (THRIFT-814) can use kerberos over SPNEGO.

> Hive's metastore when run as a thrift service creates directories as the 
> service user instead of the real user issuing create table/alter table etc.
> 
>
> Key: HIVE-1476
> URL: https://issues.apache.org/jira/browse/HIVE-1476
> Project: Hadoop Hive
>  Issue Type: Bug
>Affects Versions: 0.6.0, 0.7.0
>Reporter: Pradeep Kamath
> Attachments: HIVE-1476.patch, HIVE-1476.patch.2
>
>
> If the thrift metastore service is running as the user "hive" then all table 
> directories as a result of create table are created as that user rather than 
> the user who actually issued the create table command. This is different 
> semantically from non-thrift mode (i.e. local mode) when clients directly 
> connect to the metastore. In the latter case, directories are created as the 
> real user. The thrift mode should do the same.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HIVE-1539) Concurrent metastore threading problem

2010-09-02 Thread Bennie Schut (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bennie Schut updated HIVE-1539:
---

Attachment: ClassLoaderResolver.patch

OK, I'm still testing it, but this is a temporary fix: we add our own synchronized 
version of the classloader.

Just make sure you add this property (name and value) and it should work:

  datanucleus.classLoaderResolverName
  syncloader
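
(A sketch of where that property would typically go, assuming datanucleus.* settings in 
hive-site.xml are forwarded to DataNucleus the same way Hive's other datanucleus 
properties are; "syncloader" is whatever name the custom resolver registers itself under.)

{noformat}
<!-- Hedged sketch: assumes hive-site.xml placement. -->
<property>
  <name>datanucleus.classLoaderResolverName</name>
  <value>syncloader</value>
</property>
{noformat}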




> Concurrent metastore threading problem 
> ---
>
> Key: HIVE-1539
> URL: https://issues.apache.org/jira/browse/HIVE-1539
> Project: Hadoop Hive
>  Issue Type: Bug
>  Components: Metastore
>Affects Versions: 0.7.0
>Reporter: Bennie Schut
>Assignee: Bennie Schut
> Attachments: ClassLoaderResolver.patch, thread_dump_hanging.txt
>
>
> When running hive as a service and running a high number of queries 
> concurrently I end up with multiple threads running at 100% cpu without any 
> progress.
> Looking at these threads I notice this thread(484e):
> at 
> org.apache.hadoop.hive.metastore.ObjectStore.getMTable(ObjectStore.java:598)
> But on a different thread(63a2):
> at 
> org.apache.hadoop.hive.metastore.model.MStorageDescriptor.jdoReplaceField(MStorageDescriptor.java)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-1539) Concurrent metastore threading problem

2010-09-02 Thread Bennie Schut (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905455#action_12905455
 ] 

Bennie Schut commented on HIVE-1539:


JDOClassLoaderResolver doesn't seem to be thread-safe. That's a bit of a 
surprise. I filed a bug with DataNucleus: 
http://www.datanucleus.org/servlet/jira/browse/NUCCORE-559
I just made my own thread-safe version of the JDOClassLoaderResolver and am 
loading it to see if that fixes it. It will probably take a few days to be sure 
it got fixed.
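
(A minimal sketch of the wrap-and-synchronize idea described here. The Resolver interface 
below is a hypothetical stand-in; DataNucleus's real ClassLoaderResolver interface has a 
larger surface, and the attached ClassLoaderResolver.patch may take a different approach. 
This only illustrates serializing access to a non-thread-safe delegate.)

{noformat}
// Illustration only: "Resolver" is a hypothetical stand-in, not the real
// DataNucleus ClassLoaderResolver interface, and this is not the attached patch.
interface Resolver {
  Class<?> classForName(String name) throws ClassNotFoundException;
}

final class SynchronizedResolver implements Resolver {
  private final Resolver delegate;

  SynchronizedResolver(Resolver delegate) {
    this.delegate = delegate;
  }

  @Override
  public synchronized Class<?> classForName(String name) throws ClassNotFoundException {
    // A single monitor serializes all lookups that touch the delegate's
    // non-thread-safe internal state.
    return delegate.classForName(name);
  }
}
{noformat}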

> Concurrent metastore threading problem 
> ---
>
> Key: HIVE-1539
> URL: https://issues.apache.org/jira/browse/HIVE-1539
> Project: Hadoop Hive
>  Issue Type: Bug
>  Components: Metastore
>Affects Versions: 0.7.0
>Reporter: Bennie Schut
>Assignee: Bennie Schut
> Attachments: thread_dump_hanging.txt
>
>
> When running hive as a service and running a high number of queries 
> concurrently I end up with multiple threads running at 100% cpu without any 
> progress.
> Looking at these threads I notice this thread(484e):
> at 
> org.apache.hadoop.hive.metastore.ObjectStore.getMTable(ObjectStore.java:598)
> But on a different thread(63a2):
> at 
> org.apache.hadoop.hive.metastore.model.MStorageDescriptor.jdoReplaceField(MStorageDescriptor.java)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (HIVE-849) .. not supported

2010-09-02 Thread Carl Steinbach (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Carl Steinbach resolved HIVE-849.
-

Resolution: Duplicate

> .. not supported
> 
>
> Key: HIVE-849
> URL: https://issues.apache.org/jira/browse/HIVE-849
> Project: Hadoop Hive
>  Issue Type: New Feature
>Reporter: Namit Jain
>Assignee: He Yongqiang
>


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.