[jira] Updated: (HIVE-1600) need to sort hook input/output lists for test result determinism

2010-08-27 Thread John Sichi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Sichi updated HIVE-1600:
-

Status: Patch Available  (was: Open)

Tests passed.

> need to sort hook input/output lists for test result determinism
> 
>
> Key: HIVE-1600
> URL: https://issues.apache.org/jira/browse/HIVE-1600
> Project: Hadoop Hive
>  Issue Type: Bug
>  Components: Testing Infrastructure
>Affects Versions: 0.6.0
>Reporter: John Sichi
>Assignee: John Sichi
> Fix For: 0.7.0
>
> Attachments: HIVE-1600.1.patch, HIVE-1600.2.patch
>
>
> Begin forwarded message:
> From: Ning Zhang 
> Date: August 26, 2010 2:47:26 PM PDT
> To: John Sichi 
> Cc: "hive-dev@hadoop.apache.org" 
> Subject: Re: failure in load_dyn_part1.q
> Yes, I saw this error before, but it does not repro. So it's probably an 
> ordering issue in POSTHOOK. 
> On Aug 26, 2010, at 2:39 PM, John Sichi wrote:
> I'm seeing this failure due to a result diff when running tests on latest 
> trunk:
> POSTHOOK: Input: defa...@srcpart@ds=2008-04-08/hr=12
> POSTHOOK: Input: defa...@srcpart@ds=2008-04-09/hr=11
> POSTHOOK: Input: defa...@srcpart@ds=2008-04-09/hr=12
> -POSTHOOK: Output: defa...@nzhang_part2@ds=2008-12-31/hr=11
> -POSTHOOK: Output: defa...@nzhang_part2@ds=2008-12-31/hr=12
> POSTHOOK: Output: defa...@nzhang_part1@ds=2008-04-08/hr=11
> POSTHOOK: Output: defa...@nzhang_part1@ds=2008-04-08/hr=12
> +POSTHOOK: Output: defa...@nzhang_part2@ds=2008-12-31/hr=11
> +POSTHOOK: Output: defa...@nzhang_part2@ds=2008-12-31/hr=12
> Did something change recently?  Or are we missing a Java-level sort on the 
> input/output list for determinism?
> JVS
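
For context, a minimal sketch of the kind of Java-level sort being discussed
(the class and method here are hypothetical, not the actual hook code):

{code}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Illustrative only: sort the hook entity names before printing so the test
// output is deterministic regardless of the underlying Set's iteration order.
public class HookOutputSorter {
  public static List<String> sortedNames(Set<?> entities) {
    List<String> names = new ArrayList<String>();
    for (Object e : entities) {
      names.add(String.valueOf(e));
    }
    Collections.sort(names);  // deterministic order for test diffs
    return names;
  }
}
{code}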

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (HIVE-1016) Ability to access DistributedCache from UDFs

2010-08-27 Thread Carl Steinbach (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Carl Steinbach reassigned HIVE-1016:


Assignee: Carl Steinbach  (was: Namit Jain)

> Ability to access DistributedCache from UDFs
> 
>
> Key: HIVE-1016
> URL: https://issues.apache.org/jira/browse/HIVE-1016
> Project: Hadoop Hive
>  Issue Type: New Feature
>  Components: Query Processor
>Reporter: Carl Steinbach
>Assignee: Carl Steinbach
>
> There have been several requests on the mailing list for
> information about how to access the DistributedCache from UDFs, e.g.:
> http://www.mail-archive.com/hive-u...@hadoop.apache.org/msg01650.html
> http://www.mail-archive.com/hive-u...@hadoop.apache.org/msg01926.html
> While responses to these emails suggested several workarounds, the only 
> correct way of accessing the distributed cache is via the static methods of 
> Hadoop's DistributedCache class, and all of these methods require that the 
> JobConf be passed in as a parameter. Hence, giving UDFs access to the 
> distributed cache reduces to giving UDFs access to the JobConf.
> I propose the following changes to GenericUDF/UDAF/UDTF:
> * Add an exec_init(Configuration conf) method that is called during Operator 
> initialization at runtime.
> * Change the name of the "initialize" method to "compile_init" to make it 
> clear that this method is called at compile-time.
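
A rough sketch of the proposed lifecycle, mirroring the two bullets above;
this is not a committed Hive API, and the argument types are simplified:

{code}
import org.apache.hadoop.conf.Configuration;

// Sketch of the proposed split between compile-time and runtime
// initialization; the method names follow the proposal, not shipped code.
public abstract class GenericUDFSketch {

  // Today's initialize(), renamed to make the compile-time contract clear.
  public abstract void compile_init(Object[] argumentInspectors);

  // New runtime hook called during Operator initialization; the
  // Configuration gives the UDF access to the DistributedCache.
  public void exec_init(Configuration conf) {
    // Default: no-op. UDFs that need cached files would override this.
  }
}
{code}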

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-1016) Ability to access DistributedCache from UDFs

2010-08-27 Thread Carl Steinbach (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903748#action_12903748
 ] 

Carl Steinbach commented on HIVE-1016:
--

Yes, I'm working on it. I'll have a patch ready for review by Monday. 
(Reassigned this back to myself).

> Ability to access DistributedCache from UDFs
> 
>
> Key: HIVE-1016
> URL: https://issues.apache.org/jira/browse/HIVE-1016
> Project: Hadoop Hive
>  Issue Type: New Feature
>  Components: Query Processor
>Reporter: Carl Steinbach
>Assignee: Carl Steinbach
>
> There have been several requests on the mailing list for
> information about how to access the DistributedCache from UDFs, e.g.:
> http://www.mail-archive.com/hive-u...@hadoop.apache.org/msg01650.html
> http://www.mail-archive.com/hive-u...@hadoop.apache.org/msg01926.html
> While responses to these emails suggested several workarounds, the only 
> correct way of accessing the distributed cache is via the static methods of 
> Hadoop's DistributedCache class, and all of these methods require that the 
> JobConf be passed in as a parameter. Hence, giving UDFs access to the 
> distributed cache reduces to giving UDFs access to the JobConf.
> I propose the following changes to GenericUDF/UDAF/UDTF:
> * Add an exec_init(Configuration conf) method that is called during Operator 
> initialization at runtime.
> * Change the name of the "initialize" method to "compile_init" to make it 
> clear that this method is called at compile-time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-1546) Ability to plug custom Semantic Analyzers for Hive Grammar

2010-08-27 Thread John Sichi (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903746#action_12903746
 ] 

John Sichi commented on HIVE-1546:
--

More specifically, for the last suggestion, where we currently have

{noformat}
    case HiveParser.TOK_TBLRCFILE:
      inputFormat = RCFILE_INPUT;
      outputFormat = RCFILE_OUTPUT;
      shared.serde = COLUMNAR_SERDE;
      storageFormat = true;
{noformat}

instead do

{noformat}
    case HiveParser.TOK_TBLRCFILE:
      processGenericFileFormat("RCFILE");
      break;
    case HiveParser.TOK_FILEFORMAT_GENERIC:
      processGenericFileFormat(child.getChild(0).getText());
      break;

...

  void processGenericFileFormat(String formatName) {
    Map<String, String> props = handleGenericFileFormat(formatName);
    inputFormat = props.get(Constants.FILE_INPUT_FORMAT);
    outputFormat = props.get(Constants.FILE_OUTPUT_FORMAT);
    shared.serde = props.get(Constants.META_TABLE_SERDE);
    ...
  }
{noformat}

Then in Hive's version of handleGenericFileFormat, make it return the right 
info for SEQUENCEFILE/TEXTFILE/RCFILE.

This is just a sketch, not the real code, but I hope it makes sense.


> Ability to plug custom Semantic Analyzers for Hive Grammar
> --
>
> Key: HIVE-1546
> URL: https://issues.apache.org/jira/browse/HIVE-1546
> Project: Hadoop Hive
>  Issue Type: Improvement
>  Components: Metastore
>Affects Versions: 0.7.0
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Fix For: 0.7.0
>
> Attachments: hive-1546.patch, hive-1546_2.patch
>
>
> It will be useful if Semantic Analysis phase is made pluggable such that 
> other projects can do custom analysis of hive queries before doing metastore 
> operations on them. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-1546) Ability to plug custom Semantic Analyzers for Hive Grammar

2010-08-27 Thread John Sichi (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903738#action_12903738
 ] 

John Sichi commented on HIVE-1546:
--

Yes, that is what I was envisioning.  I think the interface as you've specified 
it looks close to the abstract functionality of the semantic analyzer, which is 
what we want (although we should use interfaces such as Set in preference to 
concrete classes such as HashSet, something which is currently crufty 
throughout Hive).

I agree that this could be too involved for your first patch, and we would 
probably need to evolve the HiveSemanticAnalyzer interface anyway.  So, if 
you're not comfortable going there, let's scale it back to the approach in the 
first patch (with HiveSemanticAnalyzerFactory returning BaseSemanticAnalyzer).

handleGenericFileFormat:  if it's only a hook for Howl, then you can have a 
separate method inside of Howl which returns whatever you want, but wrap it 
with a void method which overrides a void one in Hive (and discards the return 
values).  Or, if the idea is to have Hive use this too, then go ahead and add 
Javadoc specifying exactly what the return map is supposed to contain, and then 
convert the existing SEQUENCEFILE/TEXTFILE/RCFILE cases so they go through the 
generic path.  (But still keep them as reserved words rather than literal 
strings for backwards compatibility.)
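
A hedged sketch of the void-wrapper idea (all class and method names here are
hypothetical, not actual Hive or Howl code):

{code}
import java.util.Collections;
import java.util.Map;

// Hive side: a void hook that subclasses may override.
class HiveAnalyzerSketch {
  protected void handleGenericFileFormat(String formatName) {
    // Hive's default: nothing to do for the built-in formats here.
  }
}

// Howl side: the void override delegates to a Howl-internal method and
// discards its return value, keeping the Hive-facing signature void.
class HowlAnalyzerSketch extends HiveAnalyzerSketch {
  @Override
  protected void handleGenericFileFormat(String formatName) {
    Map<String, String> props = howlFileFormatProps(formatName);
    // Howl consumes props internally; nothing is returned to Hive.
  }

  private Map<String, String> howlFileFormatProps(String formatName) {
    return Collections.emptyMap();  // placeholder for Howl's real lookup
  }
}
{code}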


> Ability to plug custom Semantic Analyzers for Hive Grammar
> --
>
> Key: HIVE-1546
> URL: https://issues.apache.org/jira/browse/HIVE-1546
> Project: Hadoop Hive
>  Issue Type: Improvement
>  Components: Metastore
>Affects Versions: 0.7.0
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Fix For: 0.7.0
>
> Attachments: hive-1546.patch, hive-1546_2.patch
>
>
> It will be useful if Semantic Analysis phase is made pluggable such that 
> other projects can do custom analysis of hive queries before doing metastore 
> operations on them. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HIVE-1600) need to sort hook input/output lists for test result determinism

2010-08-27 Thread John Sichi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Sichi updated HIVE-1600:
-

Attachment: HIVE-1600.2.patch

Still running through tests on this one.

> need to sort hook input/output lists for test result determinism
> 
>
> Key: HIVE-1600
> URL: https://issues.apache.org/jira/browse/HIVE-1600
> Project: Hadoop Hive
>  Issue Type: Bug
>  Components: Testing Infrastructure
>Affects Versions: 0.6.0
>Reporter: John Sichi
>Assignee: John Sichi
> Fix For: 0.7.0
>
> Attachments: HIVE-1600.1.patch, HIVE-1600.2.patch
>
>
> Begin forwarded message:
> From: Ning Zhang 
> Date: August 26, 2010 2:47:26 PM PDT
> To: John Sichi 
> Cc: "hive-dev@hadoop.apache.org" 
> Subject: Re: failure in load_dyn_part1.q
> Yes, I saw this error before, but it does not repro. So it's probably an 
> ordering issue in POSTHOOK. 
> On Aug 26, 2010, at 2:39 PM, John Sichi wrote:
> I'm seeing this failure due to a result diff when running tests on latest 
> trunk:
> POSTHOOK: Input: defa...@srcpart@ds=2008-04-08/hr=12
> POSTHOOK: Input: defa...@srcpart@ds=2008-04-09/hr=11
> POSTHOOK: Input: defa...@srcpart@ds=2008-04-09/hr=12
> -POSTHOOK: Output: defa...@nzhang_part2@ds=2008-12-31/hr=11
> -POSTHOOK: Output: defa...@nzhang_part2@ds=2008-12-31/hr=12
> POSTHOOK: Output: defa...@nzhang_part1@ds=2008-04-08/hr=11
> POSTHOOK: Output: defa...@nzhang_part1@ds=2008-04-08/hr=12
> +POSTHOOK: Output: defa...@nzhang_part2@ds=2008-12-31/hr=11
> +POSTHOOK: Output: defa...@nzhang_part2@ds=2008-12-31/hr=12
> Did something change recently?  Or are we missing a Java-level sort on the 
> input/output list for determinism?
> JVS

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (HIVE-1016) Ability to access DistributedCache from UDFs

2010-08-27 Thread Namit Jain (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Namit Jain reassigned HIVE-1016:


Assignee: Namit Jain  (was: Carl Steinbach)

> Ability to access DistributedCache from UDFs
> 
>
> Key: HIVE-1016
> URL: https://issues.apache.org/jira/browse/HIVE-1016
> Project: Hadoop Hive
>  Issue Type: New Feature
>  Components: Query Processor
>Reporter: Carl Steinbach
>Assignee: Namit Jain
>
> There have been several requests on the mailing list for
> information about how to access the DistributedCache from UDFs, e.g.:
> http://www.mail-archive.com/hive-u...@hadoop.apache.org/msg01650.html
> http://www.mail-archive.com/hive-u...@hadoop.apache.org/msg01926.html
> While responses to these emails suggested several workarounds, the only 
> correct way of accessing the distributed cache is via the static methods of 
> Hadoop's DistributedCache class, and all of these methods require that the 
> JobConf be passed in as a parameter. Hence, giving UDFs access to the 
> distributed cache reduces to giving UDFs access to the JobConf.
> I propose the following changes to GenericUDF/UDAF/UDTF:
> * Add an exec_init(Configuration conf) method that is called during Operator 
> initialization at runtime.
> * Change the name of the "initialize" method to "compile_init" to make it 
> clear that this method is called at compile-time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HIVE-1492) FileSinkOperator should remove duplicated files from the same task based on file sizes

2010-08-27 Thread Carl Steinbach (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Carl Steinbach updated HIVE-1492:
-

Fix Version/s: (was: 0.7.0)
Affects Version/s: (was: 0.7.0)
  Component/s: Query Processor

> FileSinkOperator should remove duplicated files from the same task based on 
> file sizes
> --
>
> Key: HIVE-1492
> URL: https://issues.apache.org/jira/browse/HIVE-1492
> Project: Hadoop Hive
>  Issue Type: Bug
>  Components: Query Processor
>Reporter: Ning Zhang
>Assignee: Ning Zhang
> Fix For: 0.6.0
>
> Attachments: HIVE-1492.patch, HIVE-1492_branch-0.6.patch
>
>
> FileSinkOperator.jobClose() calls Utilities.removeTempOrDuplicateFiles() to 
> retain only one file for each task. A task could produce multiple files due 
> to failed attempts or speculative runs. The largest file should be retained 
> rather than the first file for each task. 
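
A minimal sketch of the size-based selection described above (illustrative
only, not the actual Utilities.removeTempOrDuplicateFiles() code):

{code}
import org.apache.hadoop.fs.FileStatus;

// Illustrative: among duplicate files produced by attempts of the same
// task, keep the largest instead of whichever happens to be listed first.
public class LargestFilePicker {
  public static FileStatus pickLargest(FileStatus[] candidates) {
    FileStatus best = null;
    for (FileStatus fs : candidates) {
      if (best == null || fs.getLen() > best.getLen()) {
        best = fs;
      }
    }
    return best;  // caller would delete the remaining duplicates
  }
}
{code}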

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HIVE-1411) DataNucleus throws NucleusException if core-3.1.1 JAR appears more than once on CLASSPATH

2010-08-27 Thread Carl Steinbach (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Carl Steinbach updated HIVE-1411:
-

Fix Version/s: (was: 0.7.0)

> DataNucleus throws NucleusException if core-3.1.1 JAR appears more than once 
> on CLASSPATH
> -
>
> Key: HIVE-1411
> URL: https://issues.apache.org/jira/browse/HIVE-1411
> Project: Hadoop Hive
>  Issue Type: Bug
>  Components: Metastore
>Affects Versions: 0.4.0, 0.4.1, 0.5.0
>Reporter: Carl Steinbach
>Assignee: Carl Steinbach
> Fix For: 0.6.0
>
> Attachments: HIVE-1411.patch.txt
>
>
> DataNucleus barfs when the core-3.1.1 JAR file appears more than once on the 
> CLASSPATH:
> {code}
> 2010-03-06 12:33:25,565 ERROR exec.DDLTask 
> (SessionState.java:printError(279)) - FAILED: Error in metadata: 
> javax.jdo.JDOFatalInternalException: Unexpected exception caught. 
> NestedThrowables: 
> java.lang.reflect.InvocationTargetException 
> org.apache.hadoop.hive.ql.metadata.HiveException: 
> javax.jdo.JDOFatalInternalException: Unexpected exception caught. 
> NestedThrowables: 
> java.lang.reflect.InvocationTargetException 
> at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:258) 
> at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:879) 
> at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:103) 
> at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:379) 
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:285) 
> at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:123) 
> at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:181) 
> at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:287) 
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597) 
> at org.apache.hadoop.util.RunJar.main(RunJar.java:156) 
> Caused by: javax.jdo.JDOFatalInternalException: Unexpected exception caught. 
> NestedThrowables: 
> java.lang.reflect.InvocationTargetException 
> at 
> javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1186)
> at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:803) 
> at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:698) 
> at org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:164) 
> at 
> org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:181)
> at 
> org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:125) 
> at org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:104) 
> at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62) 
> at 
> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117) 
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:130)
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:146)
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:118)
>  
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.(HiveMetaStore.java:100)
>  
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:74)
>  
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:783) 
> at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:794) 
> at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:252) 
> ... 12 more 
> Caused by: java.lang.reflect.InvocationTargetException 
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597) 
> at javax.jdo.JDOHelper$16.run(JDOHelper.java:1956) 
> at java.security.AccessController.doPrivileged(Native Method) 
> at javax.jdo.JDOHelper.invoke(JDOHelper.java:1951) 
> at 
> javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1159)
> ... 28 more 
> Caused by: org.datanucleus.exceptions.NucleusException: Plugin (Bundle) 
> "org.eclipse.jdt.core" is already registered. Ensure you do not have 
> multiple JAR versions of the same plugin in the classpath. The URL 
> "file:/Users/hadop/hadoop-0.20.1+152/build/ivy/lib/Hadoop/common/core-3.1.1.jar" 
> is already registered, and you are trying to register an identical plugin 
> located at URL "file:/Users/hadop/hadoop-0

[jira] Updated: (HIVE-1531) Make Hive build work with Ivy versions < 2.1.0

2010-08-27 Thread Carl Steinbach (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Carl Steinbach updated HIVE-1531:
-

Fix Version/s: (was: 0.7.0)

> Make Hive build work with Ivy versions < 2.1.0
> --
>
> Key: HIVE-1531
> URL: https://issues.apache.org/jira/browse/HIVE-1531
> Project: Hadoop Hive
>  Issue Type: Improvement
>  Components: Build Infrastructure
>Reporter: Carl Steinbach
>Assignee: Carl Steinbach
> Fix For: 0.6.0
>
> Attachments: HIVE-1531.patch.txt
>
>
> Many projects in the Hadoop ecosystem still use Ivy 2.0.0 (including Hadoop 
> and Pig), yet Hive requires version 2.1.0. Ordinarily this would not be a 
> problem, but many users have a copy of an older version of Ivy in their 
> $ANT_HOME directory, and this copy will always get picked up in preference 
> to what the Hive build downloads for itself.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-1602) List Partitioning

2010-08-27 Thread Ning Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903708#action_12903708
 ] 

Ning Zhang commented on HIVE-1602:
--

I agree this will be a big change; we are just tossing ideas around here and 
don't have a final plan yet. 

HAR is one idea, and we should definitely try it once HIVE-1467 is done. But 
as you said, it won't change the number of partitions. Some of our tables have 
more than 240 partitions each day, and with dynamic partitions it is very easy 
to increase that even more. 

Another idea Namit and I were discussing is to store a mapping from a list of 
values, e.g. {'s', 'm', 'l'}, to the actual partition location, and keep this 
mapping in the metastore. This essentially separates the logical concept of a 
partition from its physical storage location (HDFS directory). This could be a 
big change and could break the assumptions of users who rely on the reverse of 
the mapping (figuring out the partition from the HDFS directory). 

If we decide to go this route, inserting is easy: we get the mapping from the 
metastore and decide which directory to write to for a given output row. 
Querying is a bit more complicated, since the partition pruning phase needs to 
figure out which physical directory a partition corresponds to and read the 
partition column value from the data file itself rather than from the 
directory name. The overhead is that the partition column values need extra 
storage in the data files; but if we sort on the partitioning column and use 
RCFile with column-level run-length compression (which we already support), 
the storage overhead is very small. 
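
As a concrete illustration of that mapping, a hypothetical sketch in plain
Java (not an actual metastore schema; the paths are made up):

{code}
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical: several DP column values map to one physical location,
// decoupling the logical partition from the HDFS directory name.
public class ListPartitionMapSketch {
  public static void main(String[] args) {
    Map<Set<String>, String> valueListToLocation =
        new HashMap<Set<String>, String>();
    valueListToLocation.put(
        new HashSet<String>(Arrays.asList("s", "m", "l")),
        "/warehouse/t/part_sml");  // three small values share one directory
    valueListToLocation.put(
        new HashSet<String>(Arrays.asList("g")),
        "/warehouse/t/part_g");    // a large value keeps its own directory
    System.out.println(valueListToLocation);
  }
}
{code}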

> List Partitioning
> -
>
> Key: HIVE-1602
> URL: https://issues.apache.org/jira/browse/HIVE-1602
> Project: Hadoop Hive
>  Issue Type: New Feature
>Affects Versions: 0.7.0
>Reporter: Ning Zhang
>
> Dynamic partition inserts create partitions based on the dynamic partition 
> column values. Currently it creates one partition for each distinct DP column 
> value. This could result in skew in the created dynamic partitions, in that 
> some partitions are large but there could be a large number of small 
> partitions as well. This puts a burden on HDFS as well as the metastore. A 
> list partitioning scheme that aggregates a number of small partitions into 
> one big one is preferable for skewed partitions. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HIVE-1543) set abort in ExecMapper when Hive's record reader got an IOException

2010-08-27 Thread Carl Steinbach (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Carl Steinbach updated HIVE-1543:
-

Component/s: Query Processor

> set abort in ExecMapper when Hive's record reader got an IOException
> 
>
> Key: HIVE-1543
> URL: https://issues.apache.org/jira/browse/HIVE-1543
> Project: Hadoop Hive
>  Issue Type: Improvement
>  Components: Query Processor
>Reporter: Ning Zhang
>Assignee: Ning Zhang
> Fix For: 0.6.0
>
> Attachments: HIVE-1543.1.patch, HIVE-1543.2_branch0.6.patch, 
> HIVE-1543.patch, HIVE-1543_branch0.6.patch
>
>
> When the RecordReader gets an IOException, ExecMapper does not know about it 
> and will close the operators as if there were no error. We should catch this 
> exception and avoid writing partial results to HDFS, which would be removed 
> later anyway.
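
A hedged sketch of the intended behavior (the names are illustrative, not the
actual ExecMapper code):

{code}
import java.io.IOException;

// Illustrative: surface the reader's IOException so the mapper aborts
// instead of closing operators as if the input were fully consumed.
public class RecordReadLoopSketch {
  interface RowReader {
    boolean next() throws IOException;
  }

  private boolean abort = false;

  void readAll(RowReader reader) {
    try {
      while (reader.next()) {
        // process the row ...
      }
    } catch (IOException e) {
      abort = true;                    // checked before closing operators
      throw new RuntimeException(e);   // fail the task attempt
    }
  }
}
{code}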

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HIVE-1307) More generic and efficient merge method

2010-08-27 Thread Carl Steinbach (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Carl Steinbach updated HIVE-1307:
-

Fix Version/s: 0.6.0
   (was: 0.7.0)
Affects Version/s: (was: 0.6.0)
  Component/s: Query Processor

> More generic and efficient merge method
> ---
>
> Key: HIVE-1307
> URL: https://issues.apache.org/jira/browse/HIVE-1307
> Project: Hadoop Hive
>  Issue Type: New Feature
>  Components: Query Processor
>Reporter: Ning Zhang
>Assignee: Ning Zhang
> Fix For: 0.6.0
>
> Attachments: HIVE-1307.0.patch, HIVE-1307.2.patch, HIVE-1307.3.patch, 
> HIVE-1307.3_java.patch, HIVE-1307.4.patch, HIVE-1307.5.patch, 
> HIVE-1307.6.patch, HIVE-1307.7.patch, HIVE-1307.8.patch, HIVE-1307.9.patch, 
> HIVE-1307.patch, HIVE-1307_2_branch_0.6.patch, HIVE-1307_branch_0.6.patch, 
> HIVE-1307_java_only.patch
>
>
> Currently if hive.merge.mapfiles/mapredfiles=true, a new mapreduce job is 
> created to read the input files and output to one reducer for merging. This 
> MR job is created at compile time, with one MR job per partition. In the 
> dynamic partition case, multiple partitions could be created at execution 
> time, and generating the merging MR job at compile time is impossible. 
> We should generalize the merge framework to allow multiple partitions; most 
> of the time a map-only job should be sufficient if we use 
> CombineHiveInputFormat. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-1016) Ability to access DistributedCache from UDFs

2010-08-27 Thread Namit Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903707#action_12903707
 ] 

Namit Jain commented on HIVE-1016:
--

Carl, are you working on this?

We need this pretty urgently - otherwise, I can take it.

> Ability to access DistributedCache from UDFs
> 
>
> Key: HIVE-1016
> URL: https://issues.apache.org/jira/browse/HIVE-1016
> Project: Hadoop Hive
>  Issue Type: New Feature
>  Components: Query Processor
>Reporter: Carl Steinbach
>Assignee: Carl Steinbach
>
> There have been several requests on the mailing list for
> information about how to access the DistributedCache from UDFs, e.g.:
> http://www.mail-archive.com/hive-u...@hadoop.apache.org/msg01650.html
> http://www.mail-archive.com/hive-u...@hadoop.apache.org/msg01926.html
> While responses to these emails suggested several workarounds, the only 
> correct way of accessing the distributed cache is via the static methods of 
> Hadoop's DistributedCache class, and all of these methods require that the 
> JobConf be passed in as a parameter. Hence, giving UDFs access to the 
> distributed cache reduces to giving UDFs access to the JobConf.
> I propose the following changes to GenericUDF/UDAF/UDTF:
> * Add an exec_init(Configuration conf) method that is called during Operator 
> initialization at runtime.
> * Change the name of the "initialize" method to "compile_init" to make it 
> clear that this method is called at compile-time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HIVE-1401) Web Interface can only browse default

2010-08-27 Thread Carl Steinbach (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Carl Steinbach updated HIVE-1401:
-

Fix Version/s: (was: 0.7.0)

> Web Interface can only browse default
> 
>
> Key: HIVE-1401
> URL: https://issues.apache.org/jira/browse/HIVE-1401
> Project: Hadoop Hive
>  Issue Type: New Feature
>  Components: Web UI
>Affects Versions: 0.5.0
>Reporter: Edward Capriolo
>Assignee: Edward Capriolo
> Fix For: 0.6.0
>
> Attachments: HIVE-1401-1-patch.txt
>
>


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HIVE-1594) Typo of hive.merge.size.smallfiles.avgsize prevents change of value

2010-08-27 Thread Carl Steinbach (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Carl Steinbach updated HIVE-1594:
-

Fix Version/s: (was: 0.7.0)
Affects Version/s: (was: 0.6.0)
   (was: 0.7.0)

> Typo of hive.merge.size.smallfiles.avgsize prevents change of value
> ---
>
> Key: HIVE-1594
> URL: https://issues.apache.org/jira/browse/HIVE-1594
> Project: Hadoop Hive
>  Issue Type: Bug
>  Components: Configuration
>Affects Versions: 0.5.0
>Reporter: Yun Huang Yong
>Assignee: Yun Huang Yong
>Priority: Minor
> Fix For: 0.6.0
>
> Attachments: HIVE-1594-0.5.patch, HIVE-1594.patch
>
>
> The setting is described as hive.merge.size.smallfiles.avgsize; however, 
> common/src/java/org/apache/hadoop/hive/conf/HiveConf.java reads it as 
> "hive.merge.smallfiles.avgsize" (note the missing '.size.'), so the user's 
> setting has no effect and the value is stuck at the default of 16MB.
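
A hedged illustration of the mismatch using plain Hadoop Configuration (the
keys come from the description above; the class and flow are illustrative,
not HiveConf's actual code):

{code}
import org.apache.hadoop.conf.Configuration;

// Illustrative reproduction of the mismatch: the user sets the documented
// key, but the code reads a differently spelled one and falls back to 16MB.
public class KeyMismatchSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration(false);
    conf.setLong("hive.merge.size.smallfiles.avgsize", 64000000L); // user's intent
    long used = conf.getLong("hive.merge.smallfiles.avgsize",      // key Hive reads
                             16000000L);
    System.out.println(used);  // prints 16000000: the override is ignored
  }
}
{code}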

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-1602) List Partitioning

2010-08-27 Thread Joydeep Sen Sarma (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903692#action_12903692
 ] 

Joydeep Sen Sarma commented on HIVE-1602:
-

Yeah, but I have been asking how you are planning to make the grouping of 
partitions transparent. To me that sounds like a very risky and big change, 
and there are no details here.

Why would we do this at the Hive layer, given that we have HAR already?

I really don't understand why we wouldn't start with HIVE-1467 and then add 
HAR as an optimization to reduce the number of files for small partitions. 
HAR doesn't address the skew case, and it doesn't address the fact that we 
still have to partition by the dynamic partitioning columns - which requires 
the same partition-only map-reduce operator that HIVE-1467 requires. At which 
point, we can just do HIVE-1467.

What am I missing?

> List Partitioning
> -
>
> Key: HIVE-1602
> URL: https://issues.apache.org/jira/browse/HIVE-1602
> Project: Hadoop Hive
>  Issue Type: New Feature
>Affects Versions: 0.7.0
>Reporter: Ning Zhang
>
> Dynamic partition inserts create partitions based on the dynamic partition 
> column values. Currently it creates one partition for each distinct DP column 
> value. This could result in skew in the created dynamic partitions, in that 
> some partitions are large but there could be a large number of small 
> partitions as well. This puts a burden on HDFS as well as the metastore. A 
> list partitioning scheme that aggregates a number of small partitions into 
> one big one is preferable for skewed partitions. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-1602) List Partitioning

2010-08-27 Thread Ning Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903686#action_12903686
 ] 

Ning Zhang commented on HIVE-1602:
--

> insert overwrite table xxx partition (event_class) select a,b,c,event, 
> case event when 's' then 'sml' when 'm' then 'sml' when 'l' then 'sml' else 
> 'g' end from ...

This changes the schema of table xxx. Now the user has to change his query to 
include event_class in his where clause, no?


> List Partitioning
> -
>
> Key: HIVE-1602
> URL: https://issues.apache.org/jira/browse/HIVE-1602
> Project: Hadoop Hive
>  Issue Type: New Feature
>Affects Versions: 0.7.0
>Reporter: Ning Zhang
>
> Dynamic partition inserts create partitions based on the dynamic partition 
> column values. Currently it creates one partition for each distinct DP column 
> value. This could result in skew in the created dynamic partitions, in that 
> some partitions are large but there could be a large number of small 
> partitions as well. This puts a burden on HDFS as well as the metastore. A 
> list partitioning scheme that aggregates a number of small partitions into 
> one big one is preferable for skewed partitions. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-1602) List Partitioning

2010-08-27 Thread Joydeep Sen Sarma (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903672#action_12903672
 ] 

Joydeep Sen Sarma commented on HIVE-1602:
-

> combining small partitions into one large partition seems to be a natural 
> way.

Sure - but I am worried that this is a fundamental change to Hive's data model 
and may not be the quickest/safest solution to what is a pretty urgent problem.

Also - HAR already solves packing small files into a big file, and it doesn't 
require changes to Hive's data model. So in that sense it seems like an easy 
win.

You are still left with the large partition (skew) problem; this doesn't solve 
that either (assuming you are using reducers).

>  How can the user manually cluster event=s, event=m, event=l into one

insert overwrite table xxx partition (event_class) select a,b,c,event, 
case event when 's' then 'sml' when 'm' then 'sml' when 'l' then 'sml' else 
'g' end from ...

> List Partitioning
> -
>
> Key: HIVE-1602
> URL: https://issues.apache.org/jira/browse/HIVE-1602
> Project: Hadoop Hive
>  Issue Type: New Feature
>Affects Versions: 0.7.0
>Reporter: Ning Zhang
>
> Dynamic partition inserts create partitions based on the dynamic partition 
> column values. Currently it creates one partition for each distinct DP column 
> value. This could result in skew in the created dynamic partitions, in that 
> some partitions are large but there could be a large number of small 
> partitions as well. This puts a burden on HDFS as well as the metastore. A 
> list partitioning scheme that aggregates a number of small partitions into 
> one big one is preferable for skewed partitions. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-1602) List Partitioning

2010-08-27 Thread Ning Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903664#action_12903664
 ] 

Ning Zhang commented on HIVE-1602:
--

@joydeep, this is intended to be an open-ended discussion about how to tackle 
partition skew. Combining small partitions into one large partition seems to 
be a natural way. Maybe the name "list partitioning" is not so obvious, but I 
meant mapping a list of values of the DP column to one partition, rather than 
a 1-to-1 mapping.

HAR is one option, and we can keep the partition spec as part of the file name 
so that the actual column is not stored. 

Another way is to store the partition column value in the data file itself if 
the partition corresponds to a list of values. 

> the user can do a one time analysis of the data (for size distribution on 
> different partitioning columns) and then generate the clumping logic manually.

The problem is that there is no way for the user to manually cluster data with 
different partition column values. For example, if event is a DP column and 
you find a couple of large partitions, event = {'l', 'g'}, and three small 
partitions, event = {'s', 'm', 'l'}, how can the user manually cluster 
event=s, event=m, event=l into one? If there are a lot of these small 
partitions, it introduces a lot of problems on the HDFS, metastore, and Hive 
client sides. 

> List Partitioning
> -
>
> Key: HIVE-1602
> URL: https://issues.apache.org/jira/browse/HIVE-1602
> Project: Hadoop Hive
>  Issue Type: New Feature
>Affects Versions: 0.7.0
>Reporter: Ning Zhang
>
> Dynamic partition inserts create partitions based on the dynamic partition 
> column values. Currently it creates one partition for each distinct DP column 
> value. This could result in skew in the created dynamic partitions, in that 
> some partitions are large but there could be a large number of small 
> partitions as well. This puts a burden on HDFS as well as the metastore. A 
> list partitioning scheme that aggregates a number of small partitions into 
> one big one is preferable for skewed partitions. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-1602) List Partitioning

2010-08-27 Thread Namit Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903662#action_12903662
 ] 

Namit Jain commented on HIVE-1602:
--

I don't think it has anything to do with dynamic partitioning.
It is a scheme to reduce the number of partitions/files in the presence of many 
small partitions.

> List Partitioning
> -
>
> Key: HIVE-1602
> URL: https://issues.apache.org/jira/browse/HIVE-1602
> Project: Hadoop Hive
>  Issue Type: New Feature
>Affects Versions: 0.7.0
>Reporter: Ning Zhang
>
> Dynamic partition inserts create partitions based on the dynamic partition 
> column values. Currently it creates one partition for each distinct DP column 
> value. This could result in skew in the created dynamic partitions, in that 
> some partitions are large but there could be a large number of small 
> partitions as well. This puts a burden on HDFS as well as the metastore. A 
> list partitioning scheme that aggregates a number of small partitions into 
> one big one is preferable for skewed partitions. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-1546) Ability to plug custom Semantic Analyzers for Hive Grammar

2010-08-27 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903660#action_12903660
 ] 

Ashutosh Chauhan commented on HIVE-1546:


bq. What I was thinking in my previous comment was as follows:
* only one level of factory
* define new interface HiveSemanticAnalyzer (not an abstract class): copy 
signatures of interface methods from public methods on BaseSemanticAnalyzer; 
add Javadoc (I can help with that if any of them are non-obvious)
* when callers get new analyzer instance from factory, they refer to it via 
the HiveSemanticAnalyzer interface only

Are you envisioning something like this in Driver.java?
{code}
HiveSemanticAnalyzer sem = HiveSemanticAnalyzerFactory.get(conf);

// Do semantic analysis and plan generation
sem.analyze(tree, ctx);

// validate the plan
sem.validate();

plan = new QueryPlan(command, sem);

// get the output schema
schema = getSchema(sem, conf);
{code} 

Note where {sem} is used. If so, HiveSemanticAnalyzer interface needs to look 
like this:
{code}
public interface HiveSemanticAnalyzer extends Configurable {

  public void analyze(ASTNode ast, Context ctx) throws SemanticException;

  public void validate() throws SemanticException;

  public List<FieldSchema> getResultSchema();

  public FetchTask getFetchTask();

  public List<Task<? extends Serializable>> getRootTasks();

  public HashSet<ReadEntity> getInputs();

  public HashSet<WriteEntity> getOutputs();

  public LineageInfo getLineageInfo();

  public HashMap<String, String> getIdToTableNameMap();
}

{code}

This looks to me like an awkward interface, which has to expose internal 
details of the very class it is trying to hide.
But even if we make an effort to improve on it and come up with a better 
interface, it will mean I need to touch a lot of critical code paths in 
BaseSemanticAnalyzer, SemanticAnalyzer, and DDLSemanticAnalyzer, which I am 
not sure I know well enough to change. If this is not what you were 
envisioning, then I have missed something in what you were suggesting.

Assuming we don't go this route, I liked the approach in my second patch 
better than the one in my first patch. I think it provides better 
abstractions. But if you have a different opinion, I am fine with the first 
one as well.

Regarding handleGenericFileFormat, it's point (a). It can't be void, because 
in Howl, where we override it, it needs to return information to its caller, 
which then works on that information. handleGenericFileFormat() can only make 
use of the token and provide back some information; the caller works on that 
information and modifies the task it has created, which usually happens to be 
of type DDLWork. And even Hive will need to go through this same code flow if 
and when it decides to add more file formats.

> Ability to plug custom Semantic Analyzers for Hive Grammar
> --
>
> Key: HIVE-1546
> URL: https://issues.apache.org/jira/browse/HIVE-1546
> Project: Hadoop Hive
>  Issue Type: Improvement
>  Components: Metastore
>Affects Versions: 0.7.0
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Fix For: 0.7.0
>
> Attachments: hive-1546.patch, hive-1546_2.patch
>
>
> It will be useful if Semantic Analysis phase is made pluggable such that 
> other projects can do custom analysis of hive queries before doing metastore 
> operations on them. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-1602) List Partitioning

2010-08-27 Thread Joydeep Sen Sarma (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903655#action_12903655
 ] 

Joydeep Sen Sarma commented on HIVE-1602:
-

How will this be made transparent from a queryability perspective? I think I 
still don't understand the details.

I agree - if the user does it themselves, they have to duplicate the column. 
But this doesn't seem like a big deal to me (we compress everything anyway, 
and partitioning columns are highly compressible since they will be repeated 
like crazy).

My worry is that this change might be a very big one (representing multiple 
partitions in one storage container). It seems to me a much more fundamental 
change than just fixing dynamic partitioning.



> List Partitioning
> -
>
> Key: HIVE-1602
> URL: https://issues.apache.org/jira/browse/HIVE-1602
> Project: Hadoop Hive
>  Issue Type: New Feature
>Affects Versions: 0.7.0
>Reporter: Ning Zhang
>
> Dynamic partition inserts create partitions based on the dynamic partition 
> column values. Currently it creates one partition for each distinct DP column 
> value. This could result in skew in the created dynamic partitions, in that 
> some partitions are large but there could be a large number of small 
> partitions as well. This puts a burden on HDFS as well as the metastore. A 
> list partitioning scheme that aggregates a number of small partitions into 
> one big one is preferable for skewed partitions. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



RE: Deserializing map column via JDBC (HIVE-1378)

2010-08-27 Thread Steven Wong
A related jira is HIVE-1606 (For a null value in a string column, JDBC driver 
returns the string "NULL"). What happens is the server-side serde already turns 
the null into "NULL". Both null and "NULL" are serialized as "NULL"; the 
client-side serde has no hope. I bring this jira up to point out that JDBC's 
server side uses a serialization format that appears intended for display 
(human consumption) instead of deserialization. The mixing of non-JSON and JSON 
serializations is perhaps another manifestation.

Also, fixing HIVE-1606 will obviously require a server-side change. Both 
HIVE-1606 and HIVE-1378 (the jira at hand) can share some server-side change, 
if HIVE-1378 ends up changing the server side too.

Steven


-Original Message-
From: John Sichi [mailto:jsi...@facebook.com] 
Sent: Friday, August 27, 2010 11:29 AM
To: Steven Wong
Cc: Zheng Shao; hive-dev@hadoop.apache.org; Jerome Boulon
Subject: Re: Deserializing map column via JDBC (HIVE-1378)

I don't know enough about the serdes to say whether that's a problem...maybe 
someone else does?  It seems like as long as the JSON form doesn't include the 
delimiter unescaped, it might work?

JVS

On Aug 26, 2010, at 6:29 PM, Steven Wong wrote:

That sounds like it'll work, at least conceptually. But if the row contains 
primitive and non-primitive columns, the row serialization will be a mix of 
non-JSON and JSON serializations, right? Is that a good thing?


From: John Sichi [mailto:jsi...@facebook.com]
Sent: Thursday, August 26, 2010 12:11 PM
To: Steven Wong
Cc: Zheng Shao; hive-dev@hadoop.apache.org; 
Jerome Boulon
Subject: Re: Deserializing map column via JDBC (HIVE-1378)

If you replace DynamicSerDe with LazySimpleSerDe on the JDBC client side, can't 
you then tell it to expect JSON serialization for the maps?  That way you can 
leave the FetchTask server side as is.

JVS

On Aug 24, 2010, at 2:50 PM, Steven Wong wrote:


I got sidetracked for a while.

Looking at client.fetchOne, it is a call to the Hive server, which shows the 
following call stack:

SerDeUtils.getJSONString(Object, ObjectInspector) line: 205
LazySimpleSerDe.serialize(Object, ObjectInspector) line: 420
FetchTask.fetch(ArrayList) line: 130
Driver.getResults(ArrayList) line: 660
HiveServer$HiveServerHandler.fetchOne() line: 238

In other words, FetchTask.mSerde (an instance of LazySimpleSerDe) serializes 
the map column into JSON strings. It's because FetchTask.mSerde has been 
initialized by FetchTask.initialize to do it that way.

It appears that the fix is to initialize FetchTask.mSerde differently to do 
ctrl-serialization instead - presumably for the JDBC use case only and not for 
other use cases of FetchTask. Further, it appears that FetchTask.mSerde will do 
ctrl-serialization if it is initialized (via the properties "columns" and 
"columns.types") with the proper schema.

Are these right? Pointers on how to get the proper schema? (From 
FetchTask.work?) And on how to restrict the change to JDBC only? (I have no 
idea.)

For symmetry, LazySimpleSerDe should be used to do ctrl-deserialization on the 
client side, per Zheng's suggestion.
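
For illustration, a hedged sketch of initializing LazySimpleSerDe with the
"columns"/"columns.types" properties mentioned above; the column names and
types are taken from the example query in this thread, and the surrounding
class is hypothetical:

{code}
import java.util.Properties;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe;

// Sketch: hand the serde the schema via "columns"/"columns.types" so it
// does ctrl-separated (de)serialization instead of the JSON string form.
public class FetchSerDeInitSketch {
  public static LazySimpleSerDe makeSerde() throws Exception {
    Properties props = new Properties();
    props.setProperty("columns", "mapcol,bigintcol,stringcol");
    props.setProperty("columns.types", "map<string,string>,bigint,string");
    LazySimpleSerDe serde = new LazySimpleSerDe();
    serde.initialize(new Configuration(false), props);
    return serde;
  }
}
{code}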

Steven


From: Zheng Shao [mailto:zs...@facebook.com]
Sent: Monday, August 16, 2010 3:57 PM
To: Steven Wong; hive-dev@hadoop.apache.org
Cc: Jerome Boulon
Subject: RE: Deserializing map column via JDBC (HIVE-1378)

I think the call to client.fetchOne should use delimited format, so that 
DynamicSerDe can deserialize it.
This should be a good short-term fix.

Also on a higher level, DynamicSerDe is deprecated.  It will be great to use 
LazySimpleSerDe to handle all serialization/deserializations instead.

Zheng
From: Steven Wong [mailto:sw...@netflix.com]
Sent: Friday, August 13, 2010 7:02 PM
To: Zheng Shao; hive-dev@hadoop.apache.org
Cc: Jerome Boulon
Subject: Deserializing map column via JDBC (HIVE-1378)

Trying to work on HIVE-1378. My first step is to get the Hive JDBC driver to 
return actual values for mapcol in the result set of "select mapcol, bigintcol, 
stringcol from foo", where mapcol is a map column, instead of 
the current behavior of complaining that mapcol's column type is not recognized.

I changed HiveResultSetMetaData.{getColumnType,getColumnTypeName} to recognize 
the map type, but then the returned value for mapcol is always {}, even though 
mapcol does contain some key-value entries. Turns out this is happening in 
HiveQueryResultSet.next:

1.   The call to client.fetchOne returns the string "{"a":"b","x":"y"}   
123 abc".
2.   The serde (DynamicSerDe ds) deserializes the string to the list 
[{},123,"abc"].

The serde cannot correctly deserialize the map because apparently the map is 
not in the serde's expected serialization format. The serde has been 
initialized with TCTLSeparatedProtocol.

Should we make client.fetchOne return a ctrl-separated string? Or should we use 
a different serde/format?

[jira] Commented: (HIVE-1604) Patch to allow variables in Hive

2010-08-27 Thread Vaibhav Aggarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903645#action_12903645
 ] 

Vaibhav Aggarwal commented on HIVE-1604:


I think that introducing a new option '-d' is simpler than reusing 
'-hiveconf'.
We have been using this patch at Amazon Elastic MapReduce to allow variable 
substitution in Hive.
We just wanted to contribute it back.

Please feel free to use this patch for introducing variables if it meets 
expectations.
Otherwise, we can wait for HIVE-1096 to be committed.

Thanks
Vaibhav

> Patch to allow variables in Hive
> 
>
> Key: HIVE-1604
> URL: https://issues.apache.org/jira/browse/HIVE-1604
> Project: Hadoop Hive
>  Issue Type: Improvement
>  Components: CLI
>Reporter: Vaibhav Aggarwal
> Attachments: HIVE-1604.patch
>
>
> Patch to Hive which allows command line substitution.
> The patch modifies the Hive command line driver and options processor to 
> support the following arguments:
> hive [-d key=value] [-define key=value] 
>   -d        Substitution to apply to the script
>   -define   Substitution to apply to the script
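
A usage illustration, assuming the '-d key=value' form above; the '${KEY}'
substitution syntax in the query is a guess at what the patch implements,
not confirmed here:

{noformat}
hive -d DT=2010-08-27 -e 'SELECT * FROM logs WHERE ds = "${DT}"'
{noformat}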

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-1602) List Partitioning

2010-08-27 Thread Namit Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903642#action_12903642
 ] 

Namit Jain commented on HIVE-1602:
--

For the user to do this today, he will have to duplicate the column.

I think we will end up doing the same internally, but if I understand right, 
the idea is to make it transparent to the user.

> List Partitioning
> -
>
> Key: HIVE-1602
> URL: https://issues.apache.org/jira/browse/HIVE-1602
> Project: Hadoop Hive
>  Issue Type: New Feature
>Affects Versions: 0.7.0
>Reporter: Ning Zhang
>
> Dynamic partition inserts create partitions based on the dynamic partition 
> column values. Currently it creates one partition for each distinct DP column 
> value. This could result in skew in the created dynamic partitions, in that 
> some partitions are large but there could be a large number of small 
> partitions as well. This puts a burden on HDFS as well as the metastore. A 
> list partitioning scheme that aggregates a number of small partitions into 
> one big one is preferable for skewed partitions. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-1602) List Partitioning

2010-08-27 Thread Joydeep Sen Sarma (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903630#action_12903630
 ] 

Joydeep Sen Sarma commented on HIVE-1602:
-

Yikes. How is this queried afterwards?

The user can do this by applying the transformation Namit listed in the select 
clause (on the partitioning column). The user can do a one-time analysis of 
the data (for size distribution on different partitioning columns) and then 
generate the clumping logic manually.

Because this does not result in queryable data sets, it doesn't seem 
useful/reusable to me.

> List Partitioning
> -
>
> Key: HIVE-1602
> URL: https://issues.apache.org/jira/browse/HIVE-1602
> Project: Hadoop Hive
>  Issue Type: New Feature
>Affects Versions: 0.7.0
>Reporter: Ning Zhang
>
> Dynamic partition inserts create partitions based on the dynamic partition 
> column values. Currently it creates one partition for each distinct DP column 
> value. This could result in skew in the created dynamic partitions, in that 
> some partitions are large but there could be a large number of small 
> partitions as well. This puts a burden on HDFS as well as the metastore. A 
> list partitioning scheme that aggregates a number of small partitions into 
> one big one is preferable for skewed partitions. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-1602) List Partitioning

2010-08-27 Thread Namit Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903610#action_12903610
 ] 

Namit Jain commented on HIVE-1602:
--

.. we have to generate one directory per distinct DP column value - no?

That has to change for this to work.


> List Partitioning
> -
>
> Key: HIVE-1602
> URL: https://issues.apache.org/jira/browse/HIVE-1602
> Project: Hadoop Hive
>  Issue Type: New Feature
>Affects Versions: 0.7.0
>Reporter: Ning Zhang
>
> Dynamic partition inserts create partitions based on the dynamic partition 
> column values. Currently it creates one partition for each distinct DP column 
> value. This could result in skew in the created dynamic partitions, in that 
> some partitions are large but there could be a large number of small 
> partitions as well. This puts a burden on HDFS as well as the metastore. A 
> list partitioning scheme that aggregates a number of small partitions into 
> one big one is preferable for skewed partitions. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-1602) List Partitioning

2010-08-27 Thread Namit Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903608#action_12903608
 ] 

Namit Jain commented on HIVE-1602:
--

To clarify further, this is not list partitioning in the traditional database 
sense.

For a table T with columns c1 and c2 and partitioning column p, the user 
should be able to specify:

p = p1, partition name = p1
p = p2, partition name = p2
p = p3, partition name = p3
p = p4,p5,p6,p7 partition name = p4_p7
p = p8,p9,..,p100 partition name = p8_p100

But during a query, the actual values of p must be returned
(e.g., p4, and not p4_p7).


> List Partitioning
> -
>
> Key: HIVE-1602
> URL: https://issues.apache.org/jira/browse/HIVE-1602
> Project: Hadoop Hive
>  Issue Type: New Feature
>Affects Versions: 0.7.0
>Reporter: Ning Zhang
>
> Dynamic partition inserts create partitions based on the dynamic partition 
> column values. Currently it creates one partition for each distinct DP column 
> value. This could result in skew in the created dynamic partitions, in that 
> some partitions are large but there could be a large number of small 
> partitions as well. This puts a burden on HDFS as well as the metastore. A 
> list partitioning scheme that aggregates a number of small partitions into 
> one big one is preferable for skewed partitions. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-1602) List Partitioning

2010-08-27 Thread Joydeep Sen Sarma (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903606#action_12903606
 ] 

Joydeep Sen Sarma commented on HIVE-1602:
-

Hmmm - not sure I understand. How can we collapse partitions? We have to 
generate one directory per distinct DP column value - no?

(Or are you thinking of jumping straight to HAR?)

> List Partitioning
> -
>
> Key: HIVE-1602
> URL: https://issues.apache.org/jira/browse/HIVE-1602
> Project: Hadoop Hive
>  Issue Type: New Feature
>Affects Versions: 0.7.0
>Reporter: Ning Zhang
>
> Dynamic partition inserts create partitions based on the dynamic partition 
> column values. Currently it creates one partition for each distinct DP column 
> value. This could result in skew in the created dynamic partitions, in that 
> some partitions are large but there could be a large number of small 
> partitions as well. This puts a burden on HDFS as well as the metastore. A 
> list partitioning scheme that aggregates a number of small partitions into 
> one big one is preferable for skewed partitions. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-1600) need to sort hook input/output lists for test result determinism

2010-08-27 Thread Namit Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903605#action_12903605
 ] 

Namit Jain commented on HIVE-1600:
--

ok

> need to sort hook input/output lists for test result determinism
> 
>
> Key: HIVE-1600
> URL: https://issues.apache.org/jira/browse/HIVE-1600
> Project: Hadoop Hive
>  Issue Type: Bug
>  Components: Testing Infrastructure
>Affects Versions: 0.6.0
>Reporter: John Sichi
>Assignee: John Sichi
> Fix For: 0.7.0
>
> Attachments: HIVE-1600.1.patch
>
>
> Begin forwarded message:
> From: Ning Zhang 
> Date: August 26, 2010 2:47:26 PM PDT
> To: John Sichi 
> Cc: "hive-dev@hadoop.apache.org" 
> Subject: Re: failure in load_dyn_part1.q
> Yes I saw this error before but if it does not repro. So it's probably an 
> ordering issue in POSTHOOK. 
> On Aug 26, 2010, at 2:39 PM, John Sichi wrote:
> I'm seeing this failure due to a result diff when running tests on latest 
> trunk:
> POSTHOOK: Input: defa...@srcpart@ds=2008-04-08/hr=12
> POSTHOOK: Input: defa...@srcpart@ds=2008-04-09/hr=11
> POSTHOOK: Input: defa...@srcpart@ds=2008-04-09/hr=12
> -POSTHOOK: Output: defa...@nzhang_part2@ds=2008-12-31/hr=11
> -POSTHOOK: Output: defa...@nzhang_part2@ds=2008-12-31/hr=12
> POSTHOOK: Output: defa...@nzhang_part1@ds=2008-04-08/hr=11
> POSTHOOK: Output: defa...@nzhang_part1@ds=2008-04-08/hr=12
> +POSTHOOK: Output: defa...@nzhang_part2@ds=2008-12-31/hr=11
> +POSTHOOK: Output: defa...@nzhang_part2@ds=2008-12-31/hr=12
> Did something change recently?  Or are we missing a Java-level sort on the 
> input/output list for determinism?
> JVS

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (HIVE-1606) For a null value in a string column, JDBC driver returns the string "NULL"

2010-08-27 Thread Steven Wong (JIRA)
For a null value in a string column, JDBC driver returns the string "NULL"
--

 Key: HIVE-1606
 URL: https://issues.apache.org/jira/browse/HIVE-1606
 Project: Hadoop Hive
  Issue Type: Bug
  Components: Drivers
Affects Versions: 0.7.0
Reporter: Steven Wong


It should return null instead of "NULL".
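
Sketched from the JDBC contract (an illustration of the expected behavior, not 
driver code):

{code}
import java.sql.ResultSet;
import java.sql.Statement;

public class NullColumnSketch {
  // A SQL NULL should surface as a Java null (with rs.wasNull() returning
  // true afterwards), never as the four-character string "NULL".
  static void dump(Statement stmt) throws Exception {
    ResultSet rs = stmt.executeQuery("select stringcol from t");
    while (rs.next()) {
      String v = rs.getString(1);
      System.out.println(v == null ? "<SQL NULL>" : v);
    }
  }
}
{code}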

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-1605) regression and improvements in handling NULLs in joins

2010-08-27 Thread Ning Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903584#action_12903584
 ] 

Ning Zhang commented on HIVE-1605:
--

Came up with a patch, along with some other performance improvements in 
SMBMapJoinOperator. Still running tests. 

> regression and improvements in handling NULLs in joins
> --
>
> Key: HIVE-1605
> URL: https://issues.apache.org/jira/browse/HIVE-1605
> Project: Hadoop Hive
>  Issue Type: Improvement
>Reporter: Ning Zhang
>Assignee: Ning Zhang
>
> There are regressions in sort-merge map join after HIVE-741. There are a lot 
> of OOM exceptions in SMBMapJoinOperator. These are caused by the HashMap 
> maintained for each key to remember whether it is NULL, which takes too much 
> memory when the tables are large. 
> A second issue is in handling NULLs when the join key spans more than one 
> column. This appears in regular MapJoin as well as SMBMapJoin. The code only 
> checks whether all the columns are NULL; it should report no match if any 
> joined value is NULL. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (HIVE-1605) regression and improvements in handling NULLs in joins

2010-08-27 Thread Ning Zhang (JIRA)
regression and improvements in handling NULLs in joins
--

 Key: HIVE-1605
 URL: https://issues.apache.org/jira/browse/HIVE-1605
 Project: Hadoop Hive
  Issue Type: Improvement
Reporter: Ning Zhang
Assignee: Ning Zhang


There are regressions in sort-merge map join after HIVE-741. There are a lot of 
OOM exceptions in SMBMapJoinOperator. These are caused by the HashMap maintained 
for each key to remember whether it is NULL, which takes too much memory when 
the tables are large. 

A second issue is in handling NULLs when the join key spans more than one 
column. This appears in regular MapJoin as well as SMBMapJoin. The code only 
checks whether all the columns are NULL; it should report no match if any joined 
value is NULL. 
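
The intended rule for multi-column keys can be sketched as follows (an 
illustration of the semantics, not the actual patch; it assumes both sides 
have the same number of key columns):

{code}
public class JoinKeySketch {
  // A composite join key matches only if no component on either side is
  // NULL: SQL semantics say NULL compares equal to nothing, including NULL.
  public static boolean keysMatch(Object[] left, Object[] right) {
    for (int i = 0; i < left.length; i++) {
      if (left[i] == null || right[i] == null) {
        return false;
      }
      if (!left[i].equals(right[i])) {
        return false;
      }
    }
    return true;
  }
}
{code}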

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-1600) need to sort hook input/output lists for test result determinism

2010-08-27 Thread John Sichi (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903580#action_12903580
 ] 

John Sichi commented on HIVE-1600:
--

Actually it's a pain to change that, since a lot of places are already passing 
around HashSet.  So I'm going to leave that for a follow-up, since the point of 
this patch is to fix the sporadic failures that are hindering testing of other 
patches.
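
The rendering-side sort amounts to something like the following sketch 
(assuming the rendered lines are the entities' string forms; an illustration, 
not the patch itself):

{code}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

public class SortHooksSketch {
  // Render a hook entity set in a deterministic order. HashSet iteration
  // order is unspecified, so test output diffs unless we sort before printing.
  public static List<String> renderSorted(Set<?> entities) {
    List<String> lines = new ArrayList<String>();
    for (Object e : entities) {
      lines.add(e.toString());
    }
    Collections.sort(lines);
    return lines;
  }
}
{code}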


> need to sort hook input/output lists for test result determinism
> 
>
> Key: HIVE-1600
> URL: https://issues.apache.org/jira/browse/HIVE-1600
> Project: Hadoop Hive
>  Issue Type: Bug
>  Components: Testing Infrastructure
>Affects Versions: 0.6.0
>Reporter: John Sichi
>Assignee: John Sichi
> Fix For: 0.7.0
>
> Attachments: HIVE-1600.1.patch
>
>
> Begin forwarded message:
> From: Ning Zhang 
> Date: August 26, 2010 2:47:26 PM PDT
> To: John Sichi 
> Cc: "hive-dev@hadoop.apache.org" 
> Subject: Re: failure in load_dyn_part1.q
> Yes I saw this error before but if it does not repro. So it's probably an 
> ordering issue in POSTHOOK. 
> On Aug 26, 2010, at 2:39 PM, John Sichi wrote:
> I'm seeing this failure due to a result diff when running tests on latest 
> trunk:
> POSTHOOK: Input: defa...@srcpart@ds=2008-04-08/hr=12
> POSTHOOK: Input: defa...@srcpart@ds=2008-04-09/hr=11
> POSTHOOK: Input: defa...@srcpart@ds=2008-04-09/hr=12
> -POSTHOOK: Output: defa...@nzhang_part2@ds=2008-12-31/hr=11
> -POSTHOOK: Output: defa...@nzhang_part2@ds=2008-12-31/hr=12
> POSTHOOK: Output: defa...@nzhang_part1@ds=2008-04-08/hr=11
> POSTHOOK: Output: defa...@nzhang_part1@ds=2008-04-08/hr=12
> +POSTHOOK: Output: defa...@nzhang_part2@ds=2008-12-31/hr=11
> +POSTHOOK: Output: defa...@nzhang_part2@ds=2008-12-31/hr=12
> Did something change recently?  Or are we missing a Java-level sort on the 
> input/output list for determinism?
> JVS

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-1604) Patch to allow variables in Hive

2010-08-27 Thread Vaibhav Aggarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903570#action_12903570
 ] 

Vaibhav Aggarwal commented on HIVE-1604:


Hi,

I have submitted a patch that allows users to specify command-line variables in 
Hive. The patch performs variable substitution on commands issued through the 
Hive CLI.

Sample use:

hive \
  -d SAMPLE=s3://elasticmapreduce/samples/hive-ads \
  -d DATE=2009-04-13-08-05

hive>   add jar ${SAMPLE}/libs/jsonserde.jar ;

hive> create external table impressions (
hive>requestBeginTime string, requestEndTime string, hostname string
hive>  )
hive>  partitioned by (
hive>dt string
hive>  )
hive>  row format 
hive>serde 'com.amazon.elasticmapreduce.JsonSerde'
hive>with serdeproperties ( 
hive>  'paths'='requestBeginTime, requestEndTime, hostname'
hive>)
hive>  location '${SAMPLE}/tables/impressions' ;
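
The substitution itself is conceptually simple; a minimal sketch (illustrative 
only, not the code in HIVE-1604.patch):

{code}
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VarSubSketch {
  private static final Pattern VAR = Pattern.compile("\\$\\{([^}]+)\\}");

  // Replace every ${NAME} in a command with its -d/-define value,
  // leaving unknown variables untouched.
  public static String substitute(String command, Map<String, String> defs) {
    Matcher m = VAR.matcher(command);
    StringBuffer out = new StringBuffer();
    while (m.find()) {
      String val = defs.get(m.group(1));
      m.appendReplacement(out,
          Matcher.quoteReplacement(val != null ? val : m.group(0)));
    }
    m.appendTail(out);
    return out.toString();
  }
}
{code}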


> Patch to allow variables in Hive
> 
>
> Key: HIVE-1604
> URL: https://issues.apache.org/jira/browse/HIVE-1604
> Project: Hadoop Hive
>  Issue Type: Improvement
>  Components: CLI
>Reporter: Vaibhav Aggarwal
> Attachments: HIVE-1604.patch
>
>
> A patch to Hive that allows command-line substitution.
> The patch modifies the Hive command line driver and options processor to 
> support the following arguments:
> hive  [-d key=value] [-define key=value] 
>   -d        Substitution to apply to script
>   -define   Substitution to apply to script

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HIVE-1604) Patch to allow variables in Hive

2010-08-27 Thread Vaibhav Aggarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vaibhav Aggarwal updated HIVE-1604:
---

Attachment: HIVE-1604.patch

> Patch to allow variables in Hive
> 
>
> Key: HIVE-1604
> URL: https://issues.apache.org/jira/browse/HIVE-1604
> Project: Hadoop Hive
>  Issue Type: Improvement
>  Components: CLI
>Reporter: Vaibhav Aggarwal
> Attachments: HIVE-1604.patch
>
>
> A patch to Hive that allows command-line substitution.
> The patch modifies the Hive command line driver and options processor to 
> support the following arguments:
> hive  [-d key=value] [-define key=value] 
>   -d        Substitution to apply to script
>   -define   Substitution to apply to script

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-1467) dynamic partitioning should cluster by partitions

2010-08-27 Thread Ning Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903565#action_12903565
 ] 

Ning Zhang commented on HIVE-1467:
--

@Joydeep, I filed HIVE-1602 for the list partitioning. It specifically targets 
the problem of skew in partitions. 

> dynamic partitioning should cluster by partitions
> -
>
> Key: HIVE-1467
> URL: https://issues.apache.org/jira/browse/HIVE-1467
> Project: Hadoop Hive
>  Issue Type: Improvement
>Reporter: Joydeep Sen Sarma
>Assignee: Namit Jain
>
> (based on internal discussion with Ning). Dynamic partitioning should offer a 
> mode where it clusters data by partition before writing out to each 
> partition. This will reduce number of files. Details:
> 1. always use reducer stage
> 2. mapper sends to reducer based on partitioning column. ie. reducer = 
> f(partition-cols)
> 3. f() can be made somewhat smart to:
>a. spread large partitions across multiple reducers - each mapper can 
> maintain row count seen per partition - and then apply (whenever it sees a 
> new row for a partition): 
>* reducer = (row count / 64k) % numReducers 
>Small partitions always go to one reducer. the larger the partition, 
> the more the reducers. this prevents one reducer becoming a bottleneck writing 
> out one partition
>b. this still leaves the issue of very large number of splits. (64K rows 
> from 10K mappers is pretty large). for this one can apply one slight 
> modification:
>* reducer = (mapper-id/1024 + row-count/64k) % numReducers
>ie. - the first 1000 mappers always send the first 64K rows for one 
> partition to the same reducer. the next 1000 send it to the next one. and so 
> on.
> the constants 1024 and 64k are used just as an example. i don't know what the 
> right numbers are. it's also clear that this is a case where we need hadoop 
> to do only partitioning (and no sorting). this will be a useful feature to 
> have in hadoop. that will reduce the overhead due to reducers.
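
Read literally, the f() in the quoted description would be something like this 
sketch (assuming each mapper keeps a per-partition row count; 1024 and 64K are 
just the example constants from the description):

{code}
public class DpReducerSketch {
  public static int chooseReducer(int mapperId, long rowsSeenForPartition,
                                  int numReducers) {
    // Blocks of ~1000 mappers rotate together, and every 64K rows of a
    // large partition spill to the next reducer, spreading out the skew.
    return (int) ((mapperId / 1024 + rowsSeenForPartition / 65536)
        % numReducers);
  }
}
{code}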

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (HIVE-1604) Patch to allow variables in Hive

2010-08-27 Thread Vaibhav Aggarwal (JIRA)
Patch to allow variables in Hive


 Key: HIVE-1604
 URL: https://issues.apache.org/jira/browse/HIVE-1604
 Project: Hadoop Hive
  Issue Type: Improvement
  Components: CLI
Reporter: Vaibhav Aggarwal


A patch to Hive that allows command-line substitution.

The patch modifies the Hive command line driver and options processor to 
support the following arguments:

hive  [-d key=value] [-define key=value] 

  -d        Substitution to apply to script
  -define   Substitution to apply to script


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HIVE-1603) support CSV text file format

2010-08-27 Thread Ning Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ning Zhang updated HIVE-1603:
-

Description: 
Comma Separated Values (CSV) text format is commonly used for exchanging 
relational data between heterogeneous systems. Currently Hive uses the TextFile 
format when displaying query results. This could cause confusion when column 
values contain new lines or tabs. A CSVTextFile format could get around this 
problem. This will require a new CSVTextInputFormat, CSVTextOutputFormat, and 
CSVSerDe. 

A proposed use case:

{code}
-- exporting a table to CSV files in a directory
hive> set hive.io.output.fileformat=CSVTextFile;
hive> insert overwrite local directory '/tmp/CSVrepos/' select * from S where 
... ;

-- query result in CSV
hive -e 'set hive.io.output.fileformat=CSVTextFile; select * from T;' | 
sql_loader_to_other_systems

-- query CSV files directory from Hive
hive> create table T (...) stored as CSVTextFile;
hive> load data local inpath '/my/CSVfiles' into table T;
hive> select * from T where ...;
{code}

  was:
Comma Separated Values (CSV) text format are commonly used in exchanging 
relational data between heterogeneous systems. Currently Hive uses TextFile 
format when displaying query results. This could cause confusions when column 
values contain new lines or tabs. A CSVTextFile format could get around this 
problem. This will require a new CSVTextInputFormat, CSVTextOutputFormat, and 
CSVSerDe. 

A proposed use case is like:

{code}
-- exporting a table to CSV files in a directory
hive> set hive.io.output.fileformat=CSVTextFile;
hive> insert overwrite local directory '/tmp/CSVrepos/' select * from S where 
... ;

-- query result in CSV
hive -e 'set hive.io.output.fileformat=CSVTextFile; select * from T;' | 
sql_loader_to_other_systems
{code}


> support CSV text file format
> 
>
> Key: HIVE-1603
> URL: https://issues.apache.org/jira/browse/HIVE-1603
> Project: Hadoop Hive
>  Issue Type: New Feature
>Affects Versions: 0.7.0
>Reporter: Ning Zhang
>
> Comma Separated Values (CSV) text format is commonly used for exchanging 
> relational data between heterogeneous systems. Currently Hive uses the 
> TextFile format when displaying query results. This could cause confusion 
> when column values contain new lines or tabs. A CSVTextFile format could get 
> around this problem. This will require a new CSVTextInputFormat, 
> CSVTextOutputFormat, and CSVSerDe. 
> A proposed use case:
> {code}
> -- exporting a table to CSV files in a directory
> hive> set hive.io.output.fileformat=CSVTextFile;
> hive> insert overwrite local directory '/tmp/CSVrepos/' select * from S where 
> ... ;
> -- query result in CSV
> hive -e 'set hive.io.output.fileformat=CSVTextFile; select * from T;' | 
> sql_loader_to_other_systems
> -- query CSV files directory from Hive
> hive> create table T (...) stored as CSVTextFile;
> hive> load data local inpath '/my/CSVfiles' into table T;
> hive> select * from T where ...;
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (HIVE-1603) support CSV text file format

2010-08-27 Thread Ning Zhang (JIRA)
support CSV text file format


 Key: HIVE-1603
 URL: https://issues.apache.org/jira/browse/HIVE-1603
 Project: Hadoop Hive
  Issue Type: New Feature
Affects Versions: 0.7.0
Reporter: Ning Zhang


Comma Separated Values (CSV) text format is commonly used for exchanging 
relational data between heterogeneous systems. Currently Hive uses the TextFile 
format when displaying query results. This could cause confusion when column 
values contain new lines or tabs. A CSVTextFile format could get around this 
problem. This will require a new CSVTextInputFormat, CSVTextOutputFormat, and 
CSVSerDe. 

A proposed use case:

{code}
-- exporting a table to CSV files in a directory
hive> set hive.io.output.fileformat=CSVTextFile;
hive> insert overwrite local directory '/tmp/CSVrepos/' select * from S where 
... ;

-- query result in CSV
hive -e 'set hive.io.output.fileformat=CSVTextFile; select * from T;' | 
sql_loader_to_other_systems
{code}
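
For concreteness, the field escaping a CSVSerDe would need looks roughly like 
this RFC 4180-style sketch (illustrative only; the issue notes such a serde 
does not exist yet):

{code}
public class CsvEscapeSketch {
  // Quote any field containing a delimiter, quote, tab, or line break,
  // doubling embedded quotes; plain TextFile output would garble these.
  public static String escape(String field) {
    if (field == null) {
      return "";
    }
    if (field.indexOf(',') >= 0 || field.indexOf('"') >= 0
        || field.indexOf('\n') >= 0 || field.indexOf('\t') >= 0) {
      return "\"" + field.replace("\"", "\"\"") + "\"";
    }
    return field;
  }
}
{code}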

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-1467) dynamic partitioning should cluster by partitions

2010-08-27 Thread Joydeep Sen Sarma (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903562#action_12903562
 ] 

Joydeep Sen Sarma commented on HIVE-1467:
-

@Ning - what about skew?

> dynamic partitioning should cluster by partitions
> -
>
> Key: HIVE-1467
> URL: https://issues.apache.org/jira/browse/HIVE-1467
> Project: Hadoop Hive
>  Issue Type: Improvement
>Reporter: Joydeep Sen Sarma
>Assignee: Namit Jain
>
> (based on internal discussion with Ning). Dynamic partitioning should offer a 
> mode where it clusters data by partition before writing out to each 
> partition. This will reduce number of files. Details:
> 1. always use reducer stage
> 2. mapper sends to reducer based on partitioning column. ie. reducer = 
> f(partition-cols)
> 3. f() can be made somewhat smart to:
>a. spread large partitions across multiple reducers - each mapper can 
> maintain row count seen per partition - and then apply (whenever it sees a 
> new row for a partition): 
>* reducer = (row count / 64k) % numReducers 
>Small partitions always go to one reducer. the larger the partition, 
> the more the reducers. this prevents one reducer becoming a bottleneck writing 
> out one partition
>b. this still leaves the issue of very large number of splits. (64K rows 
> from 10K mappers is pretty large). for this one can apply one slight 
> modification:
>* reducer = (mapper-id/1024 + row-count/64k) % numReducers
>ie. - the first 1000 mappers always send the first 64K rows for one 
> partition to the same reducer. the next 1000 send it to the next one. and so 
> on.
> the constants 1024 and 64k are used just as an example. i don't know what the 
> right numbers are. it's also clear that this is a case where we need hadoop 
> to do only partitioning (and no sorting). this will be a useful feature to 
> have in hadoop. that will reduce the overhead due to reducers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-1600) need to sort hook input/output lists for test result determinism

2010-08-27 Thread John Sichi (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903554#action_12903554
 ] 

John Sichi commented on HIVE-1600:
--

Will do.  I'll keep the sorting during rendering (even though it will be 
redundant) as well, since there's no formal contract regarding sorting in the 
hook interface.


> need to sort hook input/output lists for test result determinism
> 
>
> Key: HIVE-1600
> URL: https://issues.apache.org/jira/browse/HIVE-1600
> Project: Hadoop Hive
>  Issue Type: Bug
>  Components: Testing Infrastructure
>Affects Versions: 0.6.0
>Reporter: John Sichi
>Assignee: John Sichi
> Fix For: 0.7.0
>
> Attachments: HIVE-1600.1.patch
>
>
> Begin forwarded message:
> From: Ning Zhang 
> Date: August 26, 2010 2:47:26 PM PDT
> To: John Sichi 
> Cc: "hive-dev@hadoop.apache.org" 
> Subject: Re: failure in load_dyn_part1.q
> Yes I saw this error before but if it does not repro. So it's probably an 
> ordering issue in POSTHOOK. 
> On Aug 26, 2010, at 2:39 PM, John Sichi wrote:
> I'm seeing this failure due to a result diff when running tests on latest 
> trunk:
> POSTHOOK: Input: defa...@srcpart@ds=2008-04-08/hr=12
> POSTHOOK: Input: defa...@srcpart@ds=2008-04-09/hr=11
> POSTHOOK: Input: defa...@srcpart@ds=2008-04-09/hr=12
> -POSTHOOK: Output: defa...@nzhang_part2@ds=2008-12-31/hr=11
> -POSTHOOK: Output: defa...@nzhang_part2@ds=2008-12-31/hr=12
> POSTHOOK: Output: defa...@nzhang_part1@ds=2008-04-08/hr=11
> POSTHOOK: Output: defa...@nzhang_part1@ds=2008-04-08/hr=12
> +POSTHOOK: Output: defa...@nzhang_part2@ds=2008-12-31/hr=11
> +POSTHOOK: Output: defa...@nzhang_part2@ds=2008-12-31/hr=12
> Did something change recently?  Or are we missing a Java-level sort on the 
> input/output list for determinism?
> JVS

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HIVE-1601) Hadoop 0.17 ant test broken by HIVE-1523

2010-08-27 Thread John Sichi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Sichi updated HIVE-1601:
-

  Status: Resolved  (was: Patch Available)
Hadoop Flags: [Reviewed]
  Resolution: Fixed

Committed.  Thanks Joydeep!


> Hadoop 0.17 ant test broken by HIVE-1523
> 
>
> Key: HIVE-1601
> URL: https://issues.apache.org/jira/browse/HIVE-1601
> Project: Hadoop Hive
>  Issue Type: Bug
>  Components: Testing Infrastructure
>Affects Versions: 0.7.0
>Reporter: John Sichi
>Assignee: Joydeep Sen Sarma
> Fix For: 0.7.0
>
> Attachments: 1601.1.patch, ant-contrib-1.0b3.jar
>
>
> compile-test:
>[javac] /data/users/jsichi/open/hive-trunk/build-common.xml:304: warning: 
> 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set 
> to false for repeatable builds
>[javac] Compiling 33 source files to 
> /data/users/jsichi/open/hive-trunk/build/ql/test/classes
> BUILD FAILED
> /data/users/jsichi/open/hive-trunk/build.xml:168: The following error 
> occurred while executing this line:
> /data/users/jsichi/open/hive-trunk/build.xml:105: The following error 
> occurred while executing this line:
> /data/users/jsichi/open/hive-trunk/build-common.xml:304: 
> /data/users/jsichi/open/hive-trunk/build/hadoopcore/hadoop-0.17.2.1/lib/jsp-2.1
>  does not exist.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-1427) Provide metastore schema migration scripts (0.5 -> 0.6)

2010-08-27 Thread John Sichi (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903547#action_12903547
 ] 

John Sichi commented on HIVE-1427:
--

Also need a script corresponding to this change from HIVE-675.

(The diff here was an XML change to the metastore model from HIVE-675; the 
markup was stripped in the plain-text archive, leaving only the +/- markers.)


> Provide metastore schema migration scripts (0.5 -> 0.6)
> ---
>
> Key: HIVE-1427
> URL: https://issues.apache.org/jira/browse/HIVE-1427
> Project: Hadoop Hive
>  Issue Type: Task
>  Components: Metastore
>Reporter: Carl Steinbach
>Assignee: Carl Steinbach
> Fix For: 0.6.0
>
>
> At a minimum this ticket covers packaging up example MySQL migration scripts 
> (cumulative across all schema changes from 0.5 to 0.6) and explaining what to 
> do with them in the release notes.
> This is also probably a good point at which to decide and clearly state which 
> Metastore DBs we officially support in production, e.g. do we need to provide 
> migration scripts for Derby?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (HIVE-1602) List Partitioning

2010-08-27 Thread Ning Zhang (JIRA)
List Partitioning
-

 Key: HIVE-1602
 URL: https://issues.apache.org/jira/browse/HIVE-1602
 Project: Hadoop Hive
  Issue Type: New Feature
Affects Versions: 0.7.0
Reporter: Ning Zhang


Dynamic partition inserts create partitions based on the dynamic partition 
column values. Currently it creates one partition for each distinct DP column 
value. This could result in skew in the created dynamic partitions: some 
partitions are large, but there could also be a large number of small 
partitions. This burdens HDFS as well as the metastore. A list partitioning 
scheme that aggregates a number of small partitions into one big one would be 
preferable for skewed partitions. 


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-1467) dynamic partitioning should cluster by partitions

2010-08-27 Thread Ning Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903533#action_12903533
 ] 

Ning Zhang commented on HIVE-1467:
--

As discussed with Ashish offline, it seems appropriate to support list 
partitioning now if we can sort the partition column and distribute the rows to 
the reducers to write. Will open a new JIRA and comment there. 

> dynamic partitioning should cluster by partitions
> -
>
> Key: HIVE-1467
> URL: https://issues.apache.org/jira/browse/HIVE-1467
> Project: Hadoop Hive
>  Issue Type: Improvement
>Reporter: Joydeep Sen Sarma
>Assignee: Namit Jain
>
> (based on internal discussion with Ning). Dynamic partitioning should offer a 
> mode where it clusters data by partition before writing out to each 
> partition. This will reduce number of files. Details:
> 1. always use reducer stage
> 2. mapper sends to reducer based on partitioning column. ie. reducer = 
> f(partition-cols)
> 3. f() can be made somewhat smart to:
>a. spread large partitions across multiple reducers - each mapper can 
> maintain row count seen per partition - and then apply (whenever it sees a 
> new row for a partition): 
>* reducer = (row count / 64k) % numReducers 
>Small partitions always go to one reducer. the larger the partition, 
> the more the reducers. this prevents one reducer becoming a bottleneck writing 
> out one partition
>b. this still leaves the issue of very large number of splits. (64K rows 
> from 10K mappers is pretty large). for this one can apply one slight 
> modification:
>* reducer = (mapper-id/1024 + row-count/64k) % numReducers
>ie. - the first 1000 mappers always send the first 64K rows for one 
> partition to the same reducer. the next 1000 send it to the next one. and so 
> on.
> the constants 1024 and 64k are used just as an example. i don't know what the 
> right numbers are. it's also clear that this is a case where we need hadoop 
> to do only partitioning (and no sorting). this will be a useful feature to 
> have in hadoop. that will reduce the overhead due to reducers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HIVE-1591) Lock the database also as part of locking a table/partition

2010-08-27 Thread John Sichi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Sichi updated HIVE-1591:
-

Affects Version/s: 0.7.0

> Lock the database also as part of locking a table/partition
> ---
>
> Key: HIVE-1591
> URL: https://issues.apache.org/jira/browse/HIVE-1591
> Project: Hadoop Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 0.7.0
>Reporter: Namit Jain
>Assignee: Namit Jain
> Fix For: 0.7.0
>
> Attachments: hive.1591.1.patch
>
>
> Drop database should fail if a table in that database is being queried.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HIVE-1591) Lock the database also as part of locking a table/partition

2010-08-27 Thread John Sichi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Sichi updated HIVE-1591:
-

Status: Open  (was: Patch Available)

> Lock the database also as part of locking a table/partition
> ---
>
> Key: HIVE-1591
> URL: https://issues.apache.org/jira/browse/HIVE-1591
> Project: Hadoop Hive
>  Issue Type: Bug
>  Components: Query Processor
>Reporter: Namit Jain
>Assignee: Namit Jain
> Fix For: 0.7.0
>
> Attachments: hive.1591.1.patch
>
>
> Drop database should fail if a table in that database is being queried.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-1591) Lock the database also as part of locking a table/partition

2010-08-27 Thread John Sichi (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903517#action_12903517
 ] 

John Sichi commented on HIVE-1591:
--

I applied this patch and tried the following scenario.

Client 1:  insert overwrite table pokes2 select * from pokes;

Client 2:  show locks;

Client 2 is getting an ArrayIndexOutOfBoundsException:  1.

Without the patch, show locks works fine.

Besides addressing this issue, two other items:

* the cause of the exception is getting swallowed, so it never makes it to 
hive.log, due to the code below in DDLTask.showLocks.  It should pass e along 
as the cause argument to the 2-arg HiveException constructor so that there 
will be a "Caused by" in the stack dump (see the sketch after this list).

    } catch (Exception e) {
      throw new HiveException(e.toString());
    }

* we really need tests for actual concurrency
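
For concreteness, a sketch of the suggested fix to the catch block (using the 
2-arg constructor mentioned above):

{code}
    } catch (Exception e) {
      // Pass e as the cause so "Caused by" shows up in the hive.log stack dump.
      throw new HiveException(e.toString(), e);
    }
{code}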


> Lock the database also as part of locking a table/partition
> ---
>
> Key: HIVE-1591
> URL: https://issues.apache.org/jira/browse/HIVE-1591
> Project: Hadoop Hive
>  Issue Type: Bug
>  Components: Query Processor
>Reporter: Namit Jain
>Assignee: Namit Jain
> Fix For: 0.7.0
>
> Attachments: hive.1591.1.patch
>
>
> Drop database should fail if a table in that database is being queried.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Deserializing map column via JDBC (HIVE-1378)

2010-08-27 Thread John Sichi
I don't know enough about the serdes to say whether that's a problem...maybe 
someone else does?  It seems like as long as the JSON form doesn't include the 
delimiter unescaped, it might work?

JVS

On Aug 26, 2010, at 6:29 PM, Steven Wong wrote:

That sounds like it'll work, at least conceptually. But if the row contains 
primitive and non-primitive columns, the row serialization will be a mix of 
non-JSON and JSON serializations, right? Is that a good thing?


From: John Sichi [mailto:jsi...@facebook.com]
Sent: Thursday, August 26, 2010 12:11 PM
To: Steven Wong
Cc: Zheng Shao; hive-dev@hadoop.apache.org; 
Jerome Boulon
Subject: Re: Deserializing map column via JDBC (HIVE-1378)

If you replace DynamicSerDe with LazySimpleSerDe on the JDBC client side, can't 
you then tell it to expect JSON serialization for the maps?  That way you can 
leave the FetchTask server side as is.

JVS

On Aug 24, 2010, at 2:50 PM, Steven Wong wrote:


I got sidetracked for a while.

Looking at client.fetchOne, it is a call to the Hive server, which shows the 
following call stack:

SerDeUtils.getJSONString(Object, ObjectInspector) line: 205
LazySimpleSerDe.serialize(Object, ObjectInspector) line: 420
FetchTask.fetch(ArrayList) line: 130
Driver.getResults(ArrayList) line: 660
HiveServer$HiveServerHandler.fetchOne() line: 238

In other words, FetchTask.mSerde (an instance of LazySimpleSerDe) serializes 
the map column into JSON strings. It's because FetchTask.mSerde has been 
initialized by FetchTask.initialize to do it that way.

It appears that the fix is to initialize FetchTask.mSerde differently to do 
ctrl-serialization instead, presumably for the JDBC use case only and not for 
other use cases of FetchTask. Further, it appears that FetchTask.mSerde will do 
ctrl-serialization if it is initialized (via the properties "columns" and 
"columns.types") with the proper schema.

Are these right? Pointers on how to get the proper schema? (From 
FetchTask.work?) And on how to restrict the change to JDBC only? (I have no 
idea.)

For symmetry, LazySimpleSerDe should be used to do ctrl-deserialization on the 
client side, per Zheng's suggestion.

Steven
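
A minimal sketch of the schema-based initialization described above, using the 
"columns" and "columns.types" properties; the names and types here are just 
the ones from the example query, and in reality the schema would have to come 
from FetchTask / the query plan:

{code}
import java.util.Properties;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe;

public class FetchSerDeSketch {
  public static LazySimpleSerDe makeSerDe() throws Exception {
    Properties props = new Properties();
    // With an explicit schema, the serde uses ctrl-separated serialization
    // for the map column instead of rendering it as a JSON string.
    props.setProperty("columns", "mapcol,bigintcol,stringcol");
    props.setProperty("columns.types", "map<string,string>:bigint:string");
    LazySimpleSerDe serde = new LazySimpleSerDe();
    serde.initialize(new Configuration(), props);
    return serde;
  }
}
{code}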


From: Zheng Shao [mailto:zs...@facebook.com]
Sent: Monday, August 16, 2010 3:57 PM
To: Steven Wong; hive-dev@hadoop.apache.org
Cc: Jerome Boulon
Subject: RE: Deserializing map column via JDBC (HIVE-1378)

I think the call to client.fetchOne should use delimited format, so that 
DynamicSerDe can deserialize it.
This should be a good short-term fix.

Also on a higher level, DynamicSerDe is deprecated.  It will be great to use 
LazySimpleSerDe to handle all serialization/deserializations instead.

Zheng
From: Steven Wong [mailto:sw...@netflix.com]
Sent: Friday, August 13, 2010 7:02 PM
To: Zheng Shao; hive-dev@hadoop.apache.org
Cc: Jerome Boulon
Subject: Deserializing map column via JDBC (HIVE-1378)

Trying to work on HIVE-1378. My first step is to get the Hive JDBC driver to 
return actual values for mapcol in the result set of "select mapcol, bigintcol, 
stringcol from foo", where mapcol is a map column, instead of the current 
behavior of complaining that mapcol's column type is not recognized.

I changed HiveResultSetMetaData.{getColumnType,getColumnTypeName} to recognize 
the map type, but then the returned value for mapcol is always {}, even though 
mapcol does contain some key-value entries. Turns out this is happening in 
HiveQueryResultSet.next:

1.   The call to client.fetchOne returns the string "{"a":"b","x":"y"}   
123 abc".
2.   The serde (DynamicSerDe ds) deserializes the string to the list 
[{},123,"abc"].

The serde cannot correctly deserialize the map because apparently the map is 
not in the serde's expected serialization format. The serde has been 
initialized with TCTLSeparatedProtocol.

Should we make client.fetchOne return a ctrl-separated string? Or should we use 
a different serde/format in HiveQueryResultSet? It seems the first way is 
right; correct me if that's wrong. And how do we do that?

Thanks.
Steven





[jira] Assigned: (HIVE-1570) referencing an added file by its name in a transform script does not work in hive local mode

2010-08-27 Thread John Sichi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Sichi reassigned HIVE-1570:


Assignee: Joydeep Sen Sarma

> referencing an added file by its name in a transform script does not work in 
> hive local mode
> -
>
> Key: HIVE-1570
> URL: https://issues.apache.org/jira/browse/HIVE-1570
> Project: Hadoop Hive
>  Issue Type: Bug
>  Components: Query Processor
>Reporter: Joydeep Sen Sarma
>Assignee: Joydeep Sen Sarma
>
> Yongqiang tried this and it fails in local mode:
> add file ../data/scripts/dumpdata_script.py;
> select count(distinct subq.key) from
> (FROM src MAP src.key USING 'python dumpdata_script.py' AS key WHERE src.key 
> = 10) subq;
> this needs to be fixed because it means we cannot choose local mode 
> automatically in case of transform scripts (since different paths need to be 
> used for cluster vs. local mode execution)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (HIVE-1467) dynamic partitioning should cluster by partitions

2010-08-27 Thread Namit Jain (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Namit Jain reassigned HIVE-1467:


Assignee: Namit Jain

> dynamic partitioning should cluster by partitions
> -
>
> Key: HIVE-1467
> URL: https://issues.apache.org/jira/browse/HIVE-1467
> Project: Hadoop Hive
>  Issue Type: Improvement
>Reporter: Joydeep Sen Sarma
>Assignee: Namit Jain
>
> (based on internal discussion with Ning). Dynamic partitioning should offer a 
> mode where it clusters data by partition before writing out to each 
> partition. This will reduce number of files. Details:
> 1. always use reducer stage
> 2. mapper sends to reducer based on partitioning column. ie. reducer = 
> f(partition-cols)
> 3. f() can be made somewhat smart to:
>a. spread large partitions across multiple reducers - each mapper can 
> maintain row count seen per partition - and then apply (whenever it sees a 
> new row for a partition): 
>* reducer = (row count / 64k) % numReducers 
>Small partitions always go to one reducer. the larger the partition, 
> the more the reducers. this prevents one reducer becoming a bottleneck writing 
> out one partition
>b. this still leaves the issue of very large number of splits. (64K rows 
> from 10K mappers is pretty large). for this one can apply one slight 
> modification:
>* reducer = (mapper-id/1024 + row-count/64k) % numReducers
>ie. - the first 1000 mappers always send the first 64K rows for one 
> partition to the same reducer. the next 1000 send it to the next one. and so 
> on.
> the constants 1024 and 64k are used just as an example. i don't know what the 
> right numbers are. it's also clear that this is a case where we need hadoop 
> to do only partitioning (and no sorting). this will be a useful feature to 
> have in hadoop. that will reduce the overhead due to reducers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-1601) Hadoop 0.17 ant test broken by HIVE-1523

2010-08-27 Thread John Sichi (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903484#action_12903484
 ] 

John Sichi commented on HIVE-1601:
--

+1.  Will commit when tests pass.


> Hadoop 0.17 ant test broken by HIVE-1523
> 
>
> Key: HIVE-1601
> URL: https://issues.apache.org/jira/browse/HIVE-1601
> Project: Hadoop Hive
>  Issue Type: Bug
>  Components: Testing Infrastructure
>Affects Versions: 0.7.0
>Reporter: John Sichi
>Assignee: Joydeep Sen Sarma
> Fix For: 0.7.0
>
> Attachments: 1601.1.patch, ant-contrib-1.0b3.jar
>
>
> compile-test:
>[javac] /data/users/jsichi/open/hive-trunk/build-common.xml:304: warning: 
> 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set 
> to false for repeatable builds
>[javac] Compiling 33 source files to 
> /data/users/jsichi/open/hive-trunk/build/ql/test/classes
> BUILD FAILED
> /data/users/jsichi/open/hive-trunk/build.xml:168: The following error 
> occurred while executing this line:
> /data/users/jsichi/open/hive-trunk/build.xml:105: The following error 
> occurred while executing this line:
> /data/users/jsichi/open/hive-trunk/build-common.xml:304: 
> /data/users/jsichi/open/hive-trunk/build/hadoopcore/hadoop-0.17.2.1/lib/jsp-2.1
>  does not exist.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.




[jira] Updated: (HIVE-1601) Hadoop 0.17 ant test broken by HIVE-1523

2010-08-27 Thread Joydeep Sen Sarma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joydeep Sen Sarma updated HIVE-1601:


Attachment: 1601.1.patch
ant-contrib-1.0b3.jar

- fix the jsp include
- don't run minimr in 0.17 - it doesn't work
- added ant-contrib jar (attached as a separate file). it's very useful for 
writing ant conditions (we can simplify a bunch of other stuff with it)

> Hadoop 0.17 ant test broken by HIVE-1523
> 
>
> Key: HIVE-1601
> URL: https://issues.apache.org/jira/browse/HIVE-1601
> Project: Hadoop Hive
>  Issue Type: Bug
>  Components: Testing Infrastructure
>Affects Versions: 0.7.0
>Reporter: John Sichi
>Assignee: Joydeep Sen Sarma
> Fix For: 0.7.0
>
> Attachments: 1601.1.patch, ant-contrib-1.0b3.jar
>
>
> compile-test:
>[javac] /data/users/jsichi/open/hive-trunk/build-common.xml:304: warning: 
> 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set 
> to false for repeatable builds
>[javac] Compiling 33 source files to 
> /data/users/jsichi/open/hive-trunk/build/ql/test/classes
> BUILD FAILED
> /data/users/jsichi/open/hive-trunk/build.xml:168: The following error 
> occurred while executing this line:
> /data/users/jsichi/open/hive-trunk/build.xml:105: The following error 
> occurred while executing this line:
> /data/users/jsichi/open/hive-trunk/build-common.xml:304: 
> /data/users/jsichi/open/hive-trunk/build/hadoopcore/hadoop-0.17.2.1/lib/jsp-2.1
>  does not exist.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HIVE-1601) Hadoop 0.17 ant test broken by HIVE-1523

2010-08-27 Thread Joydeep Sen Sarma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joydeep Sen Sarma updated HIVE-1601:


Status: Patch Available  (was: Open)

> Hadoop 0.17 ant test broken by HIVE-1523
> 
>
> Key: HIVE-1601
> URL: https://issues.apache.org/jira/browse/HIVE-1601
> Project: Hadoop Hive
>  Issue Type: Bug
>  Components: Testing Infrastructure
>Affects Versions: 0.7.0
>Reporter: John Sichi
>Assignee: Joydeep Sen Sarma
> Fix For: 0.7.0
>
> Attachments: 1601.1.patch, ant-contrib-1.0b3.jar
>
>
> compile-test:
>[javac] /data/users/jsichi/open/hive-trunk/build-common.xml:304: warning: 
> 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set 
> to false for repeatable builds
>[javac] Compiling 33 source files to 
> /data/users/jsichi/open/hive-trunk/build/ql/test/classes
> BUILD FAILED
> /data/users/jsichi/open/hive-trunk/build.xml:168: The following error 
> occurred while executing this line:
> /data/users/jsichi/open/hive-trunk/build.xml:105: The following error 
> occurred while executing this line:
> /data/users/jsichi/open/hive-trunk/build-common.xml:304: 
> /data/users/jsichi/open/hive-trunk/build/hadoopcore/hadoop-0.17.2.1/lib/jsp-2.1
>  does not exist.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.