[jira] [Created] (HIVE-19256) UDF which shapes the input data according to the specified schema

2018-04-20 Thread Ratandeep Ratti (JIRA)
Ratandeep Ratti created HIVE-19256:
--

 Summary: UDF which shapes the input data according to the 
specified schema
 Key: HIVE-19256
 URL: https://issues.apache.org/jira/browse/HIVE-19256
 Project: Hive
  Issue Type: New Feature
Reporter: Ratandeep Ratti
Assignee: Ratandeep Ratti


We use this UDF a lot in our org. This UDF takes an object and a Hive schema 
and makes sure the output object matches the schema completely. In some respects 
it is similar to the {{named_struct}} UDF, which can be used to select columns 
from a struct, but it is more general since it works not only on structs but on 
all Hive data types (except union). The schema can also apply certain valid type 
conversions (e.g. int -> double).

One scenario where this is quite useful is making sure that a Hive view created 
with a specific schema will always expose columns matching that schema. Today, 
when a view is created, new nested columns added to the underlying table can 
leak out through the view, even though the user never asked for this behavior. 
Note that this leaking only happens for nested columns and not for top-level 
columns, so in that regard Hive's behavior is inconsistent.

Sample usage of the UDF
{code}
generic_project(col, "struct>>") // Returning 
data which matches the input schema. Here extra columns which are not part of 
the input will be removed

generic_project(col, "struct") //  If the input column had a struct 
with col a as int . It would type cast 'a' to double.
{code}
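
For context, a rough sketch of how the schema argument could be handled inside 
such a UDF using Hive's existing TypeInfo utilities (the helper below is 
illustrative, not the actual implementation):

{code}
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils;

// Hypothetical helper for such a GenericUDF's initialize(): turn the schema
// string argument into the ObjectInspector that describes the UDF's output.
static ObjectInspector outputInspectorFor(String schemaString) {
  // e.g. schemaString = "struct<a:double>"
  TypeInfo targetType = TypeInfoUtils.getTypeInfoFromTypeString(schemaString);
  return TypeInfoUtils.getStandardWritableObjectInspectorFromTypeInfo(targetType);
}

// evaluate() would then walk the input object with its own ObjectInspector,
// keeping only the fields present in targetType and applying the allowed
// primitive conversions (e.g. int -> double).
{code}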





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-18410) [Performance][Avro] Reading flat Avro tables is very expensive in Hive

2018-01-08 Thread Ratandeep Ratti (JIRA)
Ratandeep Ratti created HIVE-18410:
--

 Summary: [Performance][Avro] Reading flat Avro tables is very 
expensive in Hive
 Key: HIVE-18410
 URL: https://issues.apache.org/jira/browse/HIVE-18410
 Project: Hive
  Issue Type: Improvement
Reporter: Ratandeep Ratti
Assignee: Ratandeep Ratti


There's a performance penalty when reading flat (no nested fields) Avro tables. 
Reading the same flat dataset in Pig takes half the time. On profiling, a lot of 
time is spent in {{AvroDeserializer.deserializeSingleItemNullableUnion()}}. The 
bulk of that time goes to GenericData.get().resolveUnion(), which calls 
GenericData.getSchemaName(Object datum), which does a lot of instanceof checks. 
This could be simplified, with performance benefits. An approach is described in 
this patch which almost halves the runtime.
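
As a rough illustration of the kind of simplification possible (not necessarily 
what the attached patch does), the common [null, T] union produced by a nullable 
column can be resolved with a plain null check instead of resolveUnion():

{code}
import java.util.List;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;

// Hypothetical helper: pick the union branch for a nullable union without
// going through GenericData.resolveUnion()'s instanceof-heavy getSchemaName().
static int unionBranchIndex(Schema unionSchema, Object datum) {
  List<Schema> branches = unionSchema.getTypes();
  if (branches.size() == 2) {
    boolean firstIsNull = branches.get(0).getType() == Schema.Type.NULL;
    boolean secondIsNull = branches.get(1).getType() == Schema.Type.NULL;
    if (firstIsNull != secondIsNull) {
      // Exactly one null branch: the datum's branch follows from a null check.
      int nullIndex = firstIsNull ? 0 : 1;
      return datum == null ? nullIndex : 1 - nullIndex;
    }
  }
  // Fall back to Avro's generic resolution for anything more complex.
  return GenericData.get().resolveUnion(unionSchema, datum);
}
{code}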



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (HIVE-17394) AvroSerde is regenerating TypeInfo objects for each nullable Avro field in a row

2017-08-28 Thread Ratandeep Ratti (JIRA)
Ratandeep Ratti created HIVE-17394:
--

 Summary: AvroSerde is regenerating TypeInfo objects for each 
nullable Avro field in a row
 Key: HIVE-17394
 URL: https://issues.apache.org/jira/browse/HIVE-17394
 Project: Hive
  Issue Type: Bug
Affects Versions: 1.1.0
Reporter: Ratandeep Ratti


The following methods in {{AvroDeserializer}} keep regenerating TypeInfo 
objects for every nullable field in a row:

{code}
private Object deserializeNullableUnion(Object datum, Schema fileSchema,
    Schema recordSchema) throws AvroSerdeException {
  // elided
  line 312:  return worker(datum, fileSchema, newRecordSchema,
      SchemaToTypeInfo.generateTypeInfo(newRecordSchema, null));
}
..
private Object deserializeSingleItemNullableUnion(Object datum, Schema fileSchema,
    Schema recordSchema) {
  // elided
  line 357: return worker(datum, currentFileSchema, schema,
      SchemaToTypeInfo.generateTypeInfo(schema, null));
}
{code}

This is really bad in terms of performance. I'm not sure why we don't use the 
TypeInfo we already have instead of generating it again for each nullable field. 
If you look at the {{worker}} method, which calls {{deserializeNullableUnion}}, 
the TypeInfo corresponding to the nullable column is already determined. It is 
not clear why we have to determine that information again.

Moreover, the cache in SchemaToTypeInfo does not help in the nullable Avro 
record case, because checking whether an Avro record schema object already 
exists in the cache requires traversing all the fields in the record schema.

I've attached a profiling snapshot which shows that the maximum time is being 
spent in the cache.

One way of fixing this, IMO, is to make use of the column TypeInfo which is 
already passed to the worker method.
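
A minimal sketch of that idea (parameter names are assumptions, not the actual 
patch): thread the TypeInfo that {{worker}} already holds down into the union 
helper instead of regenerating it:

{code}
// worker() already knows the column's TypeInfo when it reaches a nullable
// union, so it can pass that TypeInfo down instead of having the helper call
// SchemaToTypeInfo.generateTypeInfo() again for every row.
private Object deserializeNullableUnion(Object datum, Schema fileSchema,
    Schema recordSchema, TypeInfo columnType) throws AvroSerdeException {
  // ... resolve the non-null branch schemas as before ...
  return worker(datum, fileSchema, newRecordSchema, columnType);
}
{code}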



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (HIVE-15107) HiveLexer can throw NPE in allowQuoteId

2016-11-01 Thread Ratandeep Ratti (JIRA)
Ratandeep Ratti created HIVE-15107:
--

 Summary: HiveLexer can throw NPE in allowQuoteId
 Key: HIVE-15107
 URL: https://issues.apache.org/jira/browse/HIVE-15107
 Project: Hive
  Issue Type: Bug
Affects Versions: 1.1.1
Reporter: Ratandeep Ratti
Assignee: Ratandeep Ratti


In HiveLexer.allowQuoteId we reference the HiveConf field, which may be null. 
The configuration field is set in ParseDriver only if the hive.ql.Context 
argument is not null, and ParseDriver exposes APIs such as 
org.apache.hadoop.hive.ql.parse.ParseDriver#parse(java.lang.String) which can 
leave the hive.ql.Context field null.
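
A minimal sketch of the kind of null guard that would avoid the NPE 
(illustrative only, not the committed fix; it assumes the sensible fallback when 
no configuration is available is the stock hive.support.quoted.identifiers 
default of "column"):

{code}
// HiveLexer action code: fall back to the default behaviour when no HiveConf
// was supplied (e.g. via ParseDriver#parse(java.lang.String) with no Context).
protected boolean allowQuoteId() {
  if (hiveConf == null) {
    return true;  // assumed default, matching hive.support.quoted.identifiers=column
  }
  String supportedQIds = HiveConf.getVar(hiveConf,
      HiveConf.ConfVars.HIVE_QUOTEDID_SUPPORT);
  return !"none".equals(supportedQIds);
}
{code}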



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-14351) Minor improvement in genUnionPlan method

2016-07-26 Thread Ratandeep Ratti (JIRA)
Ratandeep Ratti created HIVE-14351:
--

 Summary: Minor improvement in genUnionPlan method
 Key: HIVE-14351
 URL: https://issues.apache.org/jira/browse/HIVE-14351
 Project: Hive
  Issue Type: Improvement
Affects Versions: 2.1.0
Reporter: Ratandeep Ratti
Assignee: Ratandeep Ratti


The {{org.apache.hadoop.hive.ql.parse.SemanticAnalyzer#genUnionPlan}} method can 
trip up new users reading the code.

Specifically, on line 8979:
{code}
HashMap<String, ColumnInfo> leftmap = leftRR.getFieldMap(leftalias);
HashMap<String, ColumnInfo> rightmap = rightRR.getFieldMap(rightalias);
{code}

These column maps are actually LinkedHashMaps, and the code relies on this fact 
when iterating the two union branches in order.

This was not immediately clear and left me wondering how the traversal order 
could be consistent.

I've updated the code with this simple fix.
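
For reference, the change amounts to something along these lines (a sketch; it 
assumes {{getFieldMap}} is, or is changed to be, declared to return a 
LinkedHashMap):

{code}
// Declaring the maps with their concrete ordered type makes the
// order-dependent iteration over the two union branches self-documenting.
LinkedHashMap<String, ColumnInfo> leftmap = leftRR.getFieldMap(leftalias);
LinkedHashMap<String, ColumnInfo> rightmap = rightRR.getFieldMap(rightalias);
{code}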









--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-13115) MetaStore Direct SQL calls fail when the columns schema for a partition is null

2016-02-22 Thread Ratandeep Ratti (JIRA)
Ratandeep Ratti created HIVE-13115:
--

 Summary: MetaStore Direct SQL calls fail when the columns schema 
for a partition is null
 Key: HIVE-13115
 URL: https://issues.apache.org/jira/browse/HIVE-13115
 Project: Hive
  Issue Type: Bug
  Components: Hive
Affects Versions: 1.2.1
Reporter: Ratandeep Ratti


We are seeing the following exception in our MetaStore logs

{noformat}
2016-02-11 00:00:19,002 DEBUG metastore.MetaStoreDirectSql 
(MetaStoreDirectSql.java:timingTrace(602)) - Direct SQL query in 5.842372ms + 
1.066728ms, the query is [select "PARTITIONS"."PART_ID" from "PARTITIONS" 
inner join "TBLS" on "PARTITIONS"."TBL_ID" = "TBLS"."TBL_ID" and "TBLS"."TBL_NAME" = ? 
inner join "DBS" on "TBLS"."DB_ID" = "DBS"."DB_ID" and "DBS"."NAME" = ? 
order by "PART_NAME" asc]
2016-02-11 00:00:19,021 ERROR metastore.ObjectStore 
(ObjectStore.java:handleDirectSqlError(2243)) - Direct SQL failed, falling back 
to ORM
MetaException(message:Unexpected null for one of the IDs, SD 6437, column null, 
serde 6437 for a non-view)
at 
org.apache.hadoop.hive.metastore.MetaStoreDirectSql.getPartitionsViaSqlFilterInternal(MetaStoreDirectSql.java:360)
at 
org.apache.hadoop.hive.metastore.MetaStoreDirectSql.getPartitions(MetaStoreDirectSql.java:224)
at 
org.apache.hadoop.hive.metastore.ObjectStore$1.getSqlResult(ObjectStore.java:1563)
at 
org.apache.hadoop.hive.metastore.ObjectStore$1.getSqlResult(ObjectStore.java:1559)
at 
org.apache.hadoop.hive.metastore.ObjectStore$GetHelper.run(ObjectStore.java:2208)
at 
org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsInternal(ObjectStore.java:1570)
at 
org.apache.hadoop.hive.metastore.ObjectStore.getPartitions(ObjectStore.java:1553)
at sun.reflect.GeneratedMethodAccessor43.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at 
org.apache.hadoop.hive.metastore.RawStoreProxy.invoke(RawStoreProxy.java:108)
at com.sun.proxy.$Proxy5.getPartitions(Unknown Source)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_partitions(HiveMetaStore.java:2526)
at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$get_partitions.getResult(ThriftHiveMetastore.java:8747)
at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$get_partitions.getResult(ThriftHiveMetastore.java:8731)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
at 
org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge20S$Server$TUGIAssumingProcessor$1.run(HadoopThriftAuthBridge20S.java:617)
at 
org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge20S$Server$TUGIAssumingProcessor$1.run(HadoopThriftAuthBridge20S.java:613)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1591)
at 
org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge20S$Server$TUGIAssumingProcessor.process(HadoopThriftAuthBridge20S.java:613)
at 
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

{noformat}

This direct SQL call fails for every {{getPartitions}} call and then falls back 
to ORM.

The query which fails is
{code}
select 
  PARTITIONS.PART_ID, SDS.SD_ID, SDS.CD_ID,
  SERDES.SERDE_ID, PARTITIONS.CREATE_TIME,
  PARTITIONS.LAST_ACCESS_TIME, SDS.INPUT_FORMAT, SDS.IS_COMPRESSED,
  SDS.IS_STOREDASSUBDIRECTORIES, SDS.LOCATION, SDS.NUM_BUCKETS,
  SDS.OUTPUT_FORMAT, SERDES.NAME, SERDES.SLIB 
from PARTITIONS
  left outer join SDS on PARTITIONS.SD_ID = SDS.SD_ID 
  left outer join SERDES on SDS.SERDE_ID = SERDES.SERDE_ID 
  where PART_ID in (  ?  ) order by PART_NAME asc;
{code}

Looking at the source of {{MetaStoreDirectSql.java}}, the third column in the 
query (SDS.CD_ID), the column descriptor ID, is null, which triggers the 
exception. The exception is not thrown from the ORM layer since that layer is 
more forgiving of a null column descriptor. See ObjectStore.java:1197:
{code}
List<MFieldSchema> mFieldSchemas = msd.getCD() == null ? null : msd.getCD().getCols();
{code}

I verified that this exception gets triggered in the first place when we add a 
new partition without setting column-level schemas for the partition, using the 
MetaStoreClient API. This exception does not occur when adding
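
For reference, a hypothetical sketch of the kind of MetaStoreClient call that 
puts a partition into this state (all database, table, and path names below are 
made up; this is not code from the report):

{code}
import java.util.Arrays;
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.Partition;
import org.apache.hadoop.hive.metastore.api.SerDeInfo;
import org.apache.hadoop.hive.metastore.api.StorageDescriptor;

public class NullColsRepro {
  public static void main(String[] args) throws Exception {
    HiveMetaStoreClient client = new HiveMetaStoreClient(new HiveConf());

    StorageDescriptor sd = new StorageDescriptor();
    sd.setCols(null);  // no column-level schema for the partition
    sd.setLocation("/tmp/some_table/ds=2016-02-11");
    sd.setSerdeInfo(new SerDeInfo());

    Partition p = new Partition();
    p.setDbName("some_db");
    p.setTableName("some_table");
    p.setValues(Arrays.asList("2016-02-11"));
    p.setSd(sd);

    client.add_partition(p);
    // Subsequent getPartitions() calls against this table then fail the
    // direct SQL path and fall back to ORM, as in the log above.
  }
}
{code}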

[jira] [Created] (HIVE-12714) Document and make explicit GenericUDF state serialization features.

2015-12-19 Thread Ratandeep Ratti (JIRA)
Ratandeep Ratti created HIVE-12714:
--

 Summary: Document and make explicit GenericUDF state serialization 
features.
 Key: HIVE-12714
 URL: https://issues.apache.org/jira/browse/HIVE-12714
 Project: Hive
  Issue Type: New Feature
Reporter: Ratandeep Ratti
Assignee: Ratandeep Ratti


Hi,
   GenericUDF has a sort of hidden feature which is not publicized on any 
official Hive wiki. A GenericUDF's state is serialized on the client side and 
reconstructed on the slave nodes, using Kryo, with the required state intact.
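
For illustration, a minimal sketch of what relying on that behaviour looks like 
(this class is hypothetical, not something from the Hive codebase):

{code}
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

// Hypothetical UDF: state assigned while the query is compiled on the client
// travels to the task nodes inside the Kryo-serialized plan, so the
// deserialized instance already carries it.
public class ExampleStatefulUDF extends GenericUDF {

  private String stateFromClient;

  @Override
  public ObjectInspector initialize(ObjectInspector[] arguments)
      throws UDFArgumentException {
    if (stateFromClient == null) {
      // Set once on the client; preserved through plan serialization.
      stateFromClient = "computed during compilation";
    }
    return PrimitiveObjectInspectorFactory.javaStringObjectInspector;
  }

  @Override
  public Object evaluate(DeferredObject[] arguments) throws HiveException {
    // Runs on the slave nodes with the field still populated.
    return stateFromClient;
  }

  @Override
  public String getDisplayString(String[] children) {
    return "example_stateful_udf()";
  }
}
{code}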

This seems like a nice feature. We should document it, along with any 
shortcomings, and add an explicit test case to the source code to make the 
contract explicit.

Thoughts welcome.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-11878) ClassNotFoundException can possibly occur if multiple jars are registered in Hive

2015-09-18 Thread Ratandeep Ratti (JIRA)
Ratandeep Ratti created HIVE-11878:
--

 Summary: ClassNotFoundException can possibly  occur if multiple 
jars are registered in Hive
 Key: HIVE-11878
 URL: https://issues.apache.org/jira/browse/HIVE-11878
 Project: Hive
  Issue Type: Bug
  Components: Hive
Affects Versions: 1.2.1
Reporter: Ratandeep Ratti
Assignee: Ratandeep Ratti


When we register a jar on the Hive console, Hive creates a fresh URLClassLoader 
which includes the path of the jar being registered plus all the jar paths of 
the parent classloader. The parent classloader is the current 
ThreadContextClassLoader. Once the URLClassLoader is created, Hive sets it as 
the current ThreadContextClassLoader.

So if we register multiple jars in Hive, multiple URLClassLoaders are created, 
each including the jars from its parent plus the one extra jar being registered. 
The last URLClassLoader created ends up as the current ThreadContextClassLoader. 
(For details see: org.apache.hadoop.hive.ql.exec.Utilities#addToClassPath)

Now here's an example in which the above strategy can lead to a 
ClassNotFoundException. We register two jars *j1* and *j2* in the Hive console. 
*j1* contains the UDF class *c1* and internally relies on class *c2* in jar 
*j2*. We register *j1* first: the URLClassLoader *u1* is created and set as the 
ThreadContextClassLoader. We register *j2* next: the new URLClassLoader *u2* is 
created with *u1* as its parent, and *u2* becomes the new 
ThreadContextClassLoader. Note that *u2* includes paths to both jars *j1* and 
*j2*, whereas *u1* only has the path to *j1* (for details see: 
org.apache.hadoop.hive.ql.exec.Utilities#addToClassPath).

Now when we register class *c1* under a temporary function in Hive, we load the 
class using {code} Class.forName("c1", true, 
Thread.currentThread().getContextClassLoader()) {code}. The current thread 
context classloader is *u2*, and it has the path to the class *c1*, but note 
that classloaders work by delegating to the parent classloader first. In this 
case class *c1* will be found and *defined* by classloader *u1*.

Now *c1* from jar *j1* has *u1* as its classloader. If a method (say 
initialize) is called in *c1* which references the class *c2*, *c2* will not be 
found, since the classloader used to search for *c2* is *u1* (the defining 
classloader of the referencing class is used to load its references).
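
A standalone sketch of the delegation behaviour (the jar paths and class names 
are hypothetical, mirroring *j1*/*j2*/*c1*/*c2* above):

{code}
import java.net.URL;
import java.net.URLClassLoader;

public class DelegationDemo {
  public static void main(String[] args) throws Exception {
    URLClassLoader u1 = new URLClassLoader(
        new URL[] { new URL("file:/tmp/j1.jar") },
        ClassLoader.getSystemClassLoader());
    URLClassLoader u2 = new URLClassLoader(
        new URL[] { new URL("file:/tmp/j1.jar"), new URL("file:/tmp/j2.jar") },
        u1);

    // u2 delegates to its parent first, so c1 is found and *defined* by u1 ...
    Class<?> c1 = Class.forName("c1", true, u2);
    System.out.println(c1.getClassLoader());   // prints u1, not u2

    // ... and any class c1 references (e.g. c2, present only in j2.jar) is
    // resolved through u1, which cannot see j2.jar -> NoClassDefFoundError.
    c1.newInstance();
  }
}
{code}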


I've added a qtest to demonstrate the problem. Please see the attached patch.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-11639) hive-exec jar contains within itself other jars

2015-08-25 Thread Ratandeep Ratti (JIRA)
Ratandeep Ratti created HIVE-11639:
--

 Summary: hive-exec jar contains within itself other jars
 Key: HIVE-11639
 URL: https://issues.apache.org/jira/browse/HIVE-11639
 Project: Hive
  Issue Type: Bug
  Components: Hive
Affects Versions: 1.2.1
Reporter: Ratandeep Ratti


Looking at hive-exec-1.2.1.jar, I see that it contains the following other 
jars:

{code}
jar -tf lib/hive-exec-1.2.1.jar | grep .jar
minlog-1.2.jar
objenesis-1.2.jar
reflectasm-1.07-shaded.jar
{code}

The classes in these jars cannot be used unless we mess around with custom 
classloaders.






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-9851) org.apache.hadoop.hive.serde2.avro.AvroSerializer should use org.apache.avro.generic.GenericData.Array when serializing a list

2015-03-04 Thread Ratandeep Ratti (JIRA)
Ratandeep Ratti created HIVE-9851:
-

 Summary: org.apache.hadoop.hive.serde2.avro.AvroSerializer should 
use org.apache.avro.generic.GenericData.Array when serializing a list
 Key: HIVE-9851
 URL: https://issues.apache.org/jira/browse/HIVE-9851
 Project: Hive
  Issue Type: Bug
  Components: Hive, Serializers/Deserializers
Reporter: Ratandeep Ratti


Currently AvroSerializer uses java.util.ArrayList when serializing a list in 
Hive. This causes problems when we need to convert the Avro object into some 
other representation, say a tuple in Pig.
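
For illustration, the difference amounts to something like this (a sketch with 
an assumed string-array schema, not the actual serializer code):

{code}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;

public class AvroArrayExample {
  public static void main(String[] args) {
    List<Object> values = Arrays.asList((Object) "a", "b", "c");

    // What effectively happens today: the list ends up as a plain ArrayList.
    List<Object> asPlainList = new ArrayList<>(values);

    // Proposed: Avro's own array type carries the element schema with it, so
    // downstream consumers (e.g. Pig's Avro loader) can interpret it directly.
    Schema arraySchema = Schema.createArray(Schema.create(Schema.Type.STRING));
    GenericData.Array<Object> asAvroArray =
        new GenericData.Array<>(arraySchema, values);

    System.out.println(asPlainList + " vs " + asAvroArray.getSchema());
  }
}
{code}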




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)