from:"David Schorow \(JIRA\)"

[jira] [Commented] (HIVE-7999) Hive metastore query too long when select * on table with large number of partitions

2014-12-16 Thread David Schorow (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-7999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14249562#comment-14249562
 ] 

David Schorow commented on HIVE-7999:
-

I believe this issue will be resolved by HIVE-8295.  I suggest this be closed 
as a dup of that issue.  That introduces a fix where the query only selects 
1,000 partitions at a time.


 Hive metastore query too long when select * on table with large number of 
 partitions
 -

 Key: HIVE-7999
 URL: https://issues.apache.org/jira/browse/HIVE-7999
 Project: Hive
  Issue Type: Bug
  Components: Metastore
Reporter: Chu Tong

 When it comes to selecting * on a Hive table with large number of partitions, 
 a query like this:
 SELECT PARTITIONS.PART_ID, SDS.SD_ID, SDS.CD_ID, SERDES.SERDE_ID, 
 PARTITIONS.CREATE_TIME, ... SERDES.SLIB from PARTITIONS LEFT OUTER JOIN SDS 
 ON PARTITIONS.SD_ID = SDS.SD_ID LEFT OUTER JOIN SERDES ON SDS.SERDE_ID = 
 SERDES.SERDE_ID where PART_ID in (1,2,3,4 ... 1000 ...
 is generated and executed on metastore, however, due the query lists all the 
 partitions in it, SQL server is unable to compile/execute such a long query 
 and causing the whole query to fail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (HIVE-8295) Add batch retrieve partition objects for metastore direct sql

2014-11-11 Thread David Schorow (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-8295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207032#comment-14207032
 ] 

David Schorow commented on HIVE-8295:
-

Can we make sure the fix for this Jira also fixes HIVE-7999?  That's a problem 
with SQL Server, where the query is just to long to parse.
{quote}
For a select * on a Hive Table, a query like this:

SELECT PARTITIONS.PART_ID, SDS.SD_ID, SDS.CD_ID, SERDES.SERDE_ID, 
PARTITIONS.CREATE_TIME, ... SERDES.SLIB from PARTITIONS LEFT OUTER JOIN SDS ON 
PARTITIONS.SD_ID = SDS.SD_ID LEFT OUTER JOIN SERDES ON SDS.SERDE_ID = 
SERDES.SERDE_ID where PART_ID in (1,2,3,4 ... 1000 ...

is generated and executed on metastore, however, due the query lists all the 
partitions in it, SQL server is unable to compile/execute such a long query and 
causing the whole query to fail.
{quote}

 Add batch retrieve partition objects for metastore direct sql 
 --

 Key: HIVE-8295
 URL: https://issues.apache.org/jira/browse/HIVE-8295
 Project: Hive
  Issue Type: Bug
Reporter: Selina Zhang
Assignee: Selina Zhang
 Attachments: HIVE-8295.02.patch, HIVE-8295.02.patch, 
 HIVE-8295.03.patch, HIVE-8295.1.patch


 Currently in MetastoreDirectSql partition objects are constructed in a way 
 that fetching partition ids first. However, if the partition ids that match 
 the filter is larger than 1000, direct sql will fail with the following stack 
 trace:
 {code}
 2014-09-29 19:30:02,942 DEBUG [pool-1-thread-1] metastore.MetaStoreDirectSql 
 (MetaStoreDirectSql.java:timingTrace(604)) - Direct SQL query in 122.085893ms 
 + 13.048901ms, the query is [select PARTITIONS.PART_ID from PARTITIONS  
 inner join TBLS on PARTITIONS.TBL_ID = TBLS.TBL_ID and 
 TBLS.TBL_NAME = ?   inner join DBS on TBLS.DB_ID = DBS.DB_ID
   and DBS.NAME = ? inner join PARTITION_KEY_VALS FILTER2 on 
 FILTER2.PART_ID = PARTITIONS.PART_ID and FILTER2.INTEGER_IDX = 2 
 where ((FILTER2.PART_KEY_VAL = ?))]
 2014-09-29 19:30:02,949 ERROR [pool-1-thread-1] metastore.ObjectStore 
 (ObjectStore.java:handleDirectSqlError(2248)) - Direct SQL failed, falling 
 back to ORM
 javax.jdo.JDODataStoreException: Error executing SQL query select 
 PARTITIONS.PART_ID, SDS.SD_ID, SDS.CD_ID, SERDES.SERDE_ID, 
 PARTITIONS.CREATE_TIME, PARTITIONS.LAST_ACCESS_TIME, 
 SDS.INPUT_FORMAT, SDS.IS_COMPRESSED, 
 SDS.IS_STOREDASSUBDIRECTORIES, SDS.LOCATION, SDS.NUM_BUCKETS, 
 SDS.OUTPUT_FORMAT, SERDES.NAME, SERDES.SLIB from PARTITIONS  
 left outer join SDS on PARTITIONS.SD_ID = SDS.SD_ID   left outer 
 join SERDES on SDS.SERDE_ID = SERDES.SERDE_ID where PART_ID in 
 (136,140,143,147,152,156,160,163,167,171,174,180,185,191,196,198,203,208,212,217...
 ) order by PART_NAME asc.
 at 
 org.datanucleus.api.jdo.NucleusJDOHelper.getJDOExceptionForNucleusException(NucleusJDOHelper.java:422)
 at org.datanucleus.api.jdo.JDOQuery.executeWithArray(JDOQuery.java:321)
 at 
 org.apache.hadoop.hive.metastore.MetaStoreDirectSql.getPartitionsViaSqlFilterInternal(MetaStoreDirectSql.java:331)
 at 
 org.apache.hadoop.hive.metastore.MetaStoreDirectSql.getPartitionsViaSqlFilter(MetaStoreDirectSql.java:211)
 at 
 org.apache.hadoop.hive.metastore.ObjectStore$3.getSqlResult(ObjectStore.java:1920)
 at 
 org.apache.hadoop.hive.metastore.ObjectStore$3.getSqlResult(ObjectStore.java:1914)
 at 
 org.apache.hadoop.hive.metastore.ObjectStore$GetHelper.run(ObjectStore.java:2213)
 at 
 org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsByExprInternal(ObjectStore.java:1914)
 at 
 org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsByExpr(ObjectStore.java:1887)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at 
 org.apache.hadoop.hive.metastore.RawStoreProxy.invoke(RawStoreProxy.java:98)
 at com.sun.proxy.$Proxy8.getPartitionsByExpr(Unknown Source)
 at 
 org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_partitions_by_expr(HiveMetaStore.java:3800)
 at 
 org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$get_partitions_by_expr.getResult(ThriftHiveMetastore.java:9366)
 at 
 org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$get_partitions_by_expr.getResult(ThriftHiveMetastore.java:9350)
 at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
 at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
 at 
 org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge20S$Server$TUGIAssumingProcessor$1.run(HadoopThriftAuthBridge20S.java:617)
 at

[jira] [Commented] (HIVE-8295) Add batch retrieve partition objects for metastore direct sql

2014-11-11 Thread David Schorow (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-8295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207093#comment-14207093
 ] 

David Schorow commented on HIVE-8295:
-

On SQL Server, the issue is not about being slow, it is about not being able to 
parse and compile the query, since the query text is to long.  I strongly 
suspect that splitting the query into batches of 1000 will fix that problem as 
well.  I just wanted to make sure we were not limiting this change to just 
Oracle, but are having it be applicable to other DBs as well.

 Add batch retrieve partition objects for metastore direct sql 
 --

 Key: HIVE-8295
 URL: https://issues.apache.org/jira/browse/HIVE-8295
 Project: Hive
  Issue Type: Bug
Reporter: Selina Zhang
Assignee: Selina Zhang
 Attachments: HIVE-8295.02.patch, HIVE-8295.02.patch, 
 HIVE-8295.03.patch, HIVE-8295.1.patch


 Currently in MetastoreDirectSql partition objects are constructed in a way 
 that fetching partition ids first. However, if the partition ids that match 
 the filter is larger than 1000, direct sql will fail with the following stack 
 trace:
 {code}
 2014-09-29 19:30:02,942 DEBUG [pool-1-thread-1] metastore.MetaStoreDirectSql 
 (MetaStoreDirectSql.java:timingTrace(604)) - Direct SQL query in 122.085893ms 
 + 13.048901ms, the query is [select PARTITIONS.PART_ID from PARTITIONS  
 inner join TBLS on PARTITIONS.TBL_ID = TBLS.TBL_ID and 
 TBLS.TBL_NAME = ?   inner join DBS on TBLS.DB_ID = DBS.DB_ID
   and DBS.NAME = ? inner join PARTITION_KEY_VALS FILTER2 on 
 FILTER2.PART_ID = PARTITIONS.PART_ID and FILTER2.INTEGER_IDX = 2 
 where ((FILTER2.PART_KEY_VAL = ?))]
 2014-09-29 19:30:02,949 ERROR [pool-1-thread-1] metastore.ObjectStore 
 (ObjectStore.java:handleDirectSqlError(2248)) - Direct SQL failed, falling 
 back to ORM
 javax.jdo.JDODataStoreException: Error executing SQL query select 
 PARTITIONS.PART_ID, SDS.SD_ID, SDS.CD_ID, SERDES.SERDE_ID, 
 PARTITIONS.CREATE_TIME, PARTITIONS.LAST_ACCESS_TIME, 
 SDS.INPUT_FORMAT, SDS.IS_COMPRESSED, 
 SDS.IS_STOREDASSUBDIRECTORIES, SDS.LOCATION, SDS.NUM_BUCKETS, 
 SDS.OUTPUT_FORMAT, SERDES.NAME, SERDES.SLIB from PARTITIONS  
 left outer join SDS on PARTITIONS.SD_ID = SDS.SD_ID   left outer 
 join SERDES on SDS.SERDE_ID = SERDES.SERDE_ID where PART_ID in 
 (136,140,143,147,152,156,160,163,167,171,174,180,185,191,196,198,203,208,212,217...
 ) order by PART_NAME asc.
 at 
 org.datanucleus.api.jdo.NucleusJDOHelper.getJDOExceptionForNucleusException(NucleusJDOHelper.java:422)
 at org.datanucleus.api.jdo.JDOQuery.executeWithArray(JDOQuery.java:321)
 at 
 org.apache.hadoop.hive.metastore.MetaStoreDirectSql.getPartitionsViaSqlFilterInternal(MetaStoreDirectSql.java:331)
 at 
 org.apache.hadoop.hive.metastore.MetaStoreDirectSql.getPartitionsViaSqlFilter(MetaStoreDirectSql.java:211)
 at 
 org.apache.hadoop.hive.metastore.ObjectStore$3.getSqlResult(ObjectStore.java:1920)
 at 
 org.apache.hadoop.hive.metastore.ObjectStore$3.getSqlResult(ObjectStore.java:1914)
 at 
 org.apache.hadoop.hive.metastore.ObjectStore$GetHelper.run(ObjectStore.java:2213)
 at 
 org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsByExprInternal(ObjectStore.java:1914)
 at 
 org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsByExpr(ObjectStore.java:1887)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at 
 org.apache.hadoop.hive.metastore.RawStoreProxy.invoke(RawStoreProxy.java:98)
 at com.sun.proxy.$Proxy8.getPartitionsByExpr(Unknown Source)
 at 
 org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_partitions_by_expr(HiveMetaStore.java:3800)
 at 
 org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$get_partitions_by_expr.getResult(ThriftHiveMetastore.java:9366)
 at 
 org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$get_partitions_by_expr.getResult(ThriftHiveMetastore.java:9350)
 at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
 at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
 at 
 org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge20S$Server$TUGIAssumingProcessor$1.run(HadoopThriftAuthBridge20S.java:617)
 at 
 org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge20S$Server$TUGIAssumingProcessor$1.run(HadoopThriftAuthBridge20S.java:613)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at

[jira] [Commented] (HIVE-5020) HCat reading null-key map entries causes NPE

2013-08-11 Thread David Schorow (JIRA)

[
https://issues.apache.org/jira/browse/HIVE-5020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13736421#comment-13736421
]

David Schorow commented on HIVE-5020:
-

It sounds like b is the simplest and safest solution and would effectively make
ORC work consistently with RC files.

In Data Base semantics, NULL is a special value, so for example, NULL does not
match NULL. Hence I don't know what it means to have a NULL key, or to try to
do a lookup with a NULL key. I think such a lookup should never return a
value, regardless of what is in the map. It may be okay to have NULL values in
a map.

HCat reading null-key map entries causes NPE

Key: HIVE-5020
URL: https://issues.apache.org/jira/browse/HIVE-5020
Project: Hive
Issue Type: Bug
Components: HCatalog
Reporter: Sushanth Sowmyan
Assignee: Sushanth Sowmyan

Currently, if someone has a null key in a map, HCatInputFormat will terminate
with an NPE while trying to read it.
{noformat}
java.lang.NullPointerException
at java.lang.String.compareTo(String.java:1167)
at java.lang.String.compareTo(String.java:92)
at java.util.TreeMap.put(TreeMap.java:545)
at
org.apache.hcatalog.data.HCatRecordSerDe.serializeMap(HCatRecordSerDe.java:222)
at
org.apache.hcatalog.data.HCatRecordSerDe.serializeField(HCatRecordSerDe.java:198)
at org.apache.hcatalog.data.LazyHCatRecord.get(LazyHCatRecord.java:53)
at org.apache.hcatalog.data.LazyHCatRecord.get(LazyHCatRecord.java:97)
at
org.apache.hcatalog.mapreduce.HCatRecordReader.nextKeyValue(HCatRecordReader.java:203)
{noformat}
This is because we use a TreeMap to preserve order of elements in the map
when reading from the underlying storage/serde.
This problem is easily fixed in a number of ways:
a) Switch to HashMap, which allows null keys. That does not preserve order of
keys, which should not be important for map fields, but if we desire that, we
have a solution for that too - LinkedHashMap, which would both retain order
and allow us to insert null keys into the map.
b) Ignore null keyed entries - check if the field we read is null, and if it
is, then ignore that item in the record altogether. This way, HCat is robust
in what it does - it does not terminate with an NPE, and it does not allow
null keys in maps that might be problematic to layers above us that are not
used to seeing nulls as keys in maps.
Why do I bring up the second fix? First, I bring it up because of the way we
discovered this bug. When reading from an RCFile, we do not notice this bug.
If the same query that produced the RCFile instead produces an Orcfile, and
we try reading from it, we see this problem.
RCFile seems to be quietly stripping any null key entries, whereas Orc
retains them. This is why we didn't notice this problem for a long while, and
suddenly, now, we are. Now, if we fix our code to allow nulls in map keys
through to layers above, we expose layers above to this change, which may
then cause them to break. (Technically, this is stretching the case because
we already break now if they care) More importantly, though, we have a case
now, where the same data will be exposed differently if it were stored as orc
or if it were stored as rcfile. And as a layer that is supposed to make
storage invisible to the end user, HCat should attempt to provide some
consistency in how data behaves to the end user.
Secondly, whether or not nulls should be supported as keys in Maps seems to
be almost a religious view. Some people see it from a perspective of a
mapping, which lends itself to a Sure, if we encounter a null, we map to
this other value kind of a view, whereas other people view it from a lookup
index kind of view, which lends itself to a null as a key makes no sense -
What kind of lookup do you expect to perform? kind of view. Both views have
their points, and it makes sense to see if we need to support it.
That said...
There is another important concern at hand here: nulls in map keys might be
due to bad data(corruption or loading error), and by stripping them, we might
be silently hiding that from the user. So silent stripping is bad. This is
an important point that does steer me towards the former approach, of passing
it on to layers above, and standardize on an understanding that null keys in
maps are acceptable data that layers above us have to handle. After that, it
could be taken on as a further consistency fix, to fix RCFile so that it
allows nulls in map keys.
Having gone through this discussion of standardization, another important
question is whether or not there is actually a use-case for null keys in maps
in data. If there isn't, maybe we shouldn't allow writing that in the first
place, and both orc

[jira] [Commented] (HIVE-7999) Hive metastore query too long when select * on table with large number of partitions

[jira] [Commented] (HIVE-8295) Add batch retrieve partition objects for metastore direct sql

[jira] [Commented] (HIVE-8295) Add batch retrieve partition objects for metastore direct sql

[jira] [Commented] (HIVE-5020) HCat reading null-key map entries causes NPE

4 matches

Site Navigation

Mail list logo

Footer information