Re: Cross join/cartesian product explanation
Hi Gopal,

Thanks for the speedy response! A follow-up question, though: a 10Mb input sounds like it would work for a map join. I'm having trouble doing a cross join between two tables that are too big for a map-side join. Breaking one table down into small enough partitions and then unioning the results back together seems to give comparable performance to a plain cross join.

I'm running Hive on MapReduce right now. Short of moving to a different execution engine, are there any performance improvements that can lessen the pain of a cross join?

Also, could you please elaborate on your comment "The real trouble is that MapReduce cannot re-direct data at all (there's only shuffle edges)"?

Thanks!

Best,
Rory

On 11/6/15, 5:09 PM, "Gopal Vijayaraghavan" wrote:

>> Over the last few weeks I've been trying to use cross joins/cartesian
>> products and was wondering why, exactly, this all gets sent to one
>> reducer. All I've heard or read is that Hive can't/doesn't parallelize
>> the job.
>
> The hashcode of the shuffle key is 0, since you need to process every row
> against every key - there's no possibility of dividing up the work.
>
> Tez will actually have a cross-product edge (TEZ-2104), which is a
> distributed cross-product proposal but wasn't picked up in the last
> Google Summer of Code.
>
> The real trouble is that MapReduce cannot re-direct data at all (there's
> only shuffle edges).
>
>> Does anyone have a workaround?
>
> I use a staged partitioned table as a workaround for this, hashed on a
> high-nDV key - the goal of the Tez edge is to shuffle the data similarly
> at runtime.
>
> For instance, this Python script builds a query with a 19x improvement
> in distribution for a cross product which generates 50+Gb of data from a
> ~10Mb input.
>
> https://gist.github.com/t3rmin4t0r/cfb5bb4f7094d595c1e8
>
> It is possible for Hive-Tez to actually generate UNION VertexGroups, but
> it's much more efficient to do this as an edge with a custom EdgeManager,
> since that opens up potentially implementing ThetaJoins in Hive using
> that.
>
> Cheers,
> Gopal
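To make the staged-table idea concrete, here is a minimal HiveQL sketch of one way to spread a cross product over several reducers by hand. This illustrates the general bucketing technique, not necessarily the exact query the gist above generates; the table names (big_a, big_b), the id column, and the bucket count of 8 are all made up:

    -- Assign each row of one side to a synthetic bucket, replicate the
    -- other side once per bucket, then equi-join on the bucket so the
    -- shuffle spreads over 8 reducers instead of one.
    SELECT a.*, b.*
    FROM (SELECT big_a.*, abs(hash(id)) % 8 AS bucket
          FROM big_a) a
    JOIN (SELECT big_b.*, n AS bucket
          FROM big_b
          LATERAL VIEW explode(array(0,1,2,3,4,5,6,7)) buckets AS n) b
    ON (a.bucket = b.bucket);

Every big_a row lands in exactly one bucket and every big_b row appears once per bucket, so the equi-join still produces the full cross product, but the shuffle key is now the bucket rather than a constant.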
Fwd: Failed to create HiveMetaStoreClient object in proxy user with Kerberos enabled
Hi,

I wrote a Java client to talk to the HiveMetaStore (Hive 1.2.0), but found that it can't construct a HiveMetaStoreClient object via a proxy user in a Kerberos-enabled environment:

    15/10/13 00:14:38 ERROR transport.TSaslTransport: SASL negotiation failure
    javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
        at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
        at org.apache.thrift.transport.TSaslClientTransport.handleSaslStartMessage(TSaslClientTransport.java:94)
        at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:271)

While debugging Hive, I found that the error comes from the open() method of the HiveMetaStoreClient class. Around line 406:

    transport = UserGroupInformation.getCurrentUser().doAs(new PrivilegedExceptionAction() {
    // FAILS, because the current user doesn't have the credential

But it works if I change the line above to:

    transport = UserGroupInformation.getCurrentUser().getRealUser().doAs(new PrivilegedExceptionAction() {
    // PASSES

Searching with Google, I found:

1. DRILL-3413 fixes this error on the Drill side.
2. HIVE-4984 ("hive metastore should not re-use hadoop proxy configuration") mentions related issues, but its status is still OPEN.

My questions:

1. Have you noticed this issue in HiveMetaStoreClient? If yes, does Hive plan to fix it?
2. Is the simple change shown above in the open() method of HiveMetaStoreClient enough?

Thank you.
- Bing
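To illustrate the setup that hits this, here is a minimal Java sketch of a proxy-user metastore client (the principal, keytab path, and user name are hypothetical). The point is that only the real keytab user carries a Kerberos TGT; inside doAs(), getCurrentUser() returns the credential-less proxy UGI, which is why open() needs to walk back to getRealUser() for the SASL connection:

    import java.security.PrivilegedExceptionAction;

    import org.apache.hadoop.hive.conf.HiveConf;
    import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
    import org.apache.hadoop.security.UserGroupInformation;

    public class ProxyMetastoreExample {
      public static void main(String[] args) throws Exception {
        // The real user logs in from a keytab and holds the Kerberos TGT.
        // Principal and keytab path below are placeholders.
        UserGroupInformation.loginUserFromKeytab(
            "service/host@EXAMPLE.COM", "/etc/security/service.keytab");
        UserGroupInformation realUser = UserGroupInformation.getLoginUser();

        // The proxy UGI impersonates an end user but has no TGT of its own.
        UserGroupInformation proxyUser =
            UserGroupInformation.createProxyUser("bing", realUser);

        proxyUser.doAs(new PrivilegedExceptionAction<Void>() {
          public Void run() throws Exception {
            // Inside doAs(), UserGroupInformation.getCurrentUser() is the
            // proxy UGI, so SASL negotiation started from it fails with
            // "Failed to find any Kerberos tgt".
            HiveMetaStoreClient client = new HiveMetaStoreClient(new HiveConf());
            System.out.println(client.getAllDatabases());
            client.close();
            return null;
          }
        });
      }
    }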
thrift.TApplicationException: Invalid method name: 'execute'
I am trying to get a query result from the Thrift API using a Java program that uses the ThriftHive client to execute a query, but I am getting this exception:

    org.apache.thrift.TApplicationException: Invalid method name: 'execute'
        at org.apache.thrift.TApplicationException.read(TApplicationException.java:111)
        at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:71)
        at org.apache.hadoop.hive.service.ThriftHive$Client.recv_execute(ThriftHive.java:116)
        at org.apache.hadoop.hive.service.ThriftHive$Client.execute(ThriftHive.java:103)
        at com.rajkrrsingh.thrift.test.HiveThriftClientExample.main(HiveThriftClientExample.java:23)

This is the snippet of the program:

    TSocket tSocket = new TSocket("hostname", 1);
    TProtocol tProtocol = new TBinaryProtocol(tSocket);
    Client client = new Client(tProtocol);
    try {
        tSocket.open();
        System.out.println(tSocket.isOpen());
        client.execute("use default");
        client.execute("show tables");
        List rs = client.fetchAll();

Any idea what is going wrong here?
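For reference, a self-contained version of that snippet with the needed imports and cleanup (the host and port are placeholders, not taken from the original). Note that org.apache.hadoop.hive.service.ThriftHive is the old HiveServer1 API; pointing this client at a HiveServer2 endpoint is one plausible way to get "Invalid method name" errors, since HiveServer2 speaks a different Thrift service (TCLIService):

    import java.util.List;

    import org.apache.hadoop.hive.service.ThriftHive;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.protocol.TProtocol;
    import org.apache.thrift.transport.TSocket;

    public class HiveThriftClientExample {
      public static void main(String[] args) throws Exception {
        // Placeholder endpoint: HiveServer1 traditionally listens on 10000.
        TSocket tSocket = new TSocket("hostname", 10000);
        TProtocol tProtocol = new TBinaryProtocol(tSocket);
        ThriftHive.Client client = new ThriftHive.Client(tProtocol);
        try {
          tSocket.open();
          client.execute("use default");
          client.execute("show tables");
          // fetchAll() returns the result rows of the last statement.
          List<String> rows = client.fetchAll();
          for (String row : rows) {
            System.out.println(row);
          }
        } finally {
          tSocket.close();
        }
      }
    }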
Re: query orc file by hive
Hi,

You can create a table and point the LOCATION property to the folder containing your ORC file:

    CREATE EXTERNAL TABLE orc_table (
      -- column definitions matching your file's schema
    )
    STORED AS ORC
    LOCATION '/hdfs/folder/containing/orc/file';

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTable

Thanks - Elliot.

On 9 November 2015 at 09:44, patcharee wrote:

> Hi,
>
> How can I query an ORC file (*.orc) with Hive? This ORC file is created
> by other apps, like Spark or MR.
>
> Thanks,
> Patcharee
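As a concrete example, if the ORC file happened to contain an id and a name column (a made-up schema; the column list must match what the writer actually produced), the DDL would be:

    CREATE EXTERNAL TABLE orc_table (
      id   BIGINT,
      name STRING
    )
    STORED AS ORC
    LOCATION '/hdfs/folder/containing/orc/file';

If you don't know the schema, hive --orcfiledump <path-to-orc-file> prints the file's type description, which you can translate into the column list.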
query orc file by hive
Hi,

How can I query an ORC file (*.orc) with Hive? This ORC file is created by other apps, like Spark or MR.

Thanks,
Patcharee