Re: Cross join/cartesian product explanation

2015-11-09 Thread Rory Sawyer
Hi Gopal,

Thanks for the speedy response! A follow-up question though: 10Mb input sounds 
like that would work for a map join. I’m having trouble doing a cross join 
between two tables that are too big for a map-side join. Trying to break down 
one table into small enough partitions and then unioning them together seems to 
give comparable performance to a cross join. I'm running Hive on MapReduce
right now. Short of moving to a different execution engine, are there any
performance improvements that can be made to lessen the pain of a cross join?
Also, could you please elaborate on your comment "The real trouble is that
MapReduce cannot re-direct data at all (there's only shuffle edges)"? Thanks!

Best,
Rory



On 11/6/15, 5:09 PM, "Gopal Vijayaraghavan"  wrote:

>
>> Over the last few weeks I've been trying to use cross joins/cartesian
>>products and was wondering why, exactly, this all gets sent to one
>>reducer. All I've heard or read is that Hive can't/doesn't parallelize
>>the job. 
>
>The hashcode of the shuffle key is 0, since you need to process every row
>against every key - there's no possibility of dividing up the work.
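A minimal sketch of that degenerate case (plain Python, illustrative only; the reducer count and row count are arbitrary, and Python's built-in hash stands in for the shuffle hash):

```python
# Hash partitioning sends a row to reducer hash(key) % num_reducers.
# A cross join shuffles every row under the same constant key (hashcode 0),
# so every row lands in the same reducer.
NUM_REDUCERS = 4

def reducer_for(key):
    return hash(key) % NUM_REDUCERS

# 1000 rows, all shuffled under the constant key 0
targets = {reducer_for(0) for _ in range(1000)}
print(sorted(targets))  # [0] -- a single reducer gets all the work
```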
>
>Tez will eventually have a cross-product edge (TEZ-2104), a distributed
>cross-product proposal that wasn't picked up in the last Google Summer
>of Code.
>
>The real trouble is that MapReduce cannot re-direct data at all (there's
>only shuffle edges).
>
>> Does anyone have a workaround?
>
>I use a staged partitioned table as a workaround for this, hashed on a
>high nDV key - the goal of the Tez edge is to shuffle the data similarly
>at runtime.
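A rough sketch of why hashing on a high-nDV key helps (plain Python; the bucket count and key values are invented for illustration, and Python's built-in hash stands in for the shuffle hash):

```python
import collections

NUM_BUCKETS = 16

# 100000 distinct key values (high nDV): hashing spreads them evenly,
# so each staged partition gets a similar share of the rows instead of
# everything piling up in one place.
counts = collections.Counter(hash(k) % NUM_BUCKETS for k in range(100000))

print(len(counts))                          # 16 -- every bucket gets work
print(max(counts.values()), min(counts.values()))  # 6250 6250 -- even split
```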
>
>For instance, this python script makes a query with a 19x improvement in
>distribution for a cross-product which generates 50+Gb of data from a
>~10Mb input.
>
>https://gist.github.com/t3rmin4t0r/cfb5bb4f7094d595c1e8
>
>
>It is possible for Hive-Tez to actually generate UNION VertexGroups, but
>it's much more efficient to do this as an edge with a custom EdgeManager,
>since that opens up the possibility of implementing theta joins in Hive.
>
>Cheers,
>Gopal
>
>


Fwd: Failed to create HiveMetaStoreClient object in proxy user with Kerberos enabled

2015-11-09 Thread Bing Li
Hi,
I wrote a Java client to talk with the HiveMetaStore (Hive 1.2.0),
but found that it can't create a HiveMetaStoreClient object successfully via
a proxy user in a Kerberos environment.

===
15/10/13 00:14:38 ERROR transport.TSaslTransport: SASL negotiation failure
javax.security.sasl.SaslException: GSS initiate failed [Caused by
GSSException: No valid credentials provided (Mechanism level: Failed to
find any Kerberos tgt)]
at
com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
at
org.apache.thrift.transport.TSaslClientTransport.handleSaslStartMessage(TSaslClientTransport.java:94)
at
org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:271)
==

When debugging Hive, I found that the error comes from the open() method in
the HiveMetaStoreClient class.

Around line 406:

 transport = UserGroupInformation.getCurrentUser().doAs(new
     PrivilegedExceptionAction<TTransport>() {  // FAILS: the current
                                                // user has no credential

But it will work if I change the above line to:

 transport = UserGroupInformation.getCurrentUser().getRealUser().doAs(new
     PrivilegedExceptionAction<TTransport>() {  // PASSES
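To make the failure mode concrete, here is a toy Python model (not Hadoop code; the class and names are invented purely for illustration) of the relationship those two code paths exercise: the proxy UGI itself holds no Kerberos ticket, while its real user does:

```python
class ToyUgi:
    """Illustrative stand-in for UserGroupInformation; not the real API."""
    def __init__(self, name, has_tgt=False, real_user=None):
        self.name = name
        self.has_tgt = has_tgt        # does this principal hold a Kerberos TGT?
        self.real_user = real_user    # set only for proxy users

    def do_as(self, action):
        # SASL/GSSAPI negotiation needs credentials in the *current* context
        if not self.has_tgt:
            raise PermissionError("No valid credentials provided")
        return action()

real = ToyUgi("hive/host@REALM", has_tgt=True)
proxy = ToyUgi("bing", real_user=real)

proxy.real_user.do_as(lambda: "opened")  # succeeds, like getRealUser().doAs()
# proxy.do_as(...) raises PermissionError, like getCurrentUser().doAs()
```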
Searching around, I found:
1. DRILL-3413 fixes this error on the Drill side
2. HIVE-4984 (hive metastore should not re-use hadoop proxy configuration)
mentions related things, but its status is still OPEN

My questions:
1. Have you noticed this issue in HiveMetaStoreClient? If yes, does Hive
plan to fix it?
2. Is the simple change shown above in the open() method of
HiveMetaStoreClient enough?


Thank you.
- Bing


thrift.TApplicationException: Invalid method name: 'execute'

2015-11-09 Thread Rajkumar Singh
I am trying to get a query result from the Thrift API using a Java program
that uses the ThriftHive client to execute a query, but I am getting an
exception:
org.apache.thrift.TApplicationException: Invalid method name: 'execute'
at
org.apache.thrift.TApplicationException.read(TApplicationException.java:111)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:71)
at
org.apache.hadoop.hive.service.ThriftHive$Client.recv_execute(ThriftHive.java:116)
at
org.apache.hadoop.hive.service.ThriftHive$Client.execute(ThriftHive.java:103)
at
com.rajkrrsingh.thrift.test.HiveThriftClientExample.main(HiveThriftClientExample.java:23)

This is the relevant snippet of the program:

    TSocket tSocket = new TSocket("hostname", 1);
    TProtocol tProtocol = new TBinaryProtocol(tSocket);
    Client client = new Client(tProtocol);
    try {
        tSocket.open();
        System.out.println(tSocket.isOpen());
        client.execute("use default");
        client.execute("show tables");
        List<String> rs = client.fetchAll();
    } catch (TException e) {
        e.printStackTrace();
    } finally {
        tSocket.close();
    }

Any idea what's going wrong here?


Re: query orc file by hive

2015-11-09 Thread Elliot West
Hi,

You can create a table and point the location property to the folder
containing your ORC file:

CREATE EXTERNAL TABLE orc_table (
  <column definitions>
)
STORED AS ORC
LOCATION '/hdfs/folder/containing/orc/file'
;


https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTable

Thanks - Elliot.

On 9 November 2015 at 09:44, patcharee  wrote:

> Hi,
>
> How can I query an ORC file (*.orc) with Hive? This ORC file is created by
> other apps, like Spark or MR.
>
> Thanks,
> Patcharee
>


query orc file by hive

2015-11-09 Thread patcharee

Hi,

How can I query an ORC file (*.orc) with Hive? This ORC file is created by 
other apps, like Spark or MR.


Thanks,
Patcharee