Re: map side join

2015-04-30 Thread Abe Weinograd
Great info, thanks.  Makes sense on the partition since those files can be
shipped by themselves.  These are "reference" tables, but one happens to be
pretty long.

Thanks,
Abe

On Thu, Apr 30, 2015 at 12:54 PM, Gopal Vijayaraghavan 
wrote:

> Hi,
>
> > Using CDH 5.3 - Hive 0.13.  Does a view help here?  Does how I format
> > the table help in reducing size?
>
> No, a view does not help - they are not materialized and you need hive-1.0
> to have temporary table support.
>
> The only way out is if you only have 1 filter column in the system.
>
> I assume your data is not in ORC (because of CDH), which prevents any
> speedups due to the ORC row-index filters.
>
> In ORC, you can reorganize data to get min-max pruning IO savings by
> reinserting data via "insert overwrite table  select * from  sort by ".
>
> In my point lookup queries, when looking for ~3 rows in 6 billion unique
> values, ORC ends up reading only 54,000 rows into memory thanks to ORC
> indexes, even in MapReduce.
>
> But if your case is much simpler & you have a low-cardinality column (i.e
> <100 unique items), you can use that column as a partition column for your
> table, so that it is pre-filtered during planning time.
>
> Outside of those scenarios, a scalable distributed solution exists in
> Tez's broadcast JOIN - you can test Tez using the open-source Apache
> hive-0.13.1 (the one Yahoo uses), because the CDH version had Tez removed
> from the package.
>
> Cheers,
> Gopal
>
>
>


Re: map side join

2015-04-30 Thread Gopal Vijayaraghavan
Hi,

> Using CDH 5.3 - Hive 0.13.  Does a view help here?  Does how I format
> the table help in reducing size?

No, a view does not help - they are not materialized and you need hive-1.0
to have temporary table support.

The only way out is if you only have 1 filter column in the system.

I assume your data is not in ORC (because of CDH), which prevents any
speedups due to the ORC row-index filters.

In ORC, you can reorganize data to get min-max pruning IO savings by
reinserting data via "insert overwrite table  select * from  sort by ".
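A minimal sketch of that rewrite, using hypothetical table and column names (orders_orc, customer_id):

-- Rewrite the table sorted on the column you usually filter by, so ORC's
-- per-stripe/row-group min-max indexes can skip most of the data on point lookups
INSERT OVERWRITE TABLE orders_orc
SELECT * FROM orders_orc
SORT BY customer_id;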

In my point lookup queries, when looking for ~3 rows in 6 billion unique
values, ORC ends up reading only 54,000 rows into memory thanks to ORC
indexes, even in MapReduce.

But if your case is much simpler & you have a low-cardinality column (i.e.
<100 unique items), you can use that column as a partition column for your
table, so that it is pre-filtered during planning time.
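A sketch of that layout, with hypothetical names (ref_dim, category, fact); the filter on the partition column is applied at planning time, so the map-join only loads the matching partitions:

-- Low-cardinality column used as the partition key of the reference table
CREATE TABLE ref_dim (id BIGINT, name STRING)
PARTITIONED BY (category STRING);

-- Partition pruning on d.category shrinks the small-table side of the map-join
SELECT /*+ MAPJOIN(d) */ f.ref_id, d.name
FROM fact f
JOIN ref_dim d ON f.ref_id = d.id
WHERE d.category = 'active';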

Outside of those scenarios, a scalable distributed solution exists in
Tez's broadcast JOIN - you can test Tez using the open-source Apache
hive-0.13.1 (the one Yahoo uses), because the CDH version had Tez removed
from the package.
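If Tez is available in your build, switching the execution engine is a one-line session setting (this assumes a Tez-enabled Hive, e.g. Apache hive-0.13.1 with Tez installed; per the note above it is not in the CDH package):

set hive.execution.engine=tez;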

Cheers,
Gopal




Re: map side join

2015-04-30 Thread Abe Weinograd
Using CDH 5.3 - Hive 0.13.  Does a view help here?  Does how I format the
table help in reducing size?

Abe

On Thu, Apr 30, 2015 at 11:07 AM, Gopal Vijayaraghavan 
wrote:

> Hi,
>
> > its submitting the whole table to the job.  if I use a view with the
> >filter
> > baked in, will that help?  I don't want to have to jack up the JVM for
> >the
> > client/HiveServer2 to accommodate the full table.
>
> Which hive version are you using?
>
> If you're on a recent version like hive-1.0, this should be a map-reduce
> only problem.
>
> The LocalTask in map-reduce will download the entire table to the local
> task for processing on the HiveServer2 gateway nodes.
>
> Tez has a broadcast edge designed to fix exactly this sort of scalability
> problem (i.e HiveServer2 machines dying of CPU).
>
> Cheers,
> Gopal
>
>
>


Re: map side join

2015-04-30 Thread Gopal Vijayaraghavan
Hi,

> its submitting the whole table to the job.  if I use a view with the
>filter
> baked in, will that help?  I don't want to have to jack up the JVM for
>the
> client/HiveServer2 to accommodate the full table.

Which hive version are you using?

If you're on a recent version like hive-1.0, this should be a
map-reduce-only problem.

The LocalTask in map-reduce will download the entire table to the local
task for processing on the HiveServer2 gateway nodes.

Tez has a broadcast edge designed to fix exactly this sort of scalability
problem (i.e. HiveServer2 machines dying under CPU load).
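For reference, a hedged sketch of the MapReduce-side knobs involved here; the threshold value is illustrative, not a recommendation:

-- Size (in bytes) up to which tables are auto-converted to a map-join;
-- keeping this modest limits what the LocalTask has to load on the gateway node
set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask=true;
set hive.auto.convert.join.noconditionaltask.size=100000000;

-- Or fall back to a plain shuffle join if the LocalTask keeps overwhelming HiveServer2
-- set hive.auto.convert.join=false;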

Cheers,
Gopal




map side join

2015-04-30 Thread Abe Weinograd
Hi,

I am doing a few map side joins in one query to load a user-facing ORC
table in order to denormalize.

Two of the tables I am joining to are pretty large.  I am
setting hive.auto.convert.join.noconditionaltask.size pretty high.
However, the join itself filters on those two tables, but it seems like
it's submitting the whole table to the job.  If I use a view with the filter
baked in, will that help?  I don't want to have to jack up the JVM for the
client/HiveServer2 to accommodate the full table.
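Since views are not materialized (as noted in the replies above in this thread), one way to actually shrink what the map-join has to load is to materialize the filtered reference rows first; a sketch with hypothetical names (ref_full, ref_small, active_flag):

-- Materialize only the rows the join actually needs, then join against the smaller table
CREATE TABLE ref_small AS
SELECT * FROM ref_full
WHERE active_flag = 1;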

Thanks,
Abe


map-side join fails when a serialized table contains arrays

2015-03-02 Thread Makoto Yui
Hi,

I got the attached error on a map-side join where a serialized table
contains an array column.

When disabling the optimized map-join hash table via
hive.mapjoin.optimized.hashtable=false, the exception does not occur.
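For reference, the workaround above as a session setting:

-- Falls back to the older (non-optimized) map-join hash table implementation
set hive.mapjoin.optimized.hashtable=false;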

It seems that a wrong ObjectInspector was set at
CommonJoinOperator#initializeOp.

I am using Hive 1.0.0 (Tez 0.6) on Hadoop 2.6.0.

I found a similar report at
http://stackoverflow.com/questions/28606244/issues-upgrading-to-hdinsight-3-2-hive-0-14-0-tez-0-5-2


Is this a known issue/bug?

Thanks,
Makoto


task:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"gid":1,"userid":4422,"movieid":1213,"rating":5}
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:186)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:138)
    at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:324)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:176)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:168)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.call(TezTaskRunner.java:168)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.call(TezTaskRunner.java:163)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:722)
Caused by: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"gid":1,"userid":4422,"movieid":1213,"rating":5}
    at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:91)
    at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:68)
    at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:294)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:163)
    ... 14 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"gid":1,"userid":4422,"movieid":1213,"rating":5}
    at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:503)
    at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:83)
    ... 17 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unexpected exception: Unexpected exception: org.apache.hadoop.hive.serde2.lazybinary.LazyBinaryArray cannot be cast to [Ljava.lang.Object;
    at org.apache.hadoop.hive.ql.exec.MapJoinOperator.processOp(MapJoinOperator.java:311)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
    at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
    at org.apache.hadoop.hive.ql.exec.FilterOperator.processOp(FilterOperator.java:120)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
    at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:95)
    at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:157)
    at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:493)
    ... 18 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unexpected exception: org.apache.hadoop.hive.serde2.lazybinary.LazyBinaryArray cannot be cast to [Ljava.lang.Object;
    at org.apache.hadoop.hive.ql.exec.MapJoinOperator.processOp(MapJoinOperator.java:311)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
    at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
    at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.internalForward(CommonJoinOperator.java:638)
    at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genAllOneUniqueJoinObject(CommonJoinOperator.java:670)
    at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(C

Map side join failed when setting hive.optimize.cp to false

2014-08-05 Thread Shangzhong zhu
Hive version 0.12.0

To enable map side join, we set:

set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask = true;
set hive.auto.convert.join.noconditionaltask.size = 12800;


However, when we also set hive.optimize.cp=false, the map side join fails
with the following error:

hadoop@node76-99:/mnt/szhu/queries$ hive -f test-mapjoin.q

Logging initialized using configuration in
jar:file:/home/hadoop/hive-0.12.0/lib/hive-common-0.12.0.jar!/hive-log4j.properties
Hive history
file=/tmp/hadoop/hive_job_log_26f6c6bb-21ab-4e33-851c-3c5d3db63bff_166055775.txt
Total MapReduce jobs = 1
Execution log at: /tmp/hadoop/.log
2014-08-06 12:09:02    Starting to launch local task to process map
join; maximum memory = 932118528
Execution failed with exit status: 2
Obtaining error information

Task failed!
Task ID:
  Stage-5

Logs:

/tmp/hadoop/hive.log
FAILED: Execution Error, return code 2 from
org.apache.hadoop.hive.ql.exec.mr.MapredLocalTask
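For reference, per the report above the failure appears only when column pruning is disabled, so a hedged workaround is to leave hive.optimize.cp at its default alongside the map-join settings:

set hive.optimize.cp=true;   -- the default; setting it to false triggers the MapredLocalTask failure above
set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask=true;
set hive.auto.convert.join.noconditionaltask.size=12800;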


Re: Issue while inserting data in the hive table using map side join

2014-04-23 Thread Db-Blog
Hi Anirudh,

Below are some links suggesting the problem might be related to the data nodes.
Please go through them and let us know if they help.
1. http://hansmire.tumblr.com
2. http://wiki.apache.org/hadoop/CouldOnlyBeReplicatedTo

Hive Experts- Kindly share your suggestions/findings on the same. 

Thanks,
Saurabh

> On 24-Apr-2014, at 1:08 am, anirudh kala  wrote:
> 
> org.apache.hadoop.ipc.RemoteException(java.io.IOException):


Issue while inserting data in the hive table using map side join

2014-04-23 Thread anirudh kala
Hi ,


I am running a map side join in Hive; the mapred job progresses to around 20%
and then terminates with the following error:

org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
/tmp/hive-beeswax-admin/hive_2014-04-23_11-47-03_990_7581175268716346013-1/_task_tmp.-ext-10002/_tmp.02_0
could only be replicated to 0 nodes instead of minReplication (=1).
There are 17 datanode(s) running and no node(s) are excluded in this
operation.


This job runs 52 mappers. The source data is roughly 7 GB but
the target contains around 35 GB of data. The source data gets
blown up in the map side join.

This is a 20 node cluster with 8 GB of RAM on each machine.


-- 
thanks
Anirudh
Visit me at :
www.anirudhkala.in


Re: Map-side join memory limit is too low

2014-02-03 Thread Lefty Leverenz
Searching the JIRA for HADOOP_HEAPSIZE turned up this new ticket (and
related ones mentioned in the comments), HADOOP-10245:

> The Hadoop command line scripts (hadoop.sh or hadoop.cmd) will call java
> with "-Xmx" options twice. The impact is that any user defined
> HADOOP_HEAP_SIZE env variable will take no effect because it is overwritten
> by the second "-Xmx" option.
>

-- Lefty


On Sun, Feb 2, 2014 at 8:20 PM, Navis류승우  wrote:

> try "set hive.mapred.local.mem=7000" or add it to hive-site.xml instead of
> modifying hive-env.sh
>
> HADOOP_HEAPSIZE is not in use. Should fix documentation of it.
>
> Thanks,
> Navis
>
>
> 2014-01-31 Avrilia Floratou :
>
> Hi,
>> I'm running hive 0.12 on yarn and I'm trying to convert a common join
>> into a map join. My map join fails
>> and from the logs I can see that the memory limit is very low:
>>
>>  Starting to launch local task to process map join;  maximum memory =
>> 514523136
>>
>> How can I increase the maximum memory?
>> I've set the HADOOP_HEAP_SIZE at 7GB in hadoop-env.sh and hive-env.sh but
>> that didn't help.
>> Also the nodemanager runs with 7GB heap size.
>>
>> Is there anything else I can do to increase this value?
>>
>> Thanks,
>> Avrilia
>>
>
>


Re: Map-side join memory limit is too low

2014-02-02 Thread Navis류승우
try "set hive.mapred.local.mem=7000" or add it to hive-site.xml instead of
modifying hive-env.sh

HADOOP_HEAPSIZE is not in use here; its documentation should be fixed.
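As a concrete example of the suggestion above (assuming the value is interpreted in megabytes, so 7000 roughly matches the 7 GB heap mentioned in the original question); it can also be added as a property in hive-site.xml:

set hive.mapred.local.mem=7000;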

Thanks,
Navis


2014-01-31 Avrilia Floratou :

> Hi,
> I'm running hive 0.12 on yarn and I'm trying to convert a common join into
> a map join. My map join fails
> and from the logs I can see that the memory limit is very low:
>
>  Starting to launch local task to process map join;  maximum memory =
> 514523136
>
> How can I increase the maximum memory?
> I've set the HADOOP_HEAP_SIZE at 7GB in hadoop-env.sh and hive-env.sh but
> that didn't help.
> Also the nodemanager runs with 7GB heap size.
>
> Is there anything else I can do to increase this value?
>
> Thanks,
> Avrilia
>


Map-side join memory limit is too low

2014-01-31 Thread Avrilia Floratou
Hi,
I'm running Hive 0.12 on YARN and I'm trying to convert a common join into
a map join. My map join fails, and from the logs I can see that the memory
limit is very low:

 Starting to launch local task to process map join;  maximum memory =
514523136

How can I increase the maximum memory?
I've set the HADOOP_HEAP_SIZE at 7GB in hadoop-env.sh and hive-env.sh but
that didn't help.
Also the nodemanager runs with 7GB heap size.

Is there anything else I can do to increase this value?

Thanks,
Avrilia


ClassCastException during reduce-side join, but not map-side join

2013-01-17 Thread Anthony Urso
I am getting an exception when joining two tables with Amazon's Hive
0.8.1 on Amazon EMR, and I've run out of ideas on how to fix it.

The query is something along the lines of

Q1: SELECT count(*) FROM t1 x JOIN t2 y ON (x.id = y.x_id);

Which ends up throwing an exception like this in some of the mappers:

java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row
{"T1C1":"t2c1\u0001t2c2\u0001t2c3\u0001t2c4\u0001t2c5\u0001t2c6\u0001t2c7\u0001t2c8\u0001t2c8\u0001t2c9\u0001t2c10\u0001t2c11\u0001t2c12\u0001null\u0001null\u0001null\u0001null\u0001t2c18","T1C2":null,"T1C3":null,"T1C4":null,"T1C5":null}
...
...
...
Caused by: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable
    at org.apache.hadoop.hive.serde2.lazy.objectinspector.primitive.LazyIntObjectInspector.get(LazyIntObjectInspector.java:38)
...
...

... where TxCy is the name of the yth column in the xth table, and
txcy is the value of the yth column in the xth table for this row.

It looks like the deserializer for table 2 is getting an incorrectly
formatted row from table 1 and is not splitting it on \u0001 as
appears to be intended.

Self joins and selecting from either table separately works fine and
all rows are deserialized correctly in those cases.

Here are the schemas for the tables:

CREATE EXTERNAL TABLE  t1 (
c1 STRING,
...
c5 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3n://t1/';

CREATE EXTERNAL TABLE  t2 (
c1 STRING,
...
c18 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3n://t2/';

Finally, changing Q1 above to use a map-side join:

Q2: SELECT /*+ MAPJOIN(x) */ count(*) FROM t1 x JOIN t2 y ON (x.id = y.x_id);

prevents the exception from occurring at all.

Is this a known bug in Apache Hive 0.8.1 or Amazon's 0.8.1 version?
If so, is there a fix or non-mapside workaround?

Thanks,
Anthony


Re: Map side join

2012-12-27 Thread Souvik Banerjee
Hi,

To conclude this thread I am summarizing my experiences (a short settings
sketch follows the list). Correct me if you think or have observed otherwise.

1) For a map side join you need to set the flag hive.auto.convert.join=true.
Map side join works well with multiple tables and multiple join conditions.
2) You can change the size threshold for the small table according to the RAM
available.
3) If you observe huge volume expansion during the join operation, the mappers
will take a long time. I observed that mappers don't always report status, so
set the task timeout to a high value so that the framework doesn't kill the
ongoing tasks. The mappers eventually complete and the job ends successfully.
4) Bringing down the HDFS block size does launch more mappers and is very
helpful in cases where you observe real volume expansion during the join.
But it might cause problems for other queries / Hadoop jobs.
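A minimal settings sketch for the points above; the timeout value is illustrative:

-- Point 1: let Hive convert eligible joins to map-side joins
set hive.auto.convert.join=true;

-- Point 3: raise the task timeout (milliseconds) so long-running mappers that
-- report no status are not killed by the framework
set mapred.task.timeout=1800000;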

Thanks and regards,
Souvik.

On Thu, Dec 13, 2012 at 12:36 PM, Souvik Banerjee
wrote:

> Thanks for the help.
> What I did earlier is that I changed the configuration in HDFS and created
> the table. I expected that the block size of the new Table to be of 32 MB.
> But I found that while using Cloudera Manager you need to deploy Change in
> Configuration of both the HDFS and Mapreduce. (I did it only for HDFS)
> Now I deleted the old table and recreated the same. Now I could launch
> more mappers.
> Thanks a lot once again. Will post you what happens with more mappers.
>
> Thanks and regards,
> Souvik.
>
>
> On Thu, Dec 13, 2012 at 12:06 PM,  wrote:
>
>> **
>> Hi Souvik
>>
>> To have the new hdfs block size in effect on the already existing files,
>> you need to re copy them into hdfs.
>>
>> To play with the number of mappers you can set lesser value like 64mb for
>> min and max split size.
>>
>> Mapred.min.split.size and mapred.max.split.size
>>
>> Regards
>> Bejoy KS
>>
>> Sent from remote device, Please excuse typos
>> --
>> *From: * Souvik Banerjee 
>> *Date: *Thu, 13 Dec 2012 12:00:16 -0600
>> *To: *; 
>> *Subject: *Re: Map side join
>>
>> Hi Bejoy,
>>
>> The input files are non-compressed text file.
>> There are enough free slots in the cluster.
>>
>> Can you please let me know can I increase the no of mappers?
>> I tried reducing the HDFS block size to 32 MB from 128 MB. I was
>> expecting to get more mappers. But still it's launching same no of mappers
>> like it was doing while the HDFS block size was 128 MB. I have enough map
>> slots available, but not being able to utilize those.
>>
>>
>> Thanks and regards,
>> Souvik.
>>
>>
>> On Thu, Dec 13, 2012 at 11:12 AM,  wrote:
>>
>>> **
>>> Hi Souvik
>>>
>>> Is your input files compressed using some non splittable compression
>>> codec?
>>>
>>> Do you have enough free slots while this job is running?
>>>
>>> Make sure that the job is not running locally.
>>>
>>> Regards
>>> Bejoy KS
>>>
>>> Sent from remote device, Please excuse typos
>>> --
>>> *From: * Souvik Banerjee 
>>> *Date: *Wed, 12 Dec 2012 14:27:27 -0600
>>> *To: *; 
>>> *ReplyTo: * user@hive.apache.org
>>> *Subject: *Re: Map side join
>>>
>>> Hi Bejoy,
>>>
>>> Yes I ran the pi example. It was fine.
>>> Regarding the HIVE Job what I found is that it took 4 hrs for the first
>>> map job to get completed.
>>> Those map tasks were doing their job and only reported status after
>>> completion. It is indeed taking too long time to finish. Nothing I could
>>> find relevant in the logs.
>>>
>>> Thanks and regards,
>>> Souvik.
>>>
>>> On Wed, Dec 12, 2012 at 8:04 AM,  wrote:
>>>
>>>> **
>>>> Hi Souvik
>>>>
>>>> Apart from hive jobs is the normal mapreduce jobs like the wordcount
>>>> running fine on your cluster?
>>>>
>>>> If it is working, for the hive jobs are you seeing anything skeptical
>>>> in task, Tasktracker or jobtracker logs?
>>>>
>>>>
>>>> Regards
>>>> Bejoy KS
>>>>
>>>> Sent from remote device, Please excuse typos
>>>> --
>>>> *From: * Souvik Banerjee 
>>>> *Date: *Tue, 11 Dec 2012 17:12:20 -0600
>>>> *To: *; 
>>>> *ReplyTo: * user@hive.apache.org
>>>> *Subject: *Re: Map side join
>>>>
>>>> Hello Everybody,
>>>>
>>&

Re: map side join with group by

2012-12-13 Thread Chen Song
Thanks Nitin. This is all I want to clarify :)

Chen

On Thu, Dec 13, 2012 at 2:30 PM, Nitin Pawar wrote:

> to improve the speed of the job they created map only joins so that all
> the records associated with a key fall to a map .. reducers slows it down.
> If the reducer has to do some more job then they launch another job.
>
> bear in mind, when we say map only join we are absolutely sure that speed
> will increase in case data in one of the tables is in the few hundred MB
> ranges. If this has to do with reduce in hand, the processing logic
> completely changes and it also slows down.
>
> Launching a new job for group by is a neat way to measure how much time
> you spent on just join and another on group by so you can easily see two
> different things.
>
> There is no way you can ask a mapjoin to launch a reducer as it is not
> supposed to do.
>
> If you have such case (may be if you think that it will improve
> performance), please feel free to raise a jira and get it reviewed. if its
> valid I think people will provide more ideas
>
>
> On Fri, Dec 14, 2012 at 12:42 AM, Chen Song wrote:
>
>> Nitin
>>
>> Yeah. My original question is that is there a way to force Hive (or
>> rather to say, is it possible) to execute map side join at mapper phase and
>> group by in reduce phase. So instead of launching a map only job (join) and
>> map reduce job (group by), doing it altogether in a single MR job. This is
>> obviously not what Hive does but I am wondering if it is a nice feature to
>> have.
>>
>> The point you made (different keys in join and group by) only matters
>> when it is the time in reduce phase, right? As map side join takes care of
>> join at mapper phase, it sounds to me natural that group by can be done in
>> the reduce phase in the same job. The only hassle that I can think of is
>> that map output have to be resorted (based on group by keys).
>>
>> Chen
>>
>> On Thu, Dec 13, 2012 at 1:42 PM, Nitin Pawar wrote:
>>
>>> chen in mapside join .. there are no reducers .. its MAP ONLY job
>>>
>>>
>>> On Thu, Dec 13, 2012 at 11:54 PM, Chen Song wrote:
>>>
>>>> Understood that fact that it is impossible in the same MR job if both
>>>> join and group by are gonna happen in the reduce phase (because the join
>>>> keys and group by keys are different). But for map side join, the joins
>>>> would be complete by the end of the map phase, and outputs should be ready
>>>> to be distributed to reducers based on group by keys.
>>>>
>>>> Chen
>>>>
>>>>
>>>> On Thu, Dec 13, 2012 at 11:04 AM, Nitin Pawar 
>>>> wrote:
>>>>
>>>>> Thats because for the first job the join keys are different and second
>>>>> job group by keys are different, you just cant assume join keys and group
>>>>> keys will be same so they are two different jobs
>>>>>
>>>>>
>>>>> On Thu, Dec 13, 2012 at 8:26 PM, Chen Song wrote:
>>>>>
>>>>>> Yeah, my abridged version of query might be a little broken but my
>>>>>> point is that when a query has a map join and group by, even in its
>>>>>> simplified incarnation, it will launch two jobs. I was just wondering why
>>>>>> map join and group by cannot be accomplished in one MR job.
>>>>>>
>>>>>> Best,
>>>>>> Chen
>>>>>>
>>>>>>
>>>>>> On Thu, Dec 13, 2012 at 12:30 AM, Nitin Pawar <
>>>>>> nitinpawar...@gmail.com> wrote:
>>>>>>
>>>>>>> I think Chen wanted to know why this is two phased query if I
>>>>>>> understood it correctly
>>>>>>>
>>>>>>> When you run a mapside join .. it just performs the join query ..
>>>>>>> after that to execute the group by part it launches the second job.
>>>>>>> I may be wrong but this is how I saw it whenever I executed group by
>>>>>>> queries
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Dec 13, 2012 at 7:11 AM, Mark Grover <
>>>>>>> grover.markgro...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Chen,
>>>>>>>> I think we would need some more information.
>>>>>>>>
>>>>>>>> The query is referring to a table called "d" in the MAPJOIN hint but

Re: map side join with group by

2012-12-13 Thread Nitin Pawar
To improve the speed of the job they created map-only joins, so that all the
records associated with a key fall to a map; reducers slow it down. If
the reducer has to do some more work then they launch another job.

Bear in mind, when we say map-only join we are absolutely sure that speed
will increase only when the data in one of the tables is in the few-hundred-MB
range. If a reducer were involved, the processing logic
completely changes and it also slows down.

Launching a new job for the group by is a neat way to measure how much time you
spent on just the join and how much on the group by, so you can easily see two
different things.

There is no way you can ask a mapjoin to launch a reducer, as it is not
supposed to do so.

If you have such a case (maybe if you think that it will improve
performance), please feel free to raise a JIRA and get it reviewed. If it's
valid I think people will provide more ideas.


On Fri, Dec 14, 2012 at 12:42 AM, Chen Song  wrote:

> Nitin
>
> Yeah. My original question is that is there a way to force Hive (or rather
> to say, is it possible) to execute map side join at mapper phase and group
> by in reduce phase. So instead of launching a map only job (join) and map
> reduce job (group by), doing it altogether in a single MR job. This is
> obviously not what Hive does but I am wondering if it is a nice feature to
> have.
>
> The point you made (different keys in join and group by) only matters when
> it is the time in reduce phase, right? As map side join takes care of join
> at mapper phase, it sounds to me natural that group by can be done in the
> reduce phase in the same job. The only hassle that I can think of is that
> map output have to be resorted (based on group by keys).
>
> Chen
>
> On Thu, Dec 13, 2012 at 1:42 PM, Nitin Pawar wrote:
>
>> chen in mapside join .. there are no reducers .. its MAP ONLY job
>>
>>
>> On Thu, Dec 13, 2012 at 11:54 PM, Chen Song wrote:
>>
>>> Understood that fact that it is impossible in the same MR job if both
>>> join and group by are gonna happen in the reduce phase (because the join
>>> keys and group by keys are different). But for map side join, the joins
>>> would be complete by the end of the map phase, and outputs should be ready
>>> to be distributed to reducers based on group by keys.
>>>
>>> Chen
>>>
>>>
>>> On Thu, Dec 13, 2012 at 11:04 AM, Nitin Pawar 
>>> wrote:
>>>
>>>> Thats because for the first job the join keys are different and second
>>>> job group by keys are different, you just cant assume join keys and group
>>>> keys will be same so they are two different jobs
>>>>
>>>>
>>>> On Thu, Dec 13, 2012 at 8:26 PM, Chen Song wrote:
>>>>
>>>>> Yeah, my abridged version of query might be a little broken but my
>>>>> point is that when a query has a map join and group by, even in its
>>>>> simplified incarnation, it will launch two jobs. I was just wondering why
>>>>> map join and group by cannot be accomplished in one MR job.
>>>>>
>>>>> Best,
>>>>> Chen
>>>>>
>>>>>
>>>>> On Thu, Dec 13, 2012 at 12:30 AM, Nitin Pawar >>>> > wrote:
>>>>>
>>>>>> I think Chen wanted to know why this is two phased query if I
>>>>>> understood it correctly
>>>>>>
>>>>>> When you run a mapside join .. it just performs the join query ..
>>>>>> after that to execute the group by part it launches the second job.
>>>>>> I may be wrong but this is how I saw it whenever I executed group by
>>>>>> queries
>>>>>>
>>>>>>
>>>>>> On Thu, Dec 13, 2012 at 7:11 AM, Mark Grover <
>>>>>> grover.markgro...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Chen,
>>>>>>> I think we would need some more information.
>>>>>>>
>>>>>>> The query is referring to a table called "d" in the MAPJOIN hint but
>>>>>>> there is not such table in the query. Moreover, Map joins only make
>>>>>>> sense when the right table is the one being "mapped" (in other words,
>>>>>>> being kept in memory) in case of a Left Outer Join, similarly if the
>>>>>>> left table is the one being "mapped" in case of a Right Outer Join.
>>>>>>> Let me know if this is not clear, I'd be happy to offer a better
>>>>>>

Re: map side join with group by

2012-12-13 Thread Chen Song
Nitin

Yeah. My original question is whether there is a way to force Hive (or
rather, whether it is possible) to execute the map side join in the mapper
phase and the group by in the reduce phase. So instead of launching a map-only
job (join) and a map-reduce job (group by), it would do it all in a single MR
job. This is obviously not what Hive does, but I am wondering if it would be a
nice feature to have.

The point you made (different keys in join and group by) only matters when
it comes to the reduce phase, right? As the map side join takes care of the
join at the mapper phase, it sounds natural to me that the group by could be
done in the reduce phase of the same job. The only hassle that I can think of
is that the map output would have to be re-sorted (based on the group by keys).
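For what it's worth, the two-stage plan (a map-only join followed by a map-reduce group by) is easy to confirm with EXPLAIN; a minimal sketch with hypothetical table and column names:

EXPLAIN
SELECT /*+ MAPJOIN(b) */ a.k, SUM(b.v)
FROM a
JOIN b ON a.id = b.id
GROUP BY a.k;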

Chen

On Thu, Dec 13, 2012 at 1:42 PM, Nitin Pawar wrote:

> chen in mapside join .. there are no reducers .. its MAP ONLY job
>
>
> On Thu, Dec 13, 2012 at 11:54 PM, Chen Song wrote:
>
>> Understood that fact that it is impossible in the same MR job if both
>> join and group by are gonna happen in the reduce phase (because the join
>> keys and group by keys are different). But for map side join, the joins
>> would be complete by the end of the map phase, and outputs should be ready
>> to be distributed to reducers based on group by keys.
>>
>> Chen
>>
>>
>> On Thu, Dec 13, 2012 at 11:04 AM, Nitin Pawar wrote:
>>
>>> Thats because for the first job the join keys are different and second
>>> job group by keys are different, you just cant assume join keys and group
>>> keys will be same so they are two different jobs
>>>
>>>
>>> On Thu, Dec 13, 2012 at 8:26 PM, Chen Song wrote:
>>>
>>>> Yeah, my abridged version of query might be a little broken but my
>>>> point is that when a query has a map join and group by, even in its
>>>> simplified incarnation, it will launch two jobs. I was just wondering why
>>>> map join and group by cannot be accomplished in one MR job.
>>>>
>>>> Best,
>>>> Chen
>>>>
>>>>
>>>> On Thu, Dec 13, 2012 at 12:30 AM, Nitin Pawar 
>>>> wrote:
>>>>
>>>>> I think Chen wanted to know why this is two phased query if I
>>>>> understood it correctly
>>>>>
>>>>> When you run a mapside join .. it just performs the join query ..
>>>>> after that to execute the group by part it launches the second job.
>>>>> I may be wrong but this is how I saw it whenever I executed group by
>>>>> queries
>>>>>
>>>>>
>>>>> On Thu, Dec 13, 2012 at 7:11 AM, Mark Grover <
>>>>> grover.markgro...@gmail.com> wrote:
>>>>>
>>>>>> Hi Chen,
>>>>>> I think we would need some more information.
>>>>>>
>>>>>> The query is referring to a table called "d" in the MAPJOIN hint but
>>>>>> there is not such table in the query. Moreover, Map joins only make
>>>>>> sense when the right table is the one being "mapped" (in other words,
>>>>>> being kept in memory) in case of a Left Outer Join, similarly if the
>>>>>> left table is the one being "mapped" in case of a Right Outer Join.
>>>>>> Let me know if this is not clear, I'd be happy to offer a better
>>>>>> explanation.
>>>>>>
>>>>>> In your query, the where clause on a column called "hour", at this
>>>>>> point I am unsure if that's a column of table1 or table2. If it's
>>>>>> column on table1, that predicate would get pushed up (if you have
>>>>>> hive.optimize.ppd property set to true), so it could possibly be done
>>>>>> in 1 MR job (I am not sure if that's presently the case, you will have
>>>>>> to check the explain plan). If however, the where clause is on a
>>>>>> column in the right table (table2 in your example), it can't be pushed
>>>>>> up since a column of the right table can have different values before
>>>>>> and after the LEFT OUTER JOIN. Therefore, the where clause would need
>>>>>> to be applied in a separate MR job.
>>>>>>
>>>>>> This is just my understanding, the full proof answer would lie in
>>>>>> checking out the explain plans and the Semantic Analyzer code.
>>>>>>
>>>>>> And for completeness, there is a conditional task (starting Hive 0

Re: map side join with group by

2012-12-13 Thread Nitin Pawar
Chen, in a mapside join there are no reducers .. it's a MAP ONLY job


On Thu, Dec 13, 2012 at 11:54 PM, Chen Song  wrote:

> Understood that fact that it is impossible in the same MR job if both join
> and group by are gonna happen in the reduce phase (because the join keys
> and group by keys are different). But for map side join, the joins would be
> complete by the end of the map phase, and outputs should be ready to be
> distributed to reducers based on group by keys.
>
> Chen
>
>
> On Thu, Dec 13, 2012 at 11:04 AM, Nitin Pawar wrote:
>
>> Thats because for the first job the join keys are different and second
>> job group by keys are different, you just cant assume join keys and group
>> keys will be same so they are two different jobs
>>
>>
>> On Thu, Dec 13, 2012 at 8:26 PM, Chen Song wrote:
>>
>>> Yeah, my abridged version of query might be a little broken but my point
>>> is that when a query has a map join and group by, even in its simplified
>>> incarnation, it will launch two jobs. I was just wondering why map join and
>>> group by cannot be accomplished in one MR job.
>>>
>>> Best,
>>> Chen
>>>
>>>
>>> On Thu, Dec 13, 2012 at 12:30 AM, Nitin Pawar 
>>> wrote:
>>>
>>>> I think Chen wanted to know why this is two phased query if I
>>>> understood it correctly
>>>>
>>>> When you run a mapside join .. it just performs the join query .. after
>>>> that to execute the group by part it launches the second job.
>>>> I may be wrong but this is how I saw it whenever I executed group by
>>>> queries
>>>>
>>>>
>>>> On Thu, Dec 13, 2012 at 7:11 AM, Mark Grover <
>>>> grover.markgro...@gmail.com> wrote:
>>>>
>>>>> Hi Chen,
>>>>> I think we would need some more information.
>>>>>
>>>>> The query is referring to a table called "d" in the MAPJOIN hint but
>>>>> there is not such table in the query. Moreover, Map joins only make
>>>>> sense when the right table is the one being "mapped" (in other words,
>>>>> being kept in memory) in case of a Left Outer Join, similarly if the
>>>>> left table is the one being "mapped" in case of a Right Outer Join.
>>>>> Let me know if this is not clear, I'd be happy to offer a better
>>>>> explanation.
>>>>>
>>>>> In your query, the where clause on a column called "hour", at this
>>>>> point I am unsure if that's a column of table1 or table2. If it's
>>>>> column on table1, that predicate would get pushed up (if you have
>>>>> hive.optimize.ppd property set to true), so it could possibly be done
>>>>> in 1 MR job (I am not sure if that's presently the case, you will have
>>>>> to check the explain plan). If however, the where clause is on a
>>>>> column in the right table (table2 in your example), it can't be pushed
>>>>> up since a column of the right table can have different values before
>>>>> and after the LEFT OUTER JOIN. Therefore, the where clause would need
>>>>> to be applied in a separate MR job.
>>>>>
>>>>> This is just my understanding, the full proof answer would lie in
>>>>> checking out the explain plans and the Semantic Analyzer code.
>>>>>
>>>>> And for completeness, there is a conditional task (starting Hive 0.7)
>>>>> that will convert your joins automatically to map joins where
>>>>> applicable. This can be enabled by enabling hive.auto.convert.join
>>>>> property.
>>>>>
>>>>> Mark
>>>>>
>>>>> On Wed, Dec 12, 2012 at 3:32 PM, Chen Song 
>>>>> wrote:
>>>>> > I have a silly question on how Hive interpretes a simple query with
>>>>> both map
>>>>> > side join and group by.
>>>>> >
>>>>> > Below query will translate into two jobs, with the 1st one as a map
>>>>> only job
>>>>> > doing the join and storing the output in a intermediary location,
>>>>> and the
>>>>> > 2nd one as a map-reduce job taking the output of the 1st job as
>>>>> input and
>>>>> > doing the group by.
>>>>> >
>>>>> > SELECT
>>>>> > /*+ MAPJOIN(d) */
>>>>> > table.a, sum(table2.b)
>>>>> > from table
>>>>> > LEFT OUTER JOIN table2
>>>>> > ON table.id = table2.id
>>>>> > where hour = '2012-12-11 11'
>>>>> > group by table.a
>>>>> >
>>>>> > Why can't this be done within a single map reduce job? As what I can
>>>>> see
>>>>> > from the query plan is that all 2nd job mapper do is taking the 1st
>>>>> job's
>>>>> > mapper output.
>>>>> >
>>>>> > --
>>>>> > Chen Song
>>>>> >
>>>>> >
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Nitin Pawar
>>>>
>>>
>>>
>>>
>>> --
>>> Chen Song
>>>
>>>
>>>
>>
>>
>> --
>> Nitin Pawar
>>
>
>
>
> --
> Chen Song
>
>
>


-- 
Nitin Pawar


Re: Map side join

2012-12-13 Thread Souvik Banerjee
Thanks for the help.
What I did earlier is that I changed the configuration in HDFS and created
the table. I expected the block size of the new table to be 32 MB.
But I found that while using Cloudera Manager you need to deploy the
configuration change for both HDFS and MapReduce (I had done it only for HDFS).
Now I have deleted the old table and recreated it, and I can launch more
mappers.
Thanks a lot once again. Will post what happens with more mappers.

Thanks and regards,
Souvik.

On Thu, Dec 13, 2012 at 12:06 PM,  wrote:

> **
> Hi Souvik
>
> To have the new hdfs block size in effect on the already existing files,
> you need to re copy them into hdfs.
>
> To play with the number of mappers you can set lesser value like 64mb for
> min and max split size.
>
> Mapred.min.split.size and mapred.max.split.size
>
> Regards
> Bejoy KS
>
> Sent from remote device, Please excuse typos
> --
> *From: * Souvik Banerjee 
> *Date: *Thu, 13 Dec 2012 12:00:16 -0600
> *To: *; 
> *Subject: *Re: Map side join
>
> Hi Bejoy,
>
> The input files are non-compressed text file.
> There are enough free slots in the cluster.
>
> Can you please let me know can I increase the no of mappers?
> I tried reducing the HDFS block size to 32 MB from 128 MB. I was expecting
> to get more mappers. But still it's launching same no of mappers like it
> was doing while the HDFS block size was 128 MB. I have enough map slots
> available, but not being able to utilize those.
>
>
> Thanks and regards,
> Souvik.
>
>
> On Thu, Dec 13, 2012 at 11:12 AM,  wrote:
>
>> **
>> Hi Souvik
>>
>> Is your input files compressed using some non splittable compression
>> codec?
>>
>> Do you have enough free slots while this job is running?
>>
>> Make sure that the job is not running locally.
>>
>> Regards
>> Bejoy KS
>>
>> Sent from remote device, Please excuse typos
>> --
>> *From: * Souvik Banerjee 
>> *Date: *Wed, 12 Dec 2012 14:27:27 -0600
>> *To: *; 
>> *ReplyTo: * user@hive.apache.org
>> *Subject: *Re: Map side join
>>
>> Hi Bejoy,
>>
>> Yes I ran the pi example. It was fine.
>> Regarding the HIVE Job what I found is that it took 4 hrs for the first
>> map job to get completed.
>> Those map tasks were doing their job and only reported status after
>> completion. It is indeed taking too long time to finish. Nothing I could
>> find relevant in the logs.
>>
>> Thanks and regards,
>> Souvik.
>>
>> On Wed, Dec 12, 2012 at 8:04 AM,  wrote:
>>
>>> **
>>> Hi Souvik
>>>
>>> Apart from hive jobs is the normal mapreduce jobs like the wordcount
>>> running fine on your cluster?
>>>
>>> If it is working, for the hive jobs are you seeing anything skeptical in
>>> task, Tasktracker or jobtracker logs?
>>>
>>>
>>> Regards
>>> Bejoy KS
>>>
>>> Sent from remote device, Please excuse typos
>>> --
>>> *From: * Souvik Banerjee 
>>> *Date: *Tue, 11 Dec 2012 17:12:20 -0600
>>> *To: *; 
>>> *ReplyTo: * user@hive.apache.org
>>> *Subject: *Re: Map side join
>>>
>>> Hello Everybody,
>>>
>>> Need help in for on HIVE join. As we were talking about the Map side
>>> join I tried that.
>>> I set the flag set hive.auto.convert.join=true;
>>>
>>> I saw Hive converts the same to map join while launching the job. But
>>> the problem is that none of the map job progresses in my case. I made the
>>> dataset smaller. Now it's only 512 MB cross 25 MB. I was expecting it to be
>>> done very quickly.
>>> No luck with any change of settings.
>>> Failing to progress with the default setting changes these settings.
>>> set hive.mapred.local.mem=1024; // Initially it was 216 I guess
>>> set hive.join.cache.size=10; // Initialliu it was 25000
>>>
>>> Also on Hadoop side I made this changes
>>>
>>> mapred.child.java.opts -Xmx1073741824
>>>
>>> But I don't see any progress. After more than 40 minutes of run I am at
>>> 0% map completion state.
>>> Can you please throw some light on this?
>>>
>>> Thanks a lot once again.
>>>
>>> Regards,
>>> Souvik.
>>>
>>>
>>>
>>> On Fri, Dec 7, 2012 at 2:32 PM, Souvik Banerjee <
>>> souvikbaner...@gmail.com> wrote:
>>>
>>>> Hi Bejoy,
>&

Re: map side join with group by

2012-12-13 Thread Chen Song
Understood the fact that it is impossible in the same MR job if both the join
and the group by are going to happen in the reduce phase (because the join keys
and group by keys are different). But for a map side join, the join would be
complete by the end of the map phase, and the outputs should be ready to be
distributed to reducers based on the group by keys.

Chen

On Thu, Dec 13, 2012 at 11:04 AM, Nitin Pawar wrote:

> Thats because for the first job the join keys are different and second job
> group by keys are different, you just cant assume join keys and group keys
> will be same so they are two different jobs
>
>
> On Thu, Dec 13, 2012 at 8:26 PM, Chen Song  wrote:
>
>> Yeah, my abridged version of query might be a little broken but my point
>> is that when a query has a map join and group by, even in its simplified
>> incarnation, it will launch two jobs. I was just wondering why map join and
>> group by cannot be accomplished in one MR job.
>>
>> Best,
>> Chen
>>
>>
>> On Thu, Dec 13, 2012 at 12:30 AM, Nitin Pawar wrote:
>>
>>> I think Chen wanted to know why this is two phased query if I understood
>>> it correctly
>>>
>>> When you run a mapside join .. it just performs the join query .. after
>>> that to execute the group by part it launches the second job.
>>> I may be wrong but this is how I saw it whenever I executed group by
>>> queries
>>>
>>>
>>> On Thu, Dec 13, 2012 at 7:11 AM, Mark Grover <
>>> grover.markgro...@gmail.com> wrote:
>>>
>>>> Hi Chen,
>>>> I think we would need some more information.
>>>>
>>>> The query is referring to a table called "d" in the MAPJOIN hint but
>>>> there is not such table in the query. Moreover, Map joins only make
>>>> sense when the right table is the one being "mapped" (in other words,
>>>> being kept in memory) in case of a Left Outer Join, similarly if the
>>>> left table is the one being "mapped" in case of a Right Outer Join.
>>>> Let me know if this is not clear, I'd be happy to offer a better
>>>> explanation.
>>>>
>>>> In your query, the where clause on a column called "hour", at this
>>>> point I am unsure if that's a column of table1 or table2. If it's
>>>> column on table1, that predicate would get pushed up (if you have
>>>> hive.optimize.ppd property set to true), so it could possibly be done
>>>> in 1 MR job (I am not sure if that's presently the case, you will have
>>>> to check the explain plan). If however, the where clause is on a
>>>> column in the right table (table2 in your example), it can't be pushed
>>>> up since a column of the right table can have different values before
>>>> and after the LEFT OUTER JOIN. Therefore, the where clause would need
>>>> to be applied in a separate MR job.
>>>>
>>>> This is just my understanding, the full proof answer would lie in
>>>> checking out the explain plans and the Semantic Analyzer code.
>>>>
>>>> And for completeness, there is a conditional task (starting Hive 0.7)
>>>> that will convert your joins automatically to map joins where
>>>> applicable. This can be enabled by enabling hive.auto.convert.join
>>>> property.
>>>>
>>>> Mark
>>>>
>>>> On Wed, Dec 12, 2012 at 3:32 PM, Chen Song 
>>>> wrote:
>>>> > I have a silly question on how Hive interpretes a simple query with
>>>> both map
>>>> > side join and group by.
>>>> >
>>>> > Below query will translate into two jobs, with the 1st one as a map
>>>> only job
>>>> > doing the join and storing the output in a intermediary location, and
>>>> the
>>>> > 2nd one as a map-reduce job taking the output of the 1st job as input
>>>> and
>>>> > doing the group by.
>>>> >
>>>> > SELECT
>>>> > /*+ MAPJOIN(d) */
>>>> > table.a, sum(table2.b)
>>>> > from table
>>>> > LEFT OUTER JOIN table2
>>>> > ON table.id = table2.id
>>>> > where hour = '2012-12-11 11'
>>>> > group by table.a
>>>> >
>>>> > Why can't this be done within a single map reduce job? As what I can
>>>> see
>>>> > from the query plan is that all 2nd job mapper do is taking the 1st
>>>> job's
>>>> > mapper output.
>>>> >
>>>> > --
>>>> > Chen Song
>>>> >
>>>> >
>>>>
>>>
>>>
>>>
>>> --
>>> Nitin Pawar
>>>
>>
>>
>>
>> --
>> Chen Song
>>
>>
>>
>
>
> --
> Nitin Pawar
>



-- 
Chen Song


Re: Map side join

2012-12-13 Thread bejoy_ks
Hi Souvik

To have the new HDFS block size take effect on already existing files, you
need to re-copy them into HDFS.

To play with the number of mappers you can set a smaller value, like 64 MB,
for the min and max split size:

mapred.min.split.size and mapred.max.split.size
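As a concrete example of that suggestion (64 MB expressed in bytes):

set mapred.min.split.size=67108864;
set mapred.max.split.size=67108864;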

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Souvik Banerjee 
Date: Thu, 13 Dec 2012 12:00:16 
To: ; 
Subject: Re: Map side join

Hi Bejoy,

The input files are non-compressed text file.
There are enough free slots in the cluster.

Can you please let me know can I increase the no of mappers?
I tried reducing the HDFS block size to 32 MB from 128 MB. I was expecting
to get more mappers. But still it's launching same no of mappers like it
was doing while the HDFS block size was 128 MB. I have enough map slots
available, but not being able to utilize those.


Thanks and regards,
Souvik.


On Thu, Dec 13, 2012 at 11:12 AM,  wrote:

> **
> Hi Souvik
>
> Is your input files compressed using some non splittable compression codec?
>
> Do you have enough free slots while this job is running?
>
> Make sure that the job is not running locally.
>
> Regards
> Bejoy KS
>
> Sent from remote device, Please excuse typos
> --
> *From: * Souvik Banerjee 
> *Date: *Wed, 12 Dec 2012 14:27:27 -0600
> *To: *; 
> *ReplyTo: * user@hive.apache.org
> *Subject: *Re: Map side join
>
> Hi Bejoy,
>
> Yes I ran the pi example. It was fine.
> Regarding the HIVE Job what I found is that it took 4 hrs for the first
> map job to get completed.
> Those map tasks were doing their job and only reported status after
> completion. It is indeed taking too long time to finish. Nothing I could
> find relevant in the logs.
>
> Thanks and regards,
> Souvik.
>
> On Wed, Dec 12, 2012 at 8:04 AM,  wrote:
>
>> **
>> Hi Souvik
>>
>> Apart from hive jobs is the normal mapreduce jobs like the wordcount
>> running fine on your cluster?
>>
>> If it is working, for the hive jobs are you seeing anything skeptical in
>> task, Tasktracker or jobtracker logs?
>>
>>
>> Regards
>> Bejoy KS
>>
>> Sent from remote device, Please excuse typos
>> --
>> *From: * Souvik Banerjee 
>> *Date: *Tue, 11 Dec 2012 17:12:20 -0600
>> *To: *; 
>> *ReplyTo: * user@hive.apache.org
>> *Subject: *Re: Map side join
>>
>> Hello Everybody,
>>
>> Need help in for on HIVE join. As we were talking about the Map side join
>> I tried that.
>> I set the flag set hive.auto.convert.join=true;
>>
>> I saw Hive converts the same to map join while launching the job. But the
>> problem is that none of the map job progresses in my case. I made the
>> dataset smaller. Now it's only 512 MB cross 25 MB. I was expecting it to be
>> done very quickly.
>> No luck with any change of settings.
>> Failing to progress with the default setting changes these settings.
>> set hive.mapred.local.mem=1024; // Initially it was 216 I guess
>> set hive.join.cache.size=10; // Initialliu it was 25000
>>
>> Also on Hadoop side I made this changes
>>
>> mapred.child.java.opts -Xmx1073741824
>>
>> But I don't see any progress. After more than 40 minutes of run I am at
>> 0% map completion state.
>> Can you please throw some light on this?
>>
>> Thanks a lot once again.
>>
>> Regards,
>> Souvik.
>>
>>
>>
>> On Fri, Dec 7, 2012 at 2:32 PM, Souvik Banerjee > > wrote:
>>
>>> Hi Bejoy,
>>>
>>> That's wonderful. Thanks for your reply.
>>> What I was wondering if HIVE can do map side join with more than one
>>> condition on JOIN clause.
>>> I'll simply try it out and post the result.
>>>
>>> Thanks once again.
>>>
>>> Regards,
>>> Souvik.
>>>
>>>  On Fri, Dec 7, 2012 at 2:10 PM,  wrote:
>>>
>>>> **
>>>> Hi Souvik
>>>>
>>>> In earlier versions of hive you had to give the map join hint. But in
>>>> later versions just set hive.auto.convert.join = true;
>>>> Hive automatically selects the smaller table. It is better to give the
>>>> smaller table as the first one in join.
>>>>
>>>> You can use a map join if you are joining a small table with a large
>>>> one, in terms of data size. By small, better to have the smaller table size
>>>> in range of MBs.
>>>> Regards
>>>> Bejoy KS
>>>>
>>>> Sent from remote device, Plea

Re: Map side join

2012-12-13 Thread Souvik Banerjee
Hi Bejoy,

The input files are non-compressed text files.
There are enough free slots in the cluster.

Can you please let me know how I can increase the number of mappers?
I tried reducing the HDFS block size to 32 MB from 128 MB. I was expecting
to get more mappers, but it still launches the same number of mappers as it
did while the HDFS block size was 128 MB. I have enough map slots
available, but I am not able to utilize them.


Thanks and regards,
Souvik.


On Thu, Dec 13, 2012 at 11:12 AM,  wrote:

> **
> Hi Souvik
>
> Is your input files compressed using some non splittable compression codec?
>
> Do you have enough free slots while this job is running?
>
> Make sure that the job is not running locally.
>
> Regards
> Bejoy KS
>
> Sent from remote device, Please excuse typos
> --
> *From: * Souvik Banerjee 
> *Date: *Wed, 12 Dec 2012 14:27:27 -0600
> *To: *; 
> *ReplyTo: * user@hive.apache.org
> *Subject: *Re: Map side join
>
> Hi Bejoy,
>
> Yes I ran the pi example. It was fine.
> Regarding the HIVE Job what I found is that it took 4 hrs for the first
> map job to get completed.
> Those map tasks were doing their job and only reported status after
> completion. It is indeed taking too long time to finish. Nothing I could
> find relevant in the logs.
>
> Thanks and regards,
> Souvik.
>
> On Wed, Dec 12, 2012 at 8:04 AM,  wrote:
>
>> **
>> Hi Souvik
>>
>> Apart from hive jobs is the normal mapreduce jobs like the wordcount
>> running fine on your cluster?
>>
>> If it is working, for the hive jobs are you seeing anything skeptical in
>> task, Tasktracker or jobtracker logs?
>>
>>
>> Regards
>> Bejoy KS
>>
>> Sent from remote device, Please excuse typos
>> --
>> *From: * Souvik Banerjee 
>> *Date: *Tue, 11 Dec 2012 17:12:20 -0600
>> *To: *; 
>> *ReplyTo: * user@hive.apache.org
>> *Subject: *Re: Map side join
>>
>> Hello Everybody,
>>
>> Need help in for on HIVE join. As we were talking about the Map side join
>> I tried that.
>> I set the flag set hive.auto.convert.join=true;
>>
>> I saw Hive converts the same to map join while launching the job. But the
>> problem is that none of the map job progresses in my case. I made the
>> dataset smaller. Now it's only 512 MB cross 25 MB. I was expecting it to be
>> done very quickly.
>> No luck with any change of settings.
>> Failing to progress with the default setting changes these settings.
>> set hive.mapred.local.mem=1024; // Initially it was 216 I guess
>> set hive.join.cache.size=10; // Initialliu it was 25000
>>
>> Also on Hadoop side I made this changes
>>
>> mapred.child.java.opts -Xmx1073741824
>>
>> But I don't see any progress. After more than 40 minutes of run I am at
>> 0% map completion state.
>> Can you please throw some light on this?
>>
>> Thanks a lot once again.
>>
>> Regards,
>> Souvik.
>>
>>
>>
>> On Fri, Dec 7, 2012 at 2:32 PM, Souvik Banerjee > > wrote:
>>
>>> Hi Bejoy,
>>>
>>> That's wonderful. Thanks for your reply.
>>> What I was wondering if HIVE can do map side join with more than one
>>> condition on JOIN clause.
>>> I'll simply try it out and post the result.
>>>
>>> Thanks once again.
>>>
>>> Regards,
>>> Souvik.
>>>
>>>  On Fri, Dec 7, 2012 at 2:10 PM,  wrote:
>>>
>>>> **
>>>> Hi Souvik
>>>>
>>>> In earlier versions of hive you had to give the map join hint. But in
>>>> later versions just set hive.auto.convert.join = true;
>>>> Hive automatically selects the smaller table. It is better to give the
>>>> smaller table as the first one in join.
>>>>
>>>> You can use a map join if you are joining a small table with a large
>>>> one, in terms of data size. By small, better to have the smaller table size
>>>> in range of MBs.
>>>> Regards
>>>> Bejoy KS
>>>>
>>>> Sent from remote device, Please excuse typos
>>>> --
>>>> *From: *Souvik Banerjee 
>>>> *Date: *Fri, 7 Dec 2012 13:58:25 -0600
>>>> *To: *
>>>> *ReplyTo: *user@hive.apache.org
>>>> *Subject: *Map side join
>>>>
>>>> Hello everybody,
>>>>
>>>> I have got a question. I didn't came across any post which says
>>>> somethign about this.
>>>> I have got two tables. Lets say A and B.
>>>> I want to join A & B in HIVE. I am currently using HIVE 0.9 version.
>>>> The join would be on few columns. like on (A.id1 = B.id1) AND (A.id2 =
>>>> B.id2) AND (A.id3 = B.id3)
>>>>
>>>> Can I ask HIVE to use map side join in this scenario? Should I give a
>>>> hint to HIVE by saying /*+mapjoin(B)*/
>>>>
>>>> Get back to me if you want any more information in this regard.
>>>>
>>>> Thanks and regards,
>>>> Souvik.
>>>>
>>>
>>>
>>
>


Re: Map side join

2012-12-13 Thread bejoy_ks
Hi Souvik

Are your input files compressed using some non-splittable compression codec?

Do you have enough free slots while this job is running?

Make sure that the job is not running locally.

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Souvik Banerjee 
Date: Wed, 12 Dec 2012 14:27:27 
To: ; 
Reply-To: user@hive.apache.org
Subject: Re: Map side join

Hi Bejoy,

Yes I ran the pi example. It was fine.
Regarding the HIVE Job what I found is that it took 4 hrs for the first map
job to get completed.
Those map tasks were doing their job and only reported status after
completion. It is indeed taking too long time to finish. Nothing I could
find relevant in the logs.

Thanks and regards,
Souvik.

On Wed, Dec 12, 2012 at 8:04 AM,  wrote:

> **
> Hi Souvik
>
> Apart from hive jobs is the normal mapreduce jobs like the wordcount
> running fine on your cluster?
>
> If it is working, for the hive jobs are you seeing anything skeptical in
> task, Tasktracker or jobtracker logs?
>
>
> Regards
> Bejoy KS
>
> Sent from remote device, Please excuse typos
> --
> *From: * Souvik Banerjee 
> *Date: *Tue, 11 Dec 2012 17:12:20 -0600
> *To: *; 
> *ReplyTo: * user@hive.apache.org
> *Subject: *Re: Map side join
>
> Hello Everybody,
>
> Need help in for on HIVE join. As we were talking about the Map side join
> I tried that.
> I set the flag set hive.auto.convert.join=true;
>
> I saw Hive converts the same to map join while launching the job. But the
> problem is that none of the map job progresses in my case. I made the
> dataset smaller. Now it's only 512 MB cross 25 MB. I was expecting it to be
> done very quickly.
> No luck with any change of settings.
> Failing to progress with the default setting changes these settings.
> set hive.mapred.local.mem=1024; // Initially it was 216 I guess
> set hive.join.cache.size=10; // Initialliu it was 25000
>
> Also on Hadoop side I made this changes
>
> mapred.child.java.opts -Xmx1073741824
>
> But I don't see any progress. After more than 40 minutes of run I am at 0%
> map completion state.
> Can you please throw some light on this?
>
> Thanks a lot once again.
>
> Regards,
> Souvik.
>
>
>
> On Fri, Dec 7, 2012 at 2:32 PM, Souvik Banerjee 
> wrote:
>
>> Hi Bejoy,
>>
>> That's wonderful. Thanks for your reply.
>> What I was wondering is whether HIVE can do a map side join with more than one
>> condition in the JOIN clause.
>> I'll simply try it out and post the result.
>>
>> Thanks once again.
>>
>> Regards,
>> Souvik.
>>
>>  On Fri, Dec 7, 2012 at 2:10 PM,  wrote:
>>
>>> **
>>> Hi Souvik
>>>
>>> In earlier versions of hive you had to give the map join hint. But in
>>> later versions just set hive.auto.convert.join = true;
>>> Hive automatically selects the smaller table. It is better to give the
>>> smaller table as the first one in join.
>>>
>>> You can use a map join if you are joining a small table with a large
>>> one, in terms of data size. By small, better to have the smaller table size
>>> in range of MBs.
>>> Regards
>>> Bejoy KS
>>>
>>> Sent from remote device, Please excuse typos
>>> --
>>> *From: *Souvik Banerjee 
>>> *Date: *Fri, 7 Dec 2012 13:58:25 -0600
>>> *To: *
>>> *ReplyTo: *user@hive.apache.org
>>> *Subject: *Map side join
>>>
>>> Hello everybody,
>>>
>>> I have got a question. I didn't come across any post which says
>>> something about this.
>>> I have got two tables. Let's say A and B.
>>> I want to join A & B in HIVE. I am currently using HIVE 0.9 version.
>>> The join would be on few columns. like on (A.id1 = B.id1) AND (A.id2 =
>>> B.id2) AND (A.id3 = B.id3)
>>>
>>> Can I ask HIVE to use map side join in this scenario? Should I give a
>>> hint to HIVE by saying /*+mapjoin(B)*/
>>>
>>> Get back to me if you want any more information in this regard.
>>>
>>> Thanks and regards,
>>> Souvik.
>>>
>>
>>
>

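A quick way to act on the two checks Bejoy lists above, as a sketch (hive.exec.mode.local.auto is a real setting that defaults to false; the values here are only illustrative):

-- make sure the query really runs on the cluster and not in a local runner
set hive.exec.mode.local.auto=false;
set hive.auto.convert.join=true;

On the compression point: a single non-splittable file such as one large .gz hands the whole 512 MB side to one map task, which can look like a hang even though the job is simply slow; checking the file extensions under the table's HDFS location is usually enough to tell.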


Re: map side join with group by

2012-12-13 Thread Nitin Pawar
That's because the first job's join keys and the second job's group by keys
are different. You just can't assume the join keys and group by keys will be
the same, so they are two different jobs.


On Thu, Dec 13, 2012 at 8:26 PM, Chen Song  wrote:

> Yeah, my abridged version of query might be a little broken but my point
> is that when a query has a map join and group by, even in its simplified
> incarnation, it will launch two jobs. I was just wondering why map join and
> group by cannot be accomplished in one MR job.
>
> Best,
> Chen
>
>
> On Thu, Dec 13, 2012 at 12:30 AM, Nitin Pawar wrote:
>
>> I think Chen wanted to know why this is a two-phase query, if I understood
>> it correctly
>>
>> When you run a mapside join .. it just performs the join query .. after
>> that to execute the group by part it launches the second job.
>> I may be wrong but this is how I saw it whenever I executed group by
>> queries
>>
>>
>> On Thu, Dec 13, 2012 at 7:11 AM, Mark Grover > > wrote:
>>
>>> Hi Chen,
>>> I think we would need some more information.
>>>
>>> The query is referring to a table called "d" in the MAPJOIN hint but
>>> there is no such table in the query. Moreover, map joins only make
>>> sense when the right table is the one being "mapped" (in other words,
>>> being kept in memory) in case of a Left Outer Join, similarly if the
>>> left table is the one being "mapped" in case of a Right Outer Join.
>>> Let me know if this is not clear, I'd be happy to offer a better
>>> explanation.
>>>
>>> In your query, the where clause on a column called "hour", at this
>>> point I am unsure if that's a column of table1 or table2. If it's
>>> column on table1, that predicate would get pushed up (if you have
>>> hive.optimize.ppd property set to true), so it could possibly be done
>>> in 1 MR job (I am not sure if that's presently the case, you will have
>>> to check the explain plan). If however, the where clause is on a
>>> column in the right table (table2 in your example), it can't be pushed
>>> up since a column of the right table can have different values before
>>> and after the LEFT OUTER JOIN. Therefore, the where clause would need
>>> to be applied in a separate MR job.
>>>
>>> This is just my understanding; the foolproof answer would lie in
>>> checking out the explain plans and the Semantic Analyzer code.
>>>
>>> And for completeness, there is a conditional task (starting Hive 0.7)
>>> that will convert your joins automatically to map joins where
>>> applicable. This can be enabled by enabling hive.auto.convert.join
>>> property.
>>>
>>> Mark
>>>
>>> On Wed, Dec 12, 2012 at 3:32 PM, Chen Song 
>>> wrote:
>>> > I have a silly question on how Hive interprets a simple query with
>>> both map
>>> > side join and group by.
>>> >
>>> > Below query will translate into two jobs, with the 1st one as a map
>>> only job
>>> > doing the join and storing the output in a intermediary location, and
>>> the
>>> > 2nd one as a map-reduce job taking the output of the 1st job as input
>>> and
>>> > doing the group by.
>>> >
>>> > SELECT
>>> > /*+ MAPJOIN(d) */
>>> > table.a, sum(table2.b)
>>> > from table
>>> > LEFT OUTER JOIN table2
>>> > ON table.id = table2.id
>>> > where hour = '2012-12-11 11'
>>> > group by table.a
>>> >
>>> > Why can't this be done within a single map reduce job? As what I can
>>> see
>>> > from the query plan is that all 2nd job mapper do is taking the 1st
>>> job's
>>> > mapper output.
>>> >
>>> > --
>>> > Chen Song
>>> >
>>> >
>>>
>>
>>
>>
>> --
>> Nitin Pawar
>>
>
>
>
> --
> Chen Song
>
>
>


-- 
Nitin Pawar


Re: map side join with group by

2012-12-13 Thread Chen Song
Yeah, my abridged version of the query might be a little broken, but my point is
that when a query has a map join and group by, even in its simplified
incarnation, it will launch two jobs. I was just wondering why map join and
group by cannot be accomplished in one MR job.

Best,
Chen

On Thu, Dec 13, 2012 at 12:30 AM, Nitin Pawar wrote:

> I think Chen wanted to know why this is a two-phase query, if I understood
> it correctly
>
> When you run a mapside join .. it just performs the join query .. after
> that to execute the group by part it launches the second job.
> I may be wrong but this is how I saw it whenever I executed group by
> queries
>
>
> On Thu, Dec 13, 2012 at 7:11 AM, Mark Grover 
> wrote:
>
>> Hi Chen,
>> I think we would need some more information.
>>
>> The query is referring to a table called "d" in the MAPJOIN hint but
>> there is no such table in the query. Moreover, map joins only make
>> sense when the right table is the one being "mapped" (in other words,
>> being kept in memory) in case of a Left Outer Join, similarly if the
>> left table is the one being "mapped" in case of a Right Outer Join.
>> Let me know if this is not clear, I'd be happy to offer a better
>> explanation.
>>
>> In your query, the where clause on a column called "hour", at this
>> point I am unsure if that's a column of table1 or table2. If it's
>> column on table1, that predicate would get pushed up (if you have
>> hive.optimize.ppd property set to true), so it could possibly be done
>> in 1 MR job (I am not sure if that's presently the case, you will have
>> to check the explain plan). If however, the where clause is on a
>> column in the right table (table2 in your example), it can't be pushed
>> up since a column of the right table can have different values before
>> and after the LEFT OUTER JOIN. Therefore, the where clause would need
>> to be applied in a separate MR job.
>>
>> This is just my understanding; the foolproof answer would lie in
>> checking out the explain plans and the Semantic Analyzer code.
>>
>> And for completeness, there is a conditional task (starting Hive 0.7)
>> that will convert your joins automatically to map joins where
>> applicable. This can be enabled by enabling hive.auto.convert.join
>> property.
>>
>> Mark
>>
>> On Wed, Dec 12, 2012 at 3:32 PM, Chen Song 
>> wrote:
>> > I have a silly question on how Hive interprets a simple query with
>> both map
>> > side join and group by.
>> >
>> > Below query will translate into two jobs, with the 1st one as a map
>> only job
>> > doing the join and storing the output in a intermediary location, and
>> the
>> > 2nd one as a map-reduce job taking the output of the 1st job as input
>> and
>> > doing the group by.
>> >
>> > SELECT
>> > /*+ MAPJOIN(d) */
>> > table.a, sum(table2.b)
>> > from table
>> > LEFT OUTER JOIN table2
>> > ON table.id = table2.id
>> > where hour = '2012-12-11 11'
>> > group by table.a
>> >
>> > Why can't this be done within a single map reduce job? As what I can see
>> > from the query plan is that all 2nd job mapper do is taking the 1st
>> job's
>> > mapper output.
>> >
>> > --
>> > Chen Song
>> >
>> >
>>
>
>
>
> --
> Nitin Pawar
>



-- 
Chen Song


Re: map side join with group by

2012-12-12 Thread Nitin Pawar
I think Chen wanted to know why this is a two-phase query, if I understood it
correctly.

When you run a map-side join, it just performs the join query; after
that, to execute the group by part, it launches the second job.
I may be wrong, but this is how I saw it whenever I executed group by
queries.


On Thu, Dec 13, 2012 at 7:11 AM, Mark Grover wrote:

> Hi Chen,
> I think we would need some more information.
>
> The query is referring to a table called "d" in the MAPJOIN hint but
> there is no such table in the query. Moreover, map joins only make
> sense when the right table is the one being "mapped" (in other words,
> being kept in memory) in case of a Left Outer Join, similarly if the
> left table is the one being "mapped" in case of a Right Outer Join.
> Let me know if this is not clear, I'd be happy to offer a better
> explanation.
>
> In your query, the where clause on a column called "hour", at this
> point I am unsure if that's a column of table1 or table2. If it's
> column on table1, that predicate would get pushed up (if you have
> hive.optimize.ppd property set to true), so it could possibly be done
> in 1 MR job (I am not sure if that's presently the case, you will have
> to check the explain plan). If however, the where clause is on a
> column in the right table (table2 in your example), it can't be pushed
> up since a column of the right table can have different values before
> and after the LEFT OUTER JOIN. Therefore, the where clause would need
> to be applied in a separate MR job.
>
> This is just my understanding; the foolproof answer would lie in
> checking out the explain plans and the Semantic Analyzer code.
>
> And for completeness, there is a conditional task (starting Hive 0.7)
> that will convert your joins automatically to map joins where
> applicable. This can be enabled by enabling hive.auto.convert.join
> property.
>
> Mark
>
> On Wed, Dec 12, 2012 at 3:32 PM, Chen Song  wrote:
> > I have a silly question on how Hive interprets a simple query with both
> map
> > side join and group by.
> >
> > Below query will translate into two jobs, with the 1st one as a map only
> job
> > doing the join and storing the output in a intermediary location, and the
> > 2nd one as a map-reduce job taking the output of the 1st job as input and
> > doing the group by.
> >
> > SELECT
> > /*+ MAPJOIN(d) */
> > table.a, sum(table2.b)
> > from table
> > LEFT OUTER JOIN table2
> > ON table.id = table2.id
> > where hour = '2012-12-11 11'
> > group by table.a
> >
> > Why can't this be done within a single map reduce job? As what I can see
> > from the query plan is that all 2nd job mapper do is taking the 1st job's
> > mapper output.
> >
> > --
> > Chen Song
> >
> >
>



-- 
Nitin Pawar

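One way to confirm the stage split Nitin describes, rather than guessing, is to ask Hive for the plan. A minimal sketch on the thread's example query (table1, table2, and the hour column are the names used in this thread; plan output omitted):

set hive.auto.convert.join=true;
EXPLAIN
SELECT t1.a, sum(t2.b)
FROM table1 t1
LEFT OUTER JOIN table2 t2 ON (t1.id = t2.id)
WHERE t1.hour = '2012-12-11 11'
GROUP BY t1.a;

On the 0.9-era releases discussed here, the plan lists the map-join work and the group-by map-reduce work as separate stages, which matches the two jobs Chen observes.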

Re: map side join with group by

2012-12-12 Thread Mark Grover
Hi Chen,
I think we would need some more information.

The query is referring to a table called "d" in the MAPJOIN hint but
there is no such table in the query. Moreover, map joins only make
sense when the right table is the one being "mapped" (in other words,
being kept in memory) in case of a Left Outer Join, similarly if the
left table is the one being "mapped" in case of a Right Outer Join.
Let me know if this is not clear, I'd be happy to offer a better
explanation.

In your query, there is a where clause on a column called "hour"; at this
point I am unsure whether that's a column of table1 or table2. If it's a
column on table1, that predicate would get pushed up (if you have
hive.optimize.ppd property set to true), so it could possibly be done
in 1 MR job (I am not sure if that's presently the case, you will have
to check the explain plan). If however, the where clause is on a
column in the right table (table2 in your example), it can't be pushed
up since a column of the right table can have different values before
and after the LEFT OUTER JOIN. Therefore, the where clause would need
to be applied in a separate MR job.

This is just my understanding; the foolproof answer would lie in
checking out the explain plans and the Semantic Analyzer code.

And for completeness, there is a conditional task (starting Hive 0.7)
that will convert your joins automatically to map joins where
applicable. This can be enabled by setting the hive.auto.convert.join
property.

Mark

On Wed, Dec 12, 2012 at 3:32 PM, Chen Song  wrote:
> I have a silly question on how Hive interprets a simple query with both map
> side join and group by.
>
> Below query will translate into two jobs, with the 1st one as a map only job
> doing the join and storing the output in a intermediary location, and the
> 2nd one as a map-reduce job taking the output of the 1st job as input and
> doing the group by.
>
> SELECT
> /*+ MAPJOIN(d) */
> table.a, sum(table2.b)
> from table
> LEFT OUTER JOIN table2
> ON table.id = table2.id
> where hour = '2012-12-11 11'
> group by table.a
>
> Why can't this be done within a single map reduce job? As what I can see
> from the query plan is that all 2nd job mapper do is taking the 1st job's
> mapper output.
>
> --
> Chen Song
>
>

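To make Mark's points concrete, a small sketch using the thread's own names (hive.optimize.ppd and hive.auto.convert.join are real settings; the query is the thread's example, with the hint corrected to name a table that actually appears in it):

-- allow a filter on the left table to be pushed toward its scan
set hive.optimize.ppd=true;

-- for a LEFT OUTER JOIN the table kept in memory should be the right-hand
-- one, so the hint names table2 (aliased t2) rather than a missing "d"
SELECT /*+ MAPJOIN(t2) */ t1.a, sum(t2.b)
FROM table1 t1
LEFT OUTER JOIN table2 t2 ON (t1.id = t2.id)
WHERE t1.hour = '2012-12-11 11'
GROUP BY t1.a;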

map side join with group by

2012-12-12 Thread Chen Song
I have a silly question on how Hive interprets a simple query with both
map side join and group by.

The query below will translate into two jobs, with the 1st one as a map-only
job doing the join and storing the output in an intermediate location, and
the 2nd one as a map-reduce job taking the output of the 1st job as input
and doing the group by.

SELECT
/*+ MAPJOIN(d) */
table.a, sum(table2.b)
from table
LEFT OUTER JOIN table2
ON table.id = table2.id
where hour = '2012-12-11 11'
group by table.a

Why can't this be done within a single map reduce job? From what I can see
in the query plan, all the 2nd job's mappers do is take the 1st job's
mapper output.

-- 
Chen Song


Re: Map side join

2012-12-12 Thread Souvik Banerjee
Hi Bejoy,

Yes I ran the pi example. It was fine.
Regarding the HIVE job, what I found is that it took 4 hrs for the first map
job to complete.
Those map tasks were doing their job and only reported status after
completion. It is indeed taking too long to finish. I could find nothing
relevant in the logs.

Thanks and regards,
Souvik.

On Wed, Dec 12, 2012 at 8:04 AM,  wrote:

> **
> Hi Souvik
>
> Apart from hive jobs, are normal mapreduce jobs like wordcount
> running fine on your cluster?
>
> If they are working, are you seeing anything suspicious in the task,
> TaskTracker or JobTracker logs for the hive jobs?
>
>
> Regards
> Bejoy KS
>
> Sent from remote device, Please excuse typos
> --
> *From: * Souvik Banerjee 
> *Date: *Tue, 11 Dec 2012 17:12:20 -0600
> *To: *; 
> *ReplyTo: * user@hive.apache.org
> *Subject: *Re: Map side join
>
> Hello Everybody,
>
> Need help on a HIVE join. As we were talking about the Map side join,
> I tried that.
> I set the flag set hive.auto.convert.join=true;
>
> I saw Hive converts the same to map join while launching the job. But the
> problem is that none of the map job progresses in my case. I made the
> dataset smaller. Now it's only 512 MB cross 25 MB. I was expecting it to be
> done very quickly.
> No luck with any change of settings.
> Failing to make progress with the defaults, I changed these settings.
> set hive.mapred.local.mem=1024; // Initially it was 216 I guess
> set hive.join.cache.size=10; // Initially it was 25000
>
> Also on the Hadoop side I made this change
>
> mapred.child.java.opts -Xmx1073741824
>
> But I don't see any progress. After more than 40 minutes of run I am at 0%
> map completion state.
> Can you please throw some light on this?
>
> Thanks a lot once again.
>
> Regards,
> Souvik.
>
>
>
> On Fri, Dec 7, 2012 at 2:32 PM, Souvik Banerjee 
> wrote:
>
>> Hi Bejoy,
>>
>> That's wonderful. Thanks for your reply.
>> What I was wondering is whether HIVE can do a map side join with more than one
>> condition in the JOIN clause.
>> I'll simply try it out and post the result.
>>
>> Thanks once again.
>>
>> Regards,
>> Souvik.
>>
>>  On Fri, Dec 7, 2012 at 2:10 PM,  wrote:
>>
>>> **
>>> Hi Souvik
>>>
>>> In earlier versions of hive you had to give the map join hint. But in
>>> later versions just set hive.auto.convert.join = true;
>>> Hive automatically selects the smaller table. It is better to give the
>>> smaller table as the first one in join.
>>>
>>> You can use a map join if you are joining a small table with a large
>>> one, in terms of data size. By small, better to have the smaller table size
>>> in range of MBs.
>>> Regards
>>> Bejoy KS
>>>
>>> Sent from remote device, Please excuse typos
>>> --
>>> *From: *Souvik Banerjee 
>>> *Date: *Fri, 7 Dec 2012 13:58:25 -0600
>>> *To: *
>>> *ReplyTo: *user@hive.apache.org
>>> *Subject: *Map side join
>>>
>>> Hello everybody,
>>>
>>> I have got a question. I didn't come across any post which says
>>> something about this.
>>> I have got two tables. Let's say A and B.
>>> I want to join A & B in HIVE. I am currently using HIVE 0.9 version.
>>> The join would be on few columns. like on (A.id1 = B.id1) AND (A.id2 =
>>> B.id2) AND (A.id3 = B.id3)
>>>
>>> Can I ask HIVE to use map side join in this scenario? Should I give a
>>> hint to HIVE by saying /*+mapjoin(B)*/
>>>
>>> Get back to me if you want any more information in this regard.
>>>
>>> Thanks and regards,
>>> Souvik.
>>>
>>
>>
>


Re: Map side join

2012-12-12 Thread bejoy_ks
Hi Souvik

Apart from hive jobs, are normal mapreduce jobs like wordcount running
fine on your cluster?

If they are working, are you seeing anything suspicious in the task,
TaskTracker or JobTracker logs for the hive jobs?


Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Souvik Banerjee 
Date: Tue, 11 Dec 2012 17:12:20 
To: ; 
Reply-To: user@hive.apache.org
Subject: Re: Map side join

Hello Everybody,

Need help on a HIVE join. As we were talking about the Map side join, I
tried that.
I set the flag set hive.auto.convert.join=true;

I saw Hive converts the same to map join while launching the job. But the
problem is that none of the map tasks makes any progress in my case. I made the
dataset smaller. Now it's only 512 MB joined with 25 MB. I was expecting it to be
done very quickly.
No luck with any change of settings.
Failing to make progress with the defaults, I changed these settings.
set hive.mapred.local.mem=1024; // Initially it was 216 I guess
set hive.join.cache.size=10; // Initially it was 25000

Also on the Hadoop side I made this change

mapred.child.java.opts -Xmx1073741824

But I don't see any progress. After more than 40 minutes of run I am at 0%
map completion state.
Can you please throw some light on this?

Thanks a lot once again.

Regards,
Souvik.



On Fri, Dec 7, 2012 at 2:32 PM, Souvik Banerjee wrote:

> Hi Bejoy,
>
> That's wonderful. Thanks for your reply.
> What I was wondering is whether HIVE can do a map side join with more than one
> condition in the JOIN clause.
> I'll simply try it out and post the result.
>
> Thanks once again.
>
> Regards,
> Souvik.
>
>  On Fri, Dec 7, 2012 at 2:10 PM,  wrote:
>
>> **
>> Hi Souvik
>>
>> In earlier versions of hive you had to give the map join hint. But in
>> later versions just set hive.auto.convert.join = true;
>> Hive automatically selects the smaller table. It is better to give the
>> smaller table as the first one in join.
>>
>> You can use a map join if you are joining a small table with a large one,
>> in terms of data size. By small, better to have the smaller table size in
>> range of MBs.
>> Regards
>> Bejoy KS
>>
>> Sent from remote device, Please excuse typos
>> ----------
>> *From: *Souvik Banerjee 
>> *Date: *Fri, 7 Dec 2012 13:58:25 -0600
>> *To: *
>> *ReplyTo: *user@hive.apache.org
>> *Subject: *Map side join
>>
>> Hello everybody,
>>
>> I have got a question. I didn't come across any post which says something
>> about this.
>> I have got two tables. Let's say A and B.
>> I want to join A & B in HIVE. I am currently using HIVE 0.9 version.
>> The join would be on few columns. like on (A.id1 = B.id1) AND (A.id2 =
>> B.id2) AND (A.id3 = B.id3)
>>
>> Can I ask HIVE to use map side join in this scenario? Should I give a
>> hint to HIVE by saying /*+mapjoin(B)*/
>>
>> Get back to me if you want any more information in this regard.
>>
>> Thanks and regards,
>> Souvik.
>>
>
>



Re: Map side join

2012-12-11 Thread Souvik Banerjee
Hello Everybody,

Need help on a HIVE join. As we were talking about the Map side join, I
tried that.
I set the flag set hive.auto.convert.join=true;

I saw Hive converts the same to map join while launching the job. But the
problem is that none of the map tasks makes any progress in my case. I made the
dataset smaller. Now it's only 512 MB joined with 25 MB. I was expecting it to be
done very quickly.
No luck with any change of settings.
Failing to make progress with the defaults, I changed these settings.
set hive.mapred.local.mem=1024; // Initially it was 216 I guess
set hive.join.cache.size=10; // Initially it was 25000

Also on the Hadoop side I made this change

mapred.child.java.opts -Xmx1073741824

But I don't see any progress. After more than 40 minutes of run I am at 0%
map completion state.
Can you please throw some light on this?

Thanks a lot once again.

Regards,
Souvik.



On Fri, Dec 7, 2012 at 2:32 PM, Souvik Banerjee wrote:

> Hi Bejoy,
>
> That's wonderful. Thanks for your reply.
> What I was wondering is whether HIVE can do a map side join with more than one
> condition in the JOIN clause.
> I'll simply try it out and post the result.
>
> Thanks once again.
>
> Regards,
> Souvik.
>
>  On Fri, Dec 7, 2012 at 2:10 PM,  wrote:
>
>> **
>> Hi Souvik
>>
>> In earlier versions of hive you had to give the map join hint. But in
>> later versions just set hive.auto.convert.join = true;
>> Hive automatically selects the smaller table. It is better to give the
>> smaller table as the first one in join.
>>
>> You can use a map join if you are joining a small table with a large one,
>> in terms of data size. By small, better to have the smaller table size in
>> range of MBs.
>> Regards
>> Bejoy KS
>>
>> Sent from remote device, Please excuse typos
>> ----------
>> *From: *Souvik Banerjee 
>> *Date: *Fri, 7 Dec 2012 13:58:25 -0600
>> *To: *
>> *ReplyTo: *user@hive.apache.org
>> *Subject: *Map side join
>>
>> Hello everybody,
>>
>> I have got a question. I didn't come across any post which says something
>> about this.
>> I have got two tables. Let's say A and B.
>> I want to join A & B in HIVE. I am currently using HIVE 0.9 version.
>> The join would be on few columns. like on (A.id1 = B.id1) AND (A.id2 =
>> B.id2) AND (A.id3 = B.id3)
>>
>> Can I ask HIVE to use map side join in this scenario? Should I give a
>> hint to HIVE by saying /*+mapjoin(B)*/
>>
>> Get back to me if you want any more information in this regard.
>>
>> Thanks and regards,
>> Souvik.
>>
>
>

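A hedged sketch of making the memory knobs above consistent (the property names are real; the units — MB for hive.mapred.local.mem, a JVM flag for mapred.child.java.opts, bytes for hive.mapjoin.smalltable.filesize — reflect my reading of the 0.9-era defaults, so verify against your build):

set hive.auto.convert.join=true;
-- size threshold in bytes under which the small side is auto-converted (default is roughly 25 MB)
set hive.mapjoin.smalltable.filesize=30000000;
-- memory in MB for the local task that builds the hash table
set hive.mapred.local.mem=1024;
-- child JVM heap: the same 1 GB, written in MB instead of bytes
set mapred.child.java.opts=-Xmx1024m;

If the converted join still sits at 0%, the local hash-table build log on the client machine is usually more informative than the TaskTracker logs.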

Re: Map side join

2012-12-07 Thread Souvik Banerjee
Hi Bejoy,

That's wonderful. Thanks for your reply.
What I was wondering is whether HIVE can do a map side join with more than one
condition in the JOIN clause.
I'll simply try it out and post the result.

Thanks once again.

Regards,
Souvik.

On Fri, Dec 7, 2012 at 2:10 PM,  wrote:

> **
> Hi Souvik
>
> In earlier versions of hive you had to give the map join hint. But in
> later versions just set hive.auto.convert.join = true;
> Hive automatically selects the smaller table. It is better to give the
> smaller table as the first one in join.
>
> You can use a map join if you are joining a small table with a large one,
> in terms of data size. By small, better to have the smaller table size in
> range of MBs.
> Regards
> Bejoy KS
>
> Sent from remote device, Please excuse typos
> --
> *From: *Souvik Banerjee 
> *Date: *Fri, 7 Dec 2012 13:58:25 -0600
> *To: *
> *ReplyTo: *user@hive.apache.org
> *Subject: *Map side join
>
> Hello everybody,
>
> I have got a question. I didn't come across any post which says something
> about this.
> I have got two tables. Let's say A and B.
> I want to join A & B in HIVE. I am currently using HIVE 0.9 version.
> The join would be on few columns. like on (A.id1 = B.id1) AND (A.id2 =
> B.id2) AND (A.id3 = B.id3)
>
> Can I ask HIVE to use map side join in this scenario? Should I give a hint
> to HIVE by saying /*+mapjoin(B)*/
>
> Get back to me if you want any more information in this regard.
>
> Thanks and regards,
> Souvik.
>


Re: Map side join

2012-12-07 Thread bejoy_ks
Hi Souvik

In earlier versions of hive you had to give the map join hint, but in later
versions just set hive.auto.convert.join = true;
Hive automatically selects the smaller table. It is better to list the smaller
table first in the join.

You can use a map join if you are joining a small table with a large one, in
terms of data size. By small, the smaller table should ideally be in the range
of MBs.

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Souvik Banerjee 
Date: Fri, 7 Dec 2012 13:58:25 
To: 
Reply-To: user@hive.apache.org
Subject: Map side join

Hello everybody,

I have got a question. I didn't come across any post which says something
about this.
I have got two tables. Let's say A and B.
I want to join A & B in HIVE. I am currently using HIVE 0.9 version.
The join would be on a few columns, like ON (A.id1 = B.id1) AND (A.id2 =
B.id2) AND (A.id3 = B.id3)

Can I ask HIVE to use map side join in this scenario? Should I give a hint
to HIVE by saying /*+mapjoin(B)*/

Get back to me if you want any more information in this regard.

Thanks and regards,
Souvik.

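To make the two options concrete for Souvik's multi-column ON clause, a minimal sketch (a compound equality condition is no obstacle to a map join as long as B stays in the MB range):

-- Hive 0.9 with the explicit hint:
SELECT /*+ MAPJOIN(b) */ a.*
FROM A a
JOIN B b ON (a.id1 = b.id1 AND a.id2 = b.id2 AND a.id3 = b.id3);

-- or let Hive convert it automatically, as described above:
set hive.auto.convert.join=true;
SELECT a.*
FROM A a
JOIN B b ON (a.id1 = b.id1 AND a.id2 = b.id2 AND a.id3 = b.id3);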


Map side join

2012-12-07 Thread Souvik Banerjee
Hello everybody,

I have got a question. I didn't come across any post which says something
about this.
I have got two tables. Let's say A and B.
I want to join A & B in HIVE. I am currently using HIVE 0.9 version.
The join would be on a few columns, like ON (A.id1 = B.id1) AND (A.id2 =
B.id2) AND (A.id3 = B.id3).

Can I ask HIVE to use a map side join in this scenario? Should I give a hint
to HIVE by saying /*+ MAPJOIN(B) */?

Get back to me if you want any more information in this regard.

Thanks and regards,
Souvik.


Re: Map side join and Serde jar in distributed cache missing

2012-09-24 Thread Aniket Mokashi
Just a guess - put your jar on the Hadoop classpath.

On Mon, Sep 24, 2012 at 5:45 PM, Abhishek Pratap Singh
wrote:

> I'm using hive-0.7.1
>
>
> On Mon, Sep 24, 2012 at 5:10 PM, Edward Capriolo wrote:
>
>> I have noticed this as well with hive 0.7.0. Not sure what CDH is
>> based on but newer versions could suffer as well. What version of hive
>> do you have?
>>
>> On Mon, Sep 24, 2012 at 7:30 PM, Abhishek Pratap Singh
>>  wrote:
>> > Hi all,
>> >
>> > I have enabled automatic Map join for any table less than 50MB. This
>> table
>> > needs a custom serde, which is added every time using add Jars(size of
>> jar
>> > is 25KB) in hive.
>> > The problem is when hive performs map side join, classes in that serde
>> jar
>> > is not loaded and class not found exception is thrown. But if I disable
>> map
>> > side join, it works perfectly fine which showcase that the distributed
>> cache
>> > is working and serde jar is available.
>> > Any idea what is happening in Map side join? I m using CDH3
>> >
>> >
>> > Regards,
>> > Abhishek
>> >
>>
>
>


-- 
"...:::Aniket:::... Quetzalco@tl"

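A sketch of Aniket's suggestion in concrete terms (the jar path below is an example, not from the thread): ADD JAR covers the cluster-side tasks through the distributed cache, but with automatic conversion the map-join hash table is typically built by a local task on the client, so the SerDe class also has to be loadable there; the auxiliary-jars mechanism does that because it is read when the CLI starts rather than per session:

-- in the session, as already done in this thread:
ADD JAR /tmp/custom-serde.jar;
-- plus, before starting the CLI: point hive.aux.jars.path (in hive-site.xml)
-- or the HIVE_AUX_JARS_PATH environment variable at
-- file:///tmp/custom-serde.jar so the local map-join task can load the class.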

Re: Map side join and Serde jar in distributed cache missing

2012-09-24 Thread Abhishek Pratap Singh
I'm using hive-0.7.1

On Mon, Sep 24, 2012 at 5:10 PM, Edward Capriolo wrote:

> I have noticed this as well with hive 0.7.0. Not sure what CDH is
> based on but newer versions could suffer as well. What version of hive
> do you have?
>
> On Mon, Sep 24, 2012 at 7:30 PM, Abhishek Pratap Singh
>  wrote:
> > Hi all,
> >
> > I have enabled automatic Map join for any table less than 50MB. This
> table
> > needs a custom serde, which is added every time using add Jars(size of
> jar
> > is 25KB) in hive.
> > The problem is when hive performs map side join, classes in that serde
> jar
> > is not loaded and class not found exception is thrown. But if I disable
> map
> > side join, it works perfectly fine which showcase that the distributed
> cache
> > is working and serde jar is available.
> > Any idea what is happening in Map side join? I m using CDH3
> >
> >
> > Regards,
> > Abhishek
> >
>


Re: Map side join and Serde jar in distributed cache missing

2012-09-24 Thread Edward Capriolo
I have noticed this as well with hive 0.7.0. Not sure what CDH is
based on but newer versions could suffer as well. What version of hive
do you have?

On Mon, Sep 24, 2012 at 7:30 PM, Abhishek Pratap Singh
 wrote:
> Hi all,
>
> I have enabled automatic Map join for any table less than 50MB. This table
> needs a custom serde, which is added every time using add Jars(size of jar
> is 25KB) in hive.
> The problem is when hive performs map side join, classes in that serde jar
> is not loaded and class not found exception is thrown. But if I disable map
> side join, it works perfectly fine which showcase that the distributed cache
> is working and serde jar is available.
> Any idea what is happening in Map side join? I m using CDH3
>
>
> Regards,
> Abhishek
>


Map side join and Serde jar in distributed cache missing

2012-09-24 Thread Abhishek Pratap Singh
Hi all,

I have enabled automatic map join for any table less than 50MB. This table
needs a custom serde, which is added every time using ADD JAR (the jar is
about 25KB) in hive.
The problem is that when hive performs a map side join, the classes in that
serde jar are not loaded and a class not found exception is thrown. But if I
disable the map side join, it works perfectly fine, which shows that the
distributed cache is working and the serde jar is available.
Any idea what is happening in the map side join? I'm using CDH3.


Regards,
Abhishek


Re: Map side join

2012-06-18 Thread Aniket Mokashi
Hive also has something called UNIQUEJOIN. Maybe you are looking for
that. I cannot find documentation for your reference, but you can do a JIRA
search.
It allows you to join multiple sources on the same key, map-side
(all sources should have the same key).

~Aniket

On Wed, Jun 13, 2012 at 8:20 AM, Tucker, Matt wrote:

> Hi,
>
> Assuming that 4 tables are small enough to fit in the Distributed Cache,
> the joins between the tables all need to join against a common key.
>
> Example:
> set hive.auto.convert.join=true;
> SELECT *
> FROM large
>JOIN smalla ON
>large.key = smalla.key1
>JOIN smallb ON
>large.key = smallb.key2
>JOIN smallc ON
>large.key = smallc.key3
>JOIN smalld ON
>large.key = smalld.key4;
>
> Having a different join key will push the join off into a different task,
> as will the order of the join condition. In this example, large.key was
> always on the left side of the join conditions.
>
>
> Matt Tucker
>
> -Original Message-
> From: Abhishek [mailto:abhishek.dod...@gmail.com]
> Sent: Wednesday, June 13, 2012 11:13 AM
> To: user@hive.apache.org
> Subject: Map side join
>
> Hi all,
>
> How can a map side join in hive be used to join multiple tables (suppose 5
> tables)?
>
> Regards
> Abhishek
>
> Sent from my iPhone
>



-- 
"...:::Aniket:::... Quetzalco@tl"

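For reference, the shape of that construct, recalled from the uniquejoin.q test query in the Hive source rather than from any formal documentation, so treat it as a sketch to verify against your release (T1, T2, T3 and key are placeholders):

FROM UNIQUEJOIN PRESERVE T1 a (a.key), PRESERVE T2 b (b.key), PRESERVE T3 c (c.key)
SELECT a.key, b.key, c.key;

PRESERVE keeps a source's keys in the output even when the other sources have no matching row, and every source must be joined on the same key expression.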

RE: Map side join

2012-06-13 Thread Tucker, Matt
Hi,

Assuming that the 4 tables are small enough to fit in the Distributed Cache, the
joins between the tables all need to be against a common key.

Example:
set hive.auto.convert.join=true;
SELECT *
FROM large
JOIN smalla ON
large.key = smalla.key1
JOIN smallb ON
large.key = smallb.key2
JOIN smallc ON
large.key = smallc.key3
JOIN smalld ON
large.key = smalld.key4;

Having a different join key will push the join off into a different task, as 
will the order of the join condition. In this example, large.key was always on 
the left side of the join conditions.


Matt Tucker

-Original Message-
From: Abhishek [mailto:abhishek.dod...@gmail.com] 
Sent: Wednesday, June 13, 2012 11:13 AM
To: user@hive.apache.org
Subject: Map side join

Hi all,

How can a map side join in hive be used to join multiple tables (suppose 5
tables)?

Regards
Abhishek 

Sent from my iPhone


Map side join

2012-06-13 Thread Abhishek
Hi all,

How can a map side join in hive be used to join multiple tables (suppose 5
tables)?

Regards
Abhishek 

Sent from my iPhone

Re: Hive map side join with Distributed cache

2012-06-11 Thread Harsh J
Hey Abhishek,

Hive manages your dist-cache automatically. What issue are you running
into when trying to map-join, that you wish to solve?

MapJoin docs can be found here:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins

P.s. Also moved this to user@hive.apache.org. BCC'd common-user@ and CC'd you.

On Mon, Jun 11, 2012 at 9:46 PM, abhishek dodda
 wrote:
> hi all,
>
> How do I do a map side join with the distributed cache? Can anyone help
> me with this?
>
> Regards
> Abhishek.



-- 
Harsh J