Re: MultithreadedMapper - Sharing Data Structure

2015-08-24 Thread Harsh J
The MultiThreadedMapper won't solve your problem: all it does is run
parallel maps within the same map task JVM as a non-MT one. Your data
structure won't be shared across the different map task JVMs on the host,
only within each map task's own multiple threads running the map()
function over input records.

Wouldn't doing a reduce-side join for the larger files be much faster?
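
To make this concrete, here is a minimal sketch (hypothetical class, field,
and file names; it assumes File B is tab-separated and shipped via the
distributed cache under a symlink such as fileB.txt) of a lookup map that
exists once per map task JVM and is read only by that task's
MultithreadedMapper threads:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: the static map is built once per task JVM and is
// visible only to that task's threads, not to other tasks on the host.
public class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {

  private static volatile Map<String, String> lookup;

  @Override
  protected void setup(Context context) throws IOException {
    synchronized (LookupMapper.class) {
      if (lookup == null) {
        Map<String, String> m = new HashMap<String, String>();
        BufferedReader r = new BufferedReader(new FileReader("fileB.txt"));
        try {
          String line;
          while ((line = r.readLine()) != null) {
            String[] parts = line.split("\t", 2);
            m.put(parts[0], parts.length > 1 ? parts[1] : "");
          }
        } finally {
          r.close();
        }
        lookup = m;
      }
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Keep only the File A records whose first field appears in File B.
    String joinKey = value.toString().split("\t", 2)[0];
    if (lookup.containsKey(joinKey)) {
      context.write(new Text(joinKey), value);
    }
  }
}

With MultithreadedMapper each thread typically gets its own mapper instance
and its own setup() call, so the null check under the class-level lock is
what keeps File B from being parsed more than once per task JVM; it is still
parsed once per task on every host.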

On Sun, Aug 23, 2015 at 5:08 AM Pedro Magalhaes  wrote:

> I am developing a job that has 30B records in the input path (File A).
> I need to filter these records using another file that can have 30K to
> 180M records (File B).
> So for each record in File A, I will make a lookup in File B.
> I am using the distributed cache to share File B. The problem is that if
> File B is too large (for example 180M records), I spend too much time
> (CPU processing) loading it into a HashMap. I do this allocation in each
> map task.
>
> In Hadoop 2.x, JVM reuse was discontinued. So I am thinking of using
> MultithreadedMapper, making the HashMap thread-safe, and sharing this
> read-only structure across the mappers.
>
> Is this a good approach?


Re: MultithreadedMapper - Sharing Data Structure

2015-08-24 Thread twinkle sachdeva
Hi,

We have been using the JVM reuse feature for the same reason: sharing the
same structure across multiple map tasks. A multithreaded map task does that
partially, since the same copy is used by its multiple threads.


Depending on the available hardware, one can get the same performance.

Thanks,




Re: MultithreadedMapper - Sharing Data Structure

2015-08-24 Thread Harsh J
Perhaps combining the MultithreadedMapper with a CombineFileInputFormat may
help (it reduces the total number of map tasks, but you get more threads per
map task).
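
A rough driver-side sketch of that combination (reusing the hypothetical
LookupMapper from earlier; the 1 GB maximum split size and the 8 threads are
illustrative values, not recommendations):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FilterDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "filter-file-a");
    job.setJarByClass(FilterDriver.class);

    // Fewer, larger map tasks: pack several input blocks into one split.
    job.setInputFormatClass(CombineTextInputFormat.class);
    FileInputFormat.setMaxInputSplitSize(job, 1024L * 1024 * 1024);

    // Each of those larger tasks then runs several map threads over its records.
    job.setMapperClass(MultithreadedMapper.class);
    MultithreadedMapper.setMapperClass(job, LookupMapper.class);
    MultithreadedMapper.setNumberOfThreads(job, 8);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setNumReduceTasks(0); // a pure filter can stay map-only

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Whether this beats simply running more single-threaded map tasks depends on
how CPU-bound the per-record work is, so it is worth measuring both variants.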



Appending file makes Hadoop cluster out of storage.

2015-08-24 Thread Quan Nguyen Hong
Hi all, 

 

Have a good day!

 

I used the code below to append to a file in HDFS from a local file.

 

The local file is 85 MB.

The Hadoop cluster (CDH 5.4.2, HDFS 2.6, replication factor 3) has 140 GB
free.

In a while loop I do:

    FSDataOutputStream out = fs.append(outFile);
    out.write(buffer, 0, bytesRead);
    out.close();

Each time through the loop I append 1024 bytes from the local file to the
HDFS file. This loop makes my cluster run out of storage, and my program has
not been able to finish.

 

Here's the full code.

import java.io.*;
import java.net.URI;
import java.net.URISyntaxException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

public class writeflushexisted {

  public static void main(String[] argv) throws IOException, URISyntaxException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(new URI("hdfs://192.168.94.185:8020"), conf);

    Path inFile = new Path("testdata.txt");
    Path outFile = new Path("/myhdfs/testdata.txt");

    File localFile = new File(inFile.toString());
    // Read from the local file and append to the existing HDFS file.
    FileInputStream in = new FileInputStream(localFile);

    int i = 0;
    byte buffer[] = new byte[1024];
    try {
      int bytesRead = 0;
      while ((bytesRead = in.read(buffer)) > 0) {
        // Re-open the append stream for every 1024-byte buffer, then close it again.
        FSDataOutputStream out = fs.append(outFile);
        out.write(buffer, 0, bytesRead);
        out.close();
        i++;
      }
    } catch (IOException e) {
      System.out.println("Error while copying file: " + e.getMessage());
    } finally {
      in.close();
      System.out.println("Number of loops: " + i);
    }
  }
}

 

Here's the information before I ran the code:

---
[hdfs@chdhost125 current]$ hadoop fs -df -h
Filesystem                            Size     Used  Available  Use%
hdfs://chdhost185.vitaldev.com:8020  266.4 G  38.2 G    139.8 G   14%
---
[hdfs@chdhost125 lib]$ hadoop fs -du -h /
67.7 M  1.3 G   /hbase
0       0       /myhdfs
0       0       /solr
1.8 G   5.4 G   /tmp
10.6 G  31.4 G  /user

 

 

Here's the information while the above code was running:

---
Filesystem                            Size      Used  Available  Use%
hdfs://chdhost185.vitaldev.com:8020  266.4 G  170.2 G     95.9 G   64%
---
[hdfs@chdhost125 lib]$ hadoop fs -du -h /
67.7 M  1.3 G   /hbase
32.9 M  384 M   /myhdfs
0       0       /solr
1.8 G   5.4 G   /tmp
10.6 G  31.4 G  /user

 

 

After 10 minutes, my cluster was out of storage and my program threw an
exception with this error: "Error while copying file: Failed to replace a bad
datanode on the existing pipeline due to no more good datanodes being
available to try. (Nodes: current=[192.168.94.185:50010,
192.168.94.27:50010], original=[192.168.94.185:50010, 192.168.94.27:50010]).
The current failed datanode replacement policy is DEFAULT, and a client may
configure this via
'dfs.client.block.write.replace-datanode-on-failure.policy' in its
configuration."

 

So, why does appending to the file in small chunks (1024 bytes at a time)
make my cluster run out of space? The local file is 85 MB, but HDFS consumed
~140 GB while appending it.

Is there any problem with my code? I know that appending in such small
chunks is not recommended, but I just want to understand why HDFS consumes
so much space.
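
For comparison, here is a minimal sketch of the open-once variant (same
cluster URI and paths as above; it assumes the HDFS file already exists,
since append() requires an existing file), where the append stream is opened
before the loop and closed once after it instead of an open/write/close
cycle per 1024-byte buffer:

import java.io.FileInputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendOnce {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(new URI("hdfs://192.168.94.185:8020"), conf);
    Path outFile = new Path("/myhdfs/testdata.txt");

    byte[] buffer = new byte[1024];
    FileInputStream in = new FileInputStream("testdata.txt");
    FSDataOutputStream out = fs.append(outFile); // one append stream for the whole copy
    try {
      int bytesRead;
      while ((bytesRead = in.read(buffer)) > 0) {
        out.write(buffer, 0, bytesRead);         // buffered by the HDFS client
      }
    } finally {
      out.close();                               // single close, single write pipeline
      in.close();
    }
  }
}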

 

 

Thanks and Regards,

Quan Nguyen

 



yarn groups issue with ambari/hortonworks

2015-08-24 Thread REYANE OUKPEDJO
Hi there,
I faced a weird issue, and I think there is no need to give many details
other than the actual issue.
When I add the yarn user to another group, I can see, from within the same
shell where I added it, that yarn belongs to the new group. But after
restarting the NodeManager through Ambari and opening a new shell, I can see
that the yarn user no longer belongs to the new group. My question is whether
Ambari is removing the yarn user from every group other than its primary one.
Any response to this will be appreciated.
Thanks
Reyane OUKPEDJO



Questions with regards to Yarn/Hadoop

2015-08-24 Thread Omid Alipourfard
Hi,

I am running the Terasort benchmark that comes with Hadoop 2.7.1 (10 GB, 25
reducers, 50 mappers). I am experiencing unexpected behavior with Yarn,
which I am hoping someone can shed some light on:

I have a cluster of three machines with 2 cores and 3.75 GB of RAM each.
When I run the Terasort job, one of the machines is idling, i.e., it is not
using any substantial disk or CPU. All three machines are capable of
executing jobs, and one of the machines is both a name node and a data node.

On the other hand, running the same job on a cluster of three machines with
2 cores and 8 GB of RAM (per machine) utilizes all the machines.

Both setups use the same Hadoop configuration files; in both of them, mapper
tasks have 1 GB and reducer tasks have 2 GB of memory.

I am guessing Yarn is not utilizing the machines correctly -- maybe because
of the available amount of RAM, but I am not sure how to verify this.
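
One way to check what the ResourceManager itself reports for each node (a
sketch against the YarnClient API from hadoop-yarn-client; the class name is
made up, and this is only a diagnostic aid) is to print the node reports:

import java.util.List;

import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class NodeMemoryReport {
  public static void main(String[] args) throws Exception {
    YarnClient yarn = YarnClient.createYarnClient();
    yarn.init(new YarnConfiguration());
    yarn.start();
    try {
      // One report per running NodeManager: capacity vs. what is currently allocated.
      List<NodeReport> nodes = yarn.getNodeReports(NodeState.RUNNING);
      for (NodeReport node : nodes) {
        long usedMb = node.getUsed() == null ? 0 : node.getUsed().getMemory();
        System.out.println(node.getNodeId() + ": containers=" + node.getNumContainers()
            + ", usedMB=" + usedMb
            + ", capacityMB=" + node.getCapability().getMemory());
      }
    } finally {
      yarn.stop();
    }
  }
}

If the per-node capacity reported there (controlled by
yarn.nodemanager.resource.memory-mb) is small relative to the 1 GB map, 2 GB
reduce, and AM container sizes, the scheduler can end up placing all the
runnable containers on two of the three nodes.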

Any thoughts on what the problem might be, or how to verify it, would be
appreciated.
Thanks,
Omid

P.S. I can also post any of the logs or configuration files.


RE: Questions with regards to Yarn/Hadoop

2015-08-24 Thread Naganarasimha G R (Naga)
Hi Omid,
Seems like the machine that was running slow might also be hosting the AM
container, and possibly 2 GB is assigned to it.
Can you share the following details:
* The memory configuration of the AM container.
* The containers running on the idling machine: is it always the same
machine, or a different one every time? If it is always the same machine,
are there any other processes running on it?
* The job counters for both runs will also provide useful information;
please share those across.

Regards,
+ Naga


