Anatomy of read in HDFS

2017-04-06 Thread Sidharth Kumar
Hi Genies,

I have a small doubt: is the HDFS read operation a parallel or a sequential
process? From my understanding it should be parallel, but "Hadoop: The
Definitive Guide" (4th edition), in "Anatomy of a File Read", says: "Data is
streamed from the datanode back to the client, which calls read() repeatedly
on the stream (step 4). When the end of the block is reached, DFSInputStream
will close the connection to the datanode, then find the best datanode
for the next block (step 5). This happens transparently to the client, which
from its point of view is just reading a continuous stream."

So could you kindly explain how the read operation actually happens?


Thanks for your help in advance

Sidharth
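
For reference, the block-by-block behavior described in that quote is what a
single client observes through the FileSystem API. A minimal Java sketch of
that read path (the file path is a placeholder):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // open() returns an FSDataInputStream backed by DFSInputStream.
            // read() streams bytes from one datanode at a time; when a block
            // ends, the stream transparently connects to the best datanode
            // holding the next block.
            try (FSDataInputStream in = fs.open(new Path("/data/example.txt"))) {
                byte[] buffer = new byte[4096];
                int n;
                while ((n = in.read(buffer)) != -1) {
                    System.out.write(buffer, 0, n);
                }
                System.out.flush();
            }
        }
    }

A single open stream is therefore sequential; HDFS gets its parallelism from
many clients (for example, many map tasks) reading different blocks at the
same time.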


Customize Sqoop default property

2017-04-06 Thread Sidharth Kumar
Hi,

I am importing data from an RDBMS into Hadoop using Sqoop, but my RDBMS data
is multi-valued and contains the "," character. By default, Sqoop separates
columns with "," when importing into Hadoop. Is there any property through
which we can change this delimiter from "," to "|", or to any other
character that is not part of the data?

Thanks
Sidharth
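
For reference, Sqoop's import tool exposes the output delimiter through its
formatting arguments; a minimal sketch, with placeholder connection string,
credentials, table, and target directory:

    # Use '|' instead of the default ',' as the field delimiter.
    # All names below are placeholders.
    sqoop import \
      --connect jdbc:mysql://dbhost/mydb \
      --username myuser -P \
      --table mytable \
      --fields-terminated-by '|' \
      --target-dir /user/example/mytable

Alternatively, --optionally-enclosed-by and --escaped-by let you keep "," as
the delimiter while quoting or escaping field values that contain it.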


Re: Physical memory (bytes) snapshot counter question - how to get maximum memory used in reduce task

2017-04-06 Thread Miklos Szegedi
There are two new counters, MAP_PHYSICAL_MEMORY_BYTES_MAX and
REDUCE_PHYSICAL_MEMORY_BYTES_MAX, that give you the maximum value for map and
reduce tasks respectively.
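
Once the job completes, they can be read from the client side like any other
counter. A minimal sketch, assuming a Hadoop release recent enough to define
these TaskCounter entries:

    import org.apache.hadoop.mapreduce.Counters;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.TaskCounter;

    public class PeakMemoryReport {
        // Print per-job peak physical memory for map and reduce tasks.
        public static void print(Job job) throws Exception {
            Counters counters = job.getCounters();
            long mapPeak = counters
                .findCounter(TaskCounter.MAP_PHYSICAL_MEMORY_BYTES_MAX)
                .getValue();
            long reducePeak = counters
                .findCounter(TaskCounter.REDUCE_PHYSICAL_MEMORY_BYTES_MAX)
                .getValue();
            System.out.printf("map peak: %d bytes, reduce peak: %d bytes%n",
                mapPeak, reducePeak);
        }
    }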

Thanks,
Miklos

On Wed, Apr 5, 2017 at 6:37 PM, Aaron Eng  wrote:

> An important consideration is the difference between the RSS of the JVM
> process vs. the used heap size.  Which of those are you looking for? And
> also, importantly, why/what do you plan to do with that info?
>
> A second important consideration is the length of time you are at/around
> your max RSS/java heap.  Holding X MB of memory for 100ms is very different
> from holding X MB of memory for 100 seconds.  Are you looking for that
> info? And if so, how do you plan to use it?
>
> > On Apr 5, 2017, at 6:15 PM, Nico Pappagianis <
> nico.pappagia...@salesforce.com> wrote:
> >
> > Hi all
> >
> > I've made some memory optimizations on the reduce task and I would like
> to compare the old reducer vs new reducer in terms of maximum memory
> consumption.
> >
> > I have a question regarding the description of the following counter:
> >
> > PHYSICAL_MEMORY_BYTES | Physical memory (bytes) snapshot | Total
> physical memory used by all tasks including spilled data.
> >
> > I'm assuming this means the aggregate of memory used throughout the
> entire reduce task (if viewing at the reduce task-level).
> > Please correct me if I'm wrong on this assumption (the description seems
> pretty straightforward).
> >
> > Is there a way to get the maximum (not total) memory used by a reduce
> task from the default counters?
> >
> > Thanks!


Re: Physical memory (bytes) snapshot counter question - how to get maximum memory used in reduce task

2017-04-06 Thread Aaron Eng
An important consideration is the difference between the RSS of the JVM process 
vs. the used heap size.  Which of those are you looking for? And also, 
importantly, why/what do you plan to do with that info?

A second important consideration is the length of time you are at/around your 
max RSS/java heap.  Holding X MB of memory for 100ms is very different from 
holding X MB of memory for 100 seconds.  Are you looking for that info? And if 
so, how do you plan to use it?
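
To make the distinction concrete: the PHYSICAL_MEMORY_BYTES counter reflects
the operating system's view of the task process (its resident set), while
used heap is what the JVM reports from the inside. A minimal sketch of the
latter:

    public class HeapProbe {
        // Used heap as seen from inside the JVM (bounded by -Xmx).
        // The process RSS is larger: it also covers thread stacks,
        // metaspace, GC structures, and native buffers.
        public static long usedHeapBytes() {
            Runtime rt = Runtime.getRuntime();
            return rt.totalMemory() - rt.freeMemory();
        }
    }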

> On Apr 5, 2017, at 6:15 PM, Nico Pappagianis 
>  wrote:
> 
> Hi all
> 
> I've made some memory optimizations on the reduce task and I would like to 
> compare the old reducer vs new reducer in terms of maximum memory consumption.
> 
> I have a question regarding the description of the following counter:
> 
> PHYSICAL_MEMORY_BYTES | Physical memory (bytes) snapshot | Total physical 
> memory used by all tasks including spilled data.
> 
> I'm assuming this means the aggregate of memory used throughout the entire 
> reduce task (if viewing at the reduce task-level). 
> Please correct me if I'm wrong on this assumption (the description seems 
> pretty straightforward).
> 
> Is there a way to get the maximum (not total) memory used by a reduce task 
> from the default counters?
> 
> Thanks!