Re: How to load raw log file into HDFS?

2012-05-14 Thread rdaley

If you are a novice I'd suggest using a visual design tool like Pentaho
Kettle: http://wiki.pentaho.com/display/BAD/Loading+Data+into+HDFS ("How To
Load Data into HDFS").

AnExplorer wrote:
> 
> Hi, I am a novice in Hadoop. Kindly suggest how we can load log files into
> HDFS. Please suggest the command and steps.
> Thanks in advance!!
> 

-- 
View this message in context: 
http://old.nabble.com/How-to-load-raw-log-file-into-HDFS--tp33815208p33832683.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Terasort

2012-05-14 Thread Barry, Sean F
I am having a bit of trouble understanding how the Terasort benchmark works,
especially the fundamentals of how the data is sorted. If the data is being
split into many chunks, wouldn't it all have to be re-integrated back into the
entire dataset?

And since a terabyte is huge, wouldn't it take a very long time? I seem to be
missing a few crucial steps in the process, and if someone could help me
understand how Terasort works, that would be great. Any papers or videos on
this topic would be greatly appreciated.



-SB



Re: Namenode EOF Exception

2012-05-14 Thread Harsh J
Your fsimage seems to have gone bad (is it 0-sized? I recall that as a
known issue long since fixed).

The easiest way is to fall back to the last available good checkpoint
(from the SNN). Or if you have multiple dfs.name.dirs, see if some of the
other locations have better/complete files on them, and re-spread them
across the directories after testing them out (and backing up the originals).

Though what version are you running? Cause AFAIK most of the recent
stable versions/distros include NN resource monitoring threads which
should have placed your NN into safemode the moment all its disks ran
nearly out of space.

On Mon, May 14, 2012 at 10:50 PM, Prashant Kommireddi
 wrote:
> Hi,
>
> I am seeing an issue where the Namenode does not start due to an EOFException. The
> disk was full and I cleared up space, but I am unable to get past this
> exception. Any ideas on how this can be resolved?
>
> 2012-05-14 10:10:44,018 INFO
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=hadoop
> 2012-05-14 10:10:44,018 INFO
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
> isPermissionEnabled=false
> 2012-05-14 10:10:44,023 INFO
> org.apache.hadoop.hdfs.server.namenode.metrics.FSNamesystemMetrics:
> Initializing FSNamesystemMetrics using context
> object:org.apache.hadoop.metrics.file.FileContext
> 2012-05-14 10:10:44,024 INFO
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered
> FSNamesystemStatusMBean
> 2012-05-14 10:10:44,047 INFO org.apache.hadoop.hdfs.server.common.Storage:
> Number of files = 205470
> 2012-05-14 10:10:44,844 ERROR
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem
> initialization failed.
> java.io.EOFException
>    at java.io.DataInputStream.readFully(DataInputStream.java:180)
>    at org.apache.hadoop.io.UTF8.readFields(UTF8.java:106)
>    at
> org.apache.hadoop.hdfs.server.namenode.FSImage.readString(FSImage.java:1578)
>    at
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:880)
>    at
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:807)
>    at
> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364)
>    at
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
>    at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
>    at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.(FSNamesystem.java:292)
>    at
> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
>    at
> org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:279)
>    at
> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
>    at
> org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)
> 2012-05-14 10:10:44,845 INFO org.apache.hadoop.ipc.Server: Stopping server
> on 54310
> 2012-05-14 10:10:44,845 ERROR
> org.apache.hadoop.hdfs.server.namenode.NameNode: java.io.EOFException
>    at java.io.DataInputStream.readFully(DataInputStream.java:180)
>    at org.apache.hadoop.io.UTF8.readFields(UTF8.java:106)
>    at
> org.apache.hadoop.hdfs.server.namenode.FSImage.readString(FSImage.java:1578)
>    at
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:880)
>    at
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:807)
>    at
> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364)
>    at
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
>    at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
>    at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.(FSNamesystem.java:292)
>    at
> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
>    at
> org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:279)
>    at
> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
>    at
> org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)
>
> 2012-05-14 10:10:44,846 INFO
> org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
> /
> SHUTDOWN_MSG: Shutting down NameNode at
> gridforce-1.internal.salesforce.com/10.0.201.159
> /



-- 
Harsh J


Re: Namenode EOF Exception

2012-05-14 Thread Prashant Kommireddi
Thanks Harsh. I am using 0.20.2; I see on the JIRA that this issue was
fixed in 0.23?

I will try out your suggestions and get back.

On May 14, 2012, at 1:22 PM, Harsh J  wrote:

> Your fsimage seems to have gone bad (is it 0-sized? I recall that as a
> known issue long since fixed).
>
> The easiest way is to fall back to the last available good checkpoint
> (from the SNN). Or if you have multiple dfs.name.dirs, see if some of the
> other locations have better/complete files on them, and re-spread them
> across the directories after testing them out (and backing up the originals).
>
> Though what version are you running? Cause AFAIK most of the recent
> stable versions/distros include NN resource monitoring threads which
> should have placed your NN into safemode the moment all its disks ran
> nearly out of space.
>
> On Mon, May 14, 2012 at 10:50 PM, Prashant Kommireddi
>  wrote:
>> Hi,
>>
>> I am seeing an issue where the Namenode does not start due to an EOFException. The
>> disk was full and I cleared up space, but I am unable to get past this
>> exception. Any ideas on how this can be resolved?
>>
>> 2012-05-14 10:10:44,018 INFO
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=hadoop
>> 2012-05-14 10:10:44,018 INFO
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
>> isPermissionEnabled=false
>> 2012-05-14 10:10:44,023 INFO
>> org.apache.hadoop.hdfs.server.namenode.metrics.FSNamesystemMetrics:
>> Initializing FSNamesystemMetrics using context
>> object:org.apache.hadoop.metrics.file.FileContext
>> 2012-05-14 10:10:44,024 INFO
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered
>> FSNamesystemStatusMBean
>> 2012-05-14 10:10:44,047 INFO org.apache.hadoop.hdfs.server.common.Storage:
>> Number of files = 205470
>> 2012-05-14 10:10:44,844 ERROR
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem
>> initialization failed.
>> java.io.EOFException
>>at java.io.DataInputStream.readFully(DataInputStream.java:180)
>>at org.apache.hadoop.io.UTF8.readFields(UTF8.java:106)
>>at
>> org.apache.hadoop.hdfs.server.namenode.FSImage.readString(FSImage.java:1578)
>>at
>> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:880)
>>at
>> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:807)
>>at
>> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364)
>>at
>> org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
>>at
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
>>at
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.(FSNamesystem.java:292)
>>at
>> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
>>at
>> org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:279)
>>at
>> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
>>at
>> org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)
>> 2012-05-14 10:10:44,845 INFO org.apache.hadoop.ipc.Server: Stopping server
>> on 54310
>> 2012-05-14 10:10:44,845 ERROR
>> org.apache.hadoop.hdfs.server.namenode.NameNode: java.io.EOFException
>>at java.io.DataInputStream.readFully(DataInputStream.java:180)
>>at org.apache.hadoop.io.UTF8.readFields(UTF8.java:106)
>>at
>> org.apache.hadoop.hdfs.server.namenode.FSImage.readString(FSImage.java:1578)
>>at
>> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:880)
>>at
>> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:807)
>>at
>> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364)
>>at
>> org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
>>at
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
>>at
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.(FSNamesystem.java:292)
>>at
>> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
>>at
>> org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:279)
>>at
>> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
>>at
>> org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)
>>
>> 2012-05-14 10:10:44,846 INFO
>> org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
>> /
>> SHUTDOWN_MSG: Shutting down NameNode at
>> gridforce-1.internal.salesforce.com/10.0.201.159
>> /
>
>
>
> --
> Harsh J


RE: How to load raw log file into HDFS?

2012-05-14 Thread Michael Wang
I have the same question and I am glad to get you guys' help. I am also a novice
in Hadoop :)
I am using Pig and Hive to analyze the logs. My logs are in Local_file_path.
Do I need to use "hadoop fs -copyFromLocal" to put the files to HDFS_path
first, and then load the data files into Pig or Hive from HDFS_path? Or can I
just load the logs from Local_file_path directly into Pig or Hive? After I load the
files into Hive, I found they are put at /user/hive/warehouse. Is
/user/hive/warehouse in HDFS?
How do I know what HDFS paths are available?

-Original Message-
From: Alexander Fahlke [mailto:alexander.fahlke.mailingli...@googlemail.com] 
Sent: Monday, May 14, 2012 1:53 AM
To: common-user@hadoop.apache.org
Subject: Re: How to load raw log file into HDFS?

Hi,

the best would be to read the documentation and some books to get familiar
with Hadoop.

One of my favourite books is "Hadoop in Action" from Manning
(http://www.manning.com/lam/).
This book has an example for putting (log) files into HDFS. Check out the
source "listing-3-1".

Later you can also check out Cloudera's Flume:
https://github.com/cloudera/flume/wiki

-- 
BR

Alexander Fahlke
Java Developer
www.nurago.com | www.fahlke.org


On Mon, May 14, 2012 at 7:24 AM, Amith D K  wrote:

> U can even use put/copyFromLocal.
>
> Both are similar and do the job via the terminal.
>
> Or u can write a simple client program to do the job :)
>
> Amith
>
>
> 
> From: samir das mohapatra [samir.help...@gmail.com]
> Sent: Sunday, May 13, 2012 9:13 PM
> To: common-user@hadoop.apache.org
> Subject: Re: How to load raw log file into HDFS?
>
> Hi
> To load any file from local
> Command:
>  syntax: hadoop fs -copyFromLocal <local src> <hdfs dst>
>   Example: hadoop fs -copyFromLocal input/logs
> hdfs://localhost/user/dataset/
>
>  More Commands:
> http://hadoop.apache.org/common/docs/r0.17.1/hdfs_shell.html
>
>
> On Sun, May 13, 2012 at 9:53 AM, AnExplorer 
> wrote:
>
> >
> > Hi, I am a novice in Hadoop. Kindly suggest how we can load log files into
> > HDFS.
> > Please suggest the command and steps.
> > Thanks in advance!!
> > --
> > View this message in context:
> >
> http://old.nabble.com/How-to-load-raw-log-file-into-HDFS--tp33815208p33815208.html
> > Sent from the Hadoop core-user mailing list archive at Nabble.com.
> >
> >
>




Re: Namenode EOF Exception

2012-05-14 Thread Harsh J
True, I don't recall 0.20.2 (the original release that was a few years
ago) carrying these fixes. You ought to upgrade that cluster to the
current stable release for the many fixes you can benefit from :)

On Mon, May 14, 2012 at 11:58 PM, Prashant Kommireddi
 wrote:
> Thanks Harsh. I am using 0.20.2; I see on the JIRA that this issue was
> fixed in 0.23?
>
> I will try out your suggestions and get back.
>
> On May 14, 2012, at 1:22 PM, Harsh J  wrote:
>
>> Your fsimage seems to have gone bad (is it 0-sized? I recall that as a
>> known issue long since fixed).
>>
>> The easiest way is to fall back to the last available good checkpoint
>> (from the SNN). Or if you have multiple dfs.name.dirs, see if some of the
>> other locations have better/complete files on them, and re-spread them
>> across the directories after testing them out (and backing up the originals).
>>
>> Though what version are you running? Cause AFAIK most of the recent
>> stable versions/distros include NN resource monitoring threads which
>> should have placed your NN into safemode the moment all its disks ran
>> nearly out of space.
>>
>> On Mon, May 14, 2012 at 10:50 PM, Prashant Kommireddi
>>  wrote:
>>> Hi,
>>>
>>> I am seeing an issue where the Namenode does not start due to an EOFException. The
>>> disk was full and I cleared up space, but I am unable to get past this
>>> exception. Any ideas on how this can be resolved?
>>>
>>> 2012-05-14 10:10:44,018 INFO
>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=hadoop
>>> 2012-05-14 10:10:44,018 INFO
>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
>>> isPermissionEnabled=false
>>> 2012-05-14 10:10:44,023 INFO
>>> org.apache.hadoop.hdfs.server.namenode.metrics.FSNamesystemMetrics:
>>> Initializing FSNamesystemMetrics using context
>>> object:org.apache.hadoop.metrics.file.FileContext
>>> 2012-05-14 10:10:44,024 INFO
>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered
>>> FSNamesystemStatusMBean
>>> 2012-05-14 10:10:44,047 INFO org.apache.hadoop.hdfs.server.common.Storage:
>>> Number of files = 205470
>>> 2012-05-14 10:10:44,844 ERROR
>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem
>>> initialization failed.
>>> java.io.EOFException
>>>    at java.io.DataInputStream.readFully(DataInputStream.java:180)
>>>    at org.apache.hadoop.io.UTF8.readFields(UTF8.java:106)
>>>    at
>>> org.apache.hadoop.hdfs.server.namenode.FSImage.readString(FSImage.java:1578)
>>>    at
>>> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:880)
>>>    at
>>> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:807)
>>>    at
>>> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364)
>>>    at
>>> org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
>>>    at
>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
>>>    at
>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.(FSNamesystem.java:292)
>>>    at
>>> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
>>>    at
>>> org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:279)
>>>    at
>>> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
>>>    at
>>> org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)
>>> 2012-05-14 10:10:44,845 INFO org.apache.hadoop.ipc.Server: Stopping server
>>> on 54310
>>> 2012-05-14 10:10:44,845 ERROR
>>> org.apache.hadoop.hdfs.server.namenode.NameNode: java.io.EOFException
>>>    at java.io.DataInputStream.readFully(DataInputStream.java:180)
>>>    at org.apache.hadoop.io.UTF8.readFields(UTF8.java:106)
>>>    at
>>> org.apache.hadoop.hdfs.server.namenode.FSImage.readString(FSImage.java:1578)
>>>    at
>>> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:880)
>>>    at
>>> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:807)
>>>    at
>>> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364)
>>>    at
>>> org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
>>>    at
>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
>>>    at
>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.(FSNamesystem.java:292)
>>>    at
>>> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
>>>    at
>>> org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:279)
>>>    at
>>> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
>>>    at
>>> org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)
>>>
>>> 2012-05-14 10:10:44,846 INFO
>>> org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
>>> /
>>> SHUTDOWN_MSG: Shutting down NameNode at
>>> gridforce-1.internal.salesforce.com/10.0.201.159
>>> **

Re: Resource underutilization / final reduce tasks only uses half of cluster ( tasktracker map/reduce slots )

2012-05-14 Thread Abhishek Pratap Singh
Hi JD,

The number of reduce tasks that actually receive data will depend upon the keys
emitted once all the mappers are done. If the key is the same for every record,
then all the data will go to one node; similarly, utilization of all the nodes in
the cluster will depend upon the number of different keys reaching the reduce tasks.


Regards,
Abhishek

On Fri, May 11, 2012 at 4:57 PM, Jeremy Davis
wrote:

>
> I see mapred.tasktracker.reduce.tasks.maximum and
> mapred.tasktracker.map.tasks.maximum, but I'm wondering if there isn't
> another tuning parameter I need to look at.
>
> I can tune the task tracker so that when I have many jobs running, with
> many simultaneous maps and reduces I utilize 95% of cpu and memory.
>
> Inevitably, though, I end up with a huge final reduce task that only uses
> half of my cluster because I have reserved the other half for mapping.
>
> Is there a way around this problem?
>
> Seems like there should also be a maximum number of reducers conditional
> on no Map tasks running.
>
> -JD


Re: Moving files from JBoss server to HDFS

2012-05-14 Thread Abhishek Pratap Singh
What about using Flume to consume the files and send them to HDFS? It depends
upon what security implications you are concerned about.
It will be easier if you can look into and answer the similar questions below.

Is the JBoss server process pushing the files to HDFS, or is some other process
pushing them to HDFS?
Is the intermediate server doing some security check on the data or on user
access? What processing is done on the intermediate server? If none, how is it
different from sending the files directly to HDFS?

Regards,
Abhishek

On Sat, May 12, 2012 at 3:56 AM, samir das mohapatra <
samir.help...@gmail.com> wrote:

> Hi  financeturd ,
>From my point of view, the second setup, like the one below, is the better approach:
>
> {Separate server} <-- {JBoss server}
> and then
> {Separate server} --> HDFS
>
> thanks
>   samir
>
> On Sat, May 12, 2012 at 6:00 AM, financeturd financeturd <
> financet...@yahoo.com> wrote:
>
> > Hello,
> >
> > We have a large number of
> > custom-generated files (not just web logs) that we need to move from our
> > JBoss servers to HDFS.  Our first implementation ran a cron job every 5
> > minutes to move our files from the "output" directory to HDFS.
> >
> > Is this recommended?  We are being told by our IT team that our JBoss
> > servers should not have access to HDFS for security reasons.  The files
> > must be "sucked" to HDFS by other servers that do not accept traffic
> > from the outside.  In essence, they are asking for a layer of
> > indirection.  Instead of:
> > {JBoss server} --> {HDFS}
> > it's being requested that it look like:
> > {Separate server} <-- {JBoss server}
> > and then
> > {Separate server} --> HDFS
> >
> >
> > While I understand in principle what is being said, the security of
> having
> > processes on JBoss servers writing files to HDFS doesn't seem any worse
> > than having Tomcat servers access a central database, which they do.
> >
> > Can anyone comment on what a recommended approach would be?  Should our
> > JBoss servers push their data to HDFS or should the data be pulled by
> > another server and then placed into HDFS?
> >
> > Thank you!
> > FT
>


Re: Number of Reduce Tasks

2012-05-14 Thread Abhishek Pratap Singh
AFAIK the number of reducers that receive data depends upon the keys generated
after the mappers are done. Maybe the join is resulting in just one key.

Regards,
Abhishek
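
For what it's worth, the default HashPartitioner behaves essentially like the
sketch below (illustrative, not the exact Hadoop source): the partition index is
derived only from the key, so a single distinct key keeps one reduce task busy
even when setNumReduceTasks(10) has been called.

  import org.apache.hadoop.mapreduce.Partitioner;

  // Sketch of default hash partitioning: records sharing one key always hash
  // to the same partition, so one distinct key means one busy reduce task.
  public class SketchHashPartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
      return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
  }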
On Sun, May 13, 2012 at 10:48 PM, anwar shaikh wrote:

> Hi Everybody,
>
> I am executing a MapReduce job to execute JOIN operation using
> org.apache.hadoop.contrib.utils.join
>
> Four files are given as Input.
>
> I think there are four map tasks running (based on the line marked in red).
>
> I have also set the number of reducers to be 10 using job.setNumReduceTasks(10).
>
> But only one reduce task is performed (the line marked in blue).
>
> So, please can you suggest how I can increase the number of reducers?
>
> Below are some of the last lines from the log.
>
>
> -
> 12/05/14 10:32:46 INFO mapred.Task: Task '*attempt_local_0001_m_03_0*'
> done.
> 12/05/14 10:32:46 INFO mapred.LocalJobRunner:
> 12/05/14 10:32:46 INFO mapred.Merger: Merging 4 sorted segments
> 12/05/14 10:32:46 INFO mapred.Merger: Down to the last merge-pass, with 4
> segments left of total size: 8018 bytes
> 12/05/14 10:32:46 INFO mapred.LocalJobRunner:
> 12/05/14 10:32:46 INFO datajoin.job: key: 1 this.largestNumOfValues: 48
> 12/05/14 10:32:46 INFO mapred.Task: Task:attempt_local_0001_r_00_0 is
> done. And is in the process of commiting
> 12/05/14 10:32:46 INFO mapred.LocalJobRunner:
> 12/05/14 10:32:46 INFO mapred.Task: Task attempt_local_0001_r_00_0 is
> allowed to commit now
> 12/05/14 10:32:46 INFO mapred.FileOutputCommitter: Saved output of task
> 'attempt_local_0001_r_00_0' to
> file:/home/anwar/workspace/JoinLZOPfiles/OutLarge
> 12/05/14 10:32:49 INFO mapred.LocalJobRunner: actuallyCollectedCount 86
> collectedCount 86
> groupCount 25
>  > reduce
> 12/05/14 10:32:49 INFO mapred.Task: Task
> '*attempt_local_0001_r_00_0'*done.
> 12/05/14 10:32:50 INFO mapred.JobClient:  map 100% reduce 100%
> 12/05/14 10:32:50 INFO mapred.JobClient: Job complete: job_local_0001
> 12/05/14 10:32:50 INFO mapred.JobClient: Counters: 17
> 12/05/14 10:32:50 INFO mapred.JobClient:   File Input Format Counters
> 12/05/14 10:32:50 INFO mapred.JobClient: Bytes Read=1666
> 12/05/14 10:32:50 INFO mapred.JobClient:   File Output Format Counters
> 12/05/14 10:32:50 INFO mapred.JobClient: Bytes Written=2421
> 12/05/14 10:32:50 INFO mapred.JobClient:   FileSystemCounters
> 12/05/14 10:32:50 INFO mapred.JobClient: FILE_BYTES_READ=22890
> 12/05/14 10:32:50 INFO mapred.JobClient: FILE_BYTES_WRITTEN=194702
> 12/05/14 10:32:50 INFO mapred.JobClient:   Map-Reduce Framework
> 12/05/14 10:32:50 INFO mapred.JobClient: Map output materialized
> bytes=8034
> 12/05/14 10:32:50 INFO mapred.JobClient: Map input records=106
> 12/05/14 10:32:50 INFO mapred.JobClient: Reduce shuffle bytes=0
> 12/05/14 10:32:50 INFO mapred.JobClient: Spilled Records=212
> 12/05/14 10:32:50 INFO mapred.JobClient: Map output bytes=7798
> 12/05/14 10:32:50 INFO mapred.JobClient: Map input bytes=1666
> 12/05/14 10:32:50 INFO mapred.JobClient: SPLIT_RAW_BYTES=472
> 12/05/14 10:32:50 INFO mapred.JobClient: Combine input records=0
> 12/05/14 10:32:50 INFO mapred.JobClient: Reduce input records=106
> 12/05/14 10:32:50 INFO mapred.JobClient: Reduce input groups=25
> 12/05/14 10:32:50 INFO mapred.JobClient: Combine output records=0
> 12/05/14 10:32:50 INFO mapred.JobClient: Reduce output records=86
> 12/05/14 10:32:50 INFO mapred.JobClient: Map output records=106
>
> --
> Mr. Anwar Shaikh
> Delhi Technological University, Delhi
> +91 92 50 77 12 44
>


Re: Terasort

2012-05-14 Thread Owen O'Malley
On Mon, May 14, 2012 at 10:40 AM, Barry, Sean F  wrote:
> I am having a bit of trouble understanding how the Terasort benchmark works,
> especially the fundamentals of how the data is sorted. If the data is being
> split into many chunks, wouldn't it all have to be re-integrated back into the
> entire dataset?

Before the job is launched, the input is sampled to find "cut" points.
Those cut points are used to assign keys to reduces. For example, if
you have 100 reduces, there are 99 cut keys chosen. All keys less than the
first cut key are sent to the first reduce, keys between the first two
cut keys are sent to the second reduce, and so on. The logic is done by the
TotalOrderPartitioner, which replaces MapReduce's default
HashPartitioner.
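
To make the cut-point idea concrete, here is a rough sketch of that kind of
partitioner (illustrative only; the class and field names are made up, and the
real TotalOrderPartitioner reads its cut points from a partition file and can
use a trie for speed):

  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Partitioner;

  // Illustrative total-order partitioning: cutPoints holds the 99 sampled
  // keys for 100 reduces; a binary search picks the reduce for each key.
  // A real partitioner would load the cut points from a file, not a constructor.
  public class CutPointPartitioner extends Partitioner<Text, Text> {
    private final Text[] cutPoints; // sorted, length == numReduces - 1

    public CutPointPartitioner(Text[] cutPoints) {
      this.cutPoints = cutPoints;
    }

    @Override
    public int getPartition(Text key, Text value, int numReduces) {
      int lo = 0, hi = cutPoints.length;
      while (lo < hi) {                        // binary search for the slot
        int mid = (lo + hi) >>> 1;
        if (key.compareTo(cutPoints[mid]) < 0) {
          hi = mid;
        } else {
          lo = mid + 1;
        }
      }
      return lo;                               // 0 .. numReduces - 1
    }
  }

Because every key sent to reduce i sorts before every key sent to reduce i+1,
each reduce's output file is itself sorted and the concatenation of the output
files is already in total order, so no re-merge of the full dataset is needed.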

-- Owen


Random Sample in Map/Reduce

2012-05-14 Thread Shi Yu
Hi,

Before I raise this question I searched relevant topics. There 
are suggestions online:

"Mappers: Output all qualifying values, each with a random 
integer key.

Single reducer: Output the first N values, throwing away the 
keys."

However, this scheme seems not very efficient when the data
set is very huge, for example, sampling 100 out of one
billion. Things are especially bad when the map task is
computationally demanding. I was trying to write a program to do
the sampling in the mappers; however, I ended up storing everything in
memory and letting the final sampling be done at the Mapper.cleanup()
stage. It still does not seem a graceful way to do it because it
requires lots of memory. Maybe a better way is to control the
random sampling at the file split stage. Is there any good
approach existing?

Best,

Shi


Re: How to load raw log file into HDFS?

2012-05-14 Thread Manish Bhoge
You first need to copy the data to HDFS using copyFromLocal, and then you can
utilize Pig and Hive programs, which run on MapReduce, for further analysis. Yes,
the warehouse directory is in HDFS. If you want to run (test) Pig in local mode,
then in that case you don't need to copy the data to HDFS.
Sent from my BlackBerry, pls excuse typo
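
For reference, the "simple client program" route mentioned earlier in the thread
could look roughly like the sketch below, using the HDFS FileSystem API (the
paths are placeholders and the cluster configuration is assumed to be on the
classpath):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  // Minimal sketch: copy a local log file into HDFS programmatically
  // instead of shelling out to "hadoop fs -copyFromLocal".
  public class LogUploader {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration(); // reads core-site.xml/hdfs-site.xml from the classpath
      FileSystem fs = FileSystem.get(conf);
      Path src = new Path("/var/log/myapp/app.log"); // local path (placeholder)
      Path dst = new Path("/user/dataset/app.log");  // HDFS path (placeholder)
      fs.copyFromLocalFile(src, dst);
      fs.close();
    }
  }

Once the file is in HDFS, a Hive LOAD DATA statement or a Pig LOAD can pick it
up from that HDFS path.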

-Original Message-
From: Michael Wang 
Date: Mon, 14 May 2012 18:43:47 
To: common-user@hadoop.apache.org
Reply-To: common-user@hadoop.apache.org
Subject: RE: How to load raw log file into HDFS?

I have the same question and I am glad to get you guys' help. I am also a novice
in Hadoop :)
I am using Pig and Hive to analyze the logs. My logs are in Local_file_path.
Do I need to use "hadoop fs -copyFromLocal" to put the files to HDFS_path
first, and then load the data files into Pig or Hive from HDFS_path? Or can I
just load the logs from Local_file_path directly into Pig or Hive? After I load the
files into Hive, I found they are put at /user/hive/warehouse. Is
/user/hive/warehouse in HDFS?
How do I know what HDFS paths are available?

-Original Message-
From: Alexander Fahlke [mailto:alexander.fahlke.mailingli...@googlemail.com] 
Sent: Monday, May 14, 2012 1:53 AM
To: common-user@hadoop.apache.org
Subject: Re: How to load raw log file into HDFS?

Hi,

the best would be to read the documentation and some books to get familiar
with Hadoop.

One of my favourite books is "Hadoop in Action" from Manning
(http://www.manning.com/lam/).
This book has an example for putting (log) files into HDFS. Check out the
source "listing-3-1".

Later you can also check out Cloudera's Flume:
https://github.com/cloudera/flume/wiki

-- 
BR

Alexander Fahlke
Java Developer
www.nurago.com | www.fahlke.org


On Mon, May 14, 2012 at 7:24 AM, Amith D K  wrote:

> U can even use put/copyFromLocal.
>
> Both are similar and do the job via the terminal.
>
> Or u can write a simple client program to do the job :)
>
> Amith
>
>
> 
> From: samir das mohapatra [samir.help...@gmail.com]
> Sent: Sunday, May 13, 2012 9:13 PM
> To: common-user@hadoop.apache.org
> Subject: Re: How to load raw log file into HDFS?
>
> Hi
> To load any file from local
> Command:
>  syntax: hadoop fs -copyFromLocal <local src> <hdfs dst>
>   Example: hadoop fs -copyFromLocal input/logs
> hdfs://localhost/user/dataset/
>
>  More Commands:
> http://hadoop.apache.org/common/docs/r0.17.1/hdfs_shell.html
>
>
> On Sun, May 13, 2012 at 9:53 AM, AnExplorer 
> wrote:
>
> >
> > Hi, I am a novice in Hadoop. Kindly suggest how we can load log files into
> > HDFS.
> > Please suggest the command and steps.
> > Thanks in advance!!
> > --
> > View this message in context:
> >
> http://old.nabble.com/How-to-load-raw-log-file-into-HDFS--tp33815208p33815208.html
> > Sent from the Hadoop core-user mailing list archive at Nabble.com.
> >
> >
>




Re: How to load raw log file into HDFS?

2012-05-14 Thread Ranjith
You can load data into a Hive table (external or internal) directly
from the local file system. The same stands for Pig. To Manish's point, you can
do the same using hadoop fs commands. I have tried it both ways and have seen a
difference in performance. I would be interested to hear from the rest of the
community about this, to see if it is consistent with what they have seen.

Thanks,
Ranjith

On May 14, 2012, at 8:45 PM, "Manish Bhoge"  wrote:

> You first need to copy the data to HDFS using copyFromLocal, and then you can
> utilize Pig and Hive programs, which run on MapReduce, for further analysis.
> Yes, the warehouse directory is in HDFS. If you want to run (test) Pig in local mode,
> then in that case you don't need to copy the data to HDFS.
> Sent from my BlackBerry, pls excuse typo
> 
> -Original Message-
> From: Michael Wang 
> Date: Mon, 14 May 2012 18:43:47 
> To: common-user@hadoop.apache.org
> Reply-To: common-user@hadoop.apache.org
> Subject: RE: How to load raw log file into HDFS?
> 
> I have the same question and I am glad to get you guys' help. I am also
> a novice in Hadoop :)
> I am using Pig and Hive to analyze the logs. My logs are in
> Local_file_path.
> Do I need to use "hadoop fs -copyFromLocal" to put the files to HDFS_path
> first, and then load the data files into Pig or Hive from HDFS_path? Or can I
> just load the logs from Local_file_path directly into Pig or Hive? After I load the
> files into Hive, I found they are put at /user/hive/warehouse. Is
> /user/hive/warehouse in HDFS?
> How do I know what HDFS paths are available?
> 
> -Original Message-
> From: Alexander Fahlke [mailto:alexander.fahlke.mailingli...@googlemail.com] 
> Sent: Monday, May 14, 2012 1:53 AM
> To: common-user@hadoop.apache.org
> Subject: Re: How to load raw log file into HDFS?
> 
> Hi,
> 
> the best would be to read the documentation and some books to get familiar
> with Hadoop.
> 
> One of my favourite books is "Hadoop in Action" from Manning
> (http://www.manning.com/lam/).
> This book has an example for putting (log) files into HDFS. Check out the
> source "listing-3-1".
> 
> Later you can also check out Cloudera's Flume:
> https://github.com/cloudera/flume/wiki
> 
> -- 
> BR
> 
> Alexander Fahlke
> Java Developer
> www.nurago.com | www.fahlke.org
> 
> 
> On Mon, May 14, 2012 at 7:24 AM, Amith D K  wrote:
> 
>> U can even use put/copyFromLocal.
>> 
>> Both are similar and do the job via the terminal.
>> 
>> Or u can write a simple client program to do the job :)
>> 
>> Amith
>> 
>> 
>> 
>> From: samir das mohapatra [samir.help...@gmail.com]
>> Sent: Sunday, May 13, 2012 9:13 PM
>> To: common-user@hadoop.apache.org
>> Subject: Re: How to load raw log file into HDFS?
>> 
>> Hi
>> To load any file from local
>> Command:
>> syntax: hadoop fs -copyFromLocal <local src> <hdfs dst>
>>  Example: hadoop fs -copyFromLocal input/logs
>> hdfs://localhost/user/dataset/
>> 
>> More Commands:
>> http://hadoop.apache.org/common/docs/r0.17.1/hdfs_shell.html
>> 
>> 
>> On Sun, May 13, 2012 at 9:53 AM, AnExplorer 
>> wrote:
>> 
>>> 
>>> Hi, I am a novice in Hadoop. Kindly suggest how we can load log files into
>>> HDFS.
>>> Please suggest the command and steps.
>>> Thanks in advance!!
>>> --
>>> View this message in context:
>>> 
>> http://old.nabble.com/How-to-load-raw-log-file-into-HDFS--tp33815208p33815208.html
>>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>>> 
>>> 
>> 
> 
> 


Re: Random Sample in Map/Reduce

2012-05-14 Thread Shi Yu
To answer my own question: I applied a non-repeating random
number generator in the mapper. At the mapper setup stage I generate
a pre-defined number of random numbers, then I keep a counter
as the mapper runs. When the counter is contained in the random
number set, the mapper executes and outputs data. The problem
now becomes how to know the ceiling of the random numbers
[1...ceiling]. That ceiling cannot be too small, or the sampling
is not valid, and it also cannot exceed the total number of data
records contained in each split. The problem is that my data
is not divided by line; sometimes a complete data record is
composed of multiple lines, so I am not sure how to estimate
that ceiling number ... Of course, if each line is a complete
record, that ceiling number is easy to obtain.
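
One way to avoid estimating that ceiling at all is reservoir sampling: each
mapper keeps a fixed-size sample regardless of how many records its split
contains, and a single reducer can apply the same logic to the mappers'
combined output to get the final N. A rough sketch is below (illustrative; N
and the input types are assumptions, and with multi-line records you would plug
in the appropriate InputFormat while the per-record logic stays the same):

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;
  import java.util.Random;

  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  // Reservoir sampling in a mapper: after k records, every record seen so far
  // has probability N/k of being in the reservoir, so no ceiling or total
  // record count is needed up front and only N records are held in memory.
  public class ReservoirSampleMapper
      extends Mapper<LongWritable, Text, NullWritable, Text> {

    private static final int N = 100;                 // sample size (assumption)
    private final List<Text> reservoir = new ArrayList<Text>(N);
    private final Random rand = new Random();
    private long seen = 0;

    @Override
    protected void map(LongWritable key, Text value, Context context) {
      seen++;
      if (reservoir.size() < N) {
        reservoir.add(new Text(value));               // fill the reservoir first
      } else {
        long j = (long) (rand.nextDouble() * seen);   // uniform index in [0, seen)
        if (j < N) {
          reservoir.set((int) j, new Text(value));    // replace with probability N/seen
        }
      }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
      for (Text sample : reservoir) {                 // emit at most N samples
        context.write(NullWritable.get(), sample);
      }
    }
  }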