Re: Splunk + Hadoop

2012-05-18 Thread Russell Jurney
Because that isn't Cube.

Russell Jurney
twitter.com/rjurney
russell.jur...@gmail.com
datasyndrome.com

On May 18, 2012, at 2:01 PM, Ravi Shankar Nair wrote:

> Why not HBase with Hadoop?
> It's the best bet.
> Rgds, Ravi
>
> Sent from my Beethoven
>
>
> On May 18, 2012, at 3:29 PM, Russell Jurney  wrote:
>
>> I'm playing with using Hadoop and Pig to load MongoDB with data for Cube to
>> consume. Cube  is a realtime tool...
>> but we'll be replaying events from the past.  Does that count?  It is nice
>> to batch backfill metrics into 'real-time' systems in bulk.
>>
>> On Fri, May 18, 2012 at 12:11 PM,  wrote:
>>
>>> Hi ,
>>>
>>> Has anyone used Hadoop and splunk, or any other real-time processing tool
>>> over Hadoop?
>>>
>>> Regards,
>>> Shreya
>>>
>>
>> Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com


Re: Splunk + Hadoop

2012-05-18 Thread Ravi Shankar Nair
Why not HBase with Hadoop?
It's the best bet.
Rgds, Ravi

Sent from my Beethoven 


On May 18, 2012, at 3:29 PM, Russell Jurney  wrote:

> I'm playing with using Hadoop and Pig to load MongoDB with data for Cube to
> consume. Cube  is a realtime tool...
> but we'll be replaying events from the past.  Does that count?  It is nice
> to batch backfill metrics into 'real-time' systems in bulk.
> 
> On Fri, May 18, 2012 at 12:11 PM,  wrote:
> 
>> Hi ,
>> 
>> Has anyone used Hadoop and splunk, or any other real-time processing tool
>> over Hadoop?
>> 
>> Regards,
>> Shreya
>> 
>> 
>> 
> 
> Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com


Re: Why this problem is not solved yet ?

2012-05-18 Thread Ravi Shankar Nair
Hi Ravi,

Let me try and revert in a couple of hours; thanks for the input.

Sent from my Beethoven 


On May 18, 2012, at 3:45 PM, Ravi Prakash  wrote:

> Hi Ravishankar,
> 
> I don't see two very important processes in your jps output. Just like
> there's JobTracker and NameNode, you should also have "TaskTracker" and
> "DataNode". The JobTracker only schedules jobs. To actually run the map
> reduce tasks, it needs TaskTrackers. This is why you see the jobtracker
> accepting your jobs and then getting stuck: because it doesn't have
> TaskTrackers to run that job on.
> 
> If I were you, I'd first check why the Datanode is not coming up. All 4
> daemons are necessary for running jobs. The logs for those two should be in
> the same directory in which you find the JT's logs.
> 
> Hope this helps.
> Ravi.
> 
> On Fri, May 18, 2012 at 5:17 AM, Ravishankar Nair <
> ravishankar.n...@gmail.com> wrote:
> 
>> Additionally, attached is the output of the job that I run( I mean the
>> example program named grep)
>> 
>> 
>> On Fri, May 18, 2012 at 6:15 AM, Ravishankar Nair <
>> ravishankar.n...@gmail.com> wrote:
>> 
>>> Hi Ravi,
>>> 
>>> Yes, it is running. Here is the output:
>>> rn13067@WSUSJXLHRN13067 /home/hadoop-1.0.3
>>> $ jps
>>> 5068 NameNode
>>> 5836 Jps
>>> 3516 JobTracker
>>> 
>>> 
>>> Here are the logs from JOBTRACKER:-
>>> 
>>> 2012-05-17 21:41:31,772 INFO org.apache.hadoop.mapred.TaskTracker:
>>> STARTUP_MSG:
>>> /
>>> STARTUP_MSG: Starting TaskTracker
>>> 
>>> STARTUP_MSG:   host = WSUSJXLHRN13067/192.168.0.16
>>> STARTUP_MSG:   args = []
>>> STARTUP_MSG:   version = 1.0.3
>>> STARTUP_MSG:   build =
>>> https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r
>>> 1335192; compiled by 'hortonfo' on Tue May  8 20:31:25 UTC 2012
>>> /
>>> 2012-05-17 21:41:31,944 INFO
>>> org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from
>>> hadoop-metrics2.properties
>>> 2012-05-17 21:41:31,990 INFO
>>> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source
>>> MetricsSystem,sub=Stats registered.
>>> 2012-05-17 21:41:31,990 INFO
>>> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot
>>> period at 10 second(s).
>>> 2012-05-17 21:41:31,990 INFO
>>> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: TaskTracker metrics
>>> system started
>>> 2012-05-17 21:41:32,256 INFO
>>> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source ugi
>>> registered.
>>> 2012-05-17 21:41:32,256 WARN
>>> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi already
>>> exists!
>>> 2012-05-17 21:41:32,365 INFO org.mortbay.log: Logging to
>>> org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via
>>> org.mortbay.log.Slf4jLog
>>> 2012-05-17 21:41:32,412 INFO org.apache.hadoop.http.HttpServer: Added
>>> global filtersafety
>>> (class=org.apache.hadoop.http.HttpServer$QuotingInputFilter)
>>> 2012-05-17 21:41:32,428 INFO org.apache.hadoop.mapred.TaskLogsTruncater:
>>> Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
>>> 2012-05-17 21:41:32,444 INFO org.apache.hadoop.mapred.TaskTracker:
>>> Starting tasktracker with owner as SYSTEM
>>> 2012-05-17 21:41:32,444 INFO org.apache.hadoop.mapred.TaskTracker: Good
>>> mapred local directories are: /tmp/hadoop-SYSTEM/mapred/local
>>> 2012-05-17 21:41:32,459 WARN org.apache.hadoop.util.NativeCodeLoader:
>>> Unable to load native-hadoop library for your platform... using
>>> builtin-java classes where applicable
>>> 2012-05-17 21:41:32,459 ERROR org.apache.hadoop.mapred.TaskTracker: Can
>>> not start task tracker because java.io.IOException: Failed to set
>>> permissions of path: \tmp\hadoop-SYSTEM\mapred\local\ttprivate to 0700
>>>at org.apache.hadoop.fs.FileUtil.checkReturnValue(FileUtil.java:689)
>>>at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:662)
>>>at
>>> org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:509)
>>>at
>>> org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:344)
>>>at
>>> org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:189)
>>>at
>>> org.apache.hadoop.mapred.TaskTracker.initialize(TaskTracker.java:728)
>>>at org.apache.hadoop.mapred.TaskTracker.<init>(TaskTracker.java:1459)
>>>at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:3742)
>>> 
>>> 2012-05-17 21:41:32,459 INFO org.apache.hadoop.mapred.TaskTracker:
>>> SHUTDOWN_MSG:
>>> /
>>> SHUTDOWN_MSG: Shutting down TaskTracker at WSUSJXLHRN13067/192.168.0.16
>>> /
>>> 
>>> Any clue? Thanks
>>> Regards,
>>> ravi
>>> 
>>> 
>>> 
>>> On Fri, May 18, 2012 at 12:01 AM, Ravi Prakash wrote:
>>> 
 Ravishankar,
 
 If you run $ jps, do you see a TaskTracker process running? Can you
>

Re: Why this problem is not solved yet ?

2012-05-18 Thread Ravi Prakash
Hi Ravishankar,

I don't see two very important processes in your jps output. Just like
there's JobTracker and NameNode, you should also have "TaskTracker" and
"DataNode". The JobTracker only schedules jobs. To actually run the map
reduce tasks, it needs TaskTrackers. This is why you see the jobtracker
accepting your jobs and then getting stuck: because it doesn't have
TaskTrackers to run that job on.

If I were you, I'd first check why the Datanode is not coming up. All 4
daemons are necessary for running jobs. The logs for those two should be in
the same directory in which you find the JT's logs.
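For reference, once everything has come up, jps on a healthy single-node 1.0.x
install would typically show all of the daemons, something like this (the PIDs
below are purely illustrative):

4301 NameNode
4388 DataNode
4475 SecondaryNameNode
4552 JobTracker
4639 TaskTracker
4720 Jps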

Hope this helps.
Ravi.

On Fri, May 18, 2012 at 5:17 AM, Ravishankar Nair <
ravishankar.n...@gmail.com> wrote:

> Additionally, attached is the output of the job that I run( I mean the
> example program named grep)
>
>
> On Fri, May 18, 2012 at 6:15 AM, Ravishankar Nair <
> ravishankar.n...@gmail.com> wrote:
>
>> Hi Ravi,
>>
>> Yes, it is running. Here is the output:
>> rn13067@WSUSJXLHRN13067 /home/hadoop-1.0.3
>> $ jps
>> 5068 NameNode
>> 5836 Jps
>> 3516 JobTracker
>>
>>
>> Here are the logs from JOBTRACKER:-
>>
>> 2012-05-17 21:41:31,772 INFO org.apache.hadoop.mapred.TaskTracker:
>> STARTUP_MSG:
>> /
>> STARTUP_MSG: Starting TaskTracker
>>
>> STARTUP_MSG:   host = WSUSJXLHRN13067/192.168.0.16
>> STARTUP_MSG:   args = []
>> STARTUP_MSG:   version = 1.0.3
>> STARTUP_MSG:   build =
>> https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r
>> 1335192; compiled by 'hortonfo' on Tue May  8 20:31:25 UTC 2012
>> /
>> 2012-05-17 21:41:31,944 INFO
>> org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from
>> hadoop-metrics2.properties
>> 2012-05-17 21:41:31,990 INFO
>> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source
>> MetricsSystem,sub=Stats registered.
>> 2012-05-17 21:41:31,990 INFO
>> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot
>> period at 10 second(s).
>> 2012-05-17 21:41:31,990 INFO
>> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: TaskTracker metrics
>> system started
>> 2012-05-17 21:41:32,256 INFO
>> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source ugi
>> registered.
>> 2012-05-17 21:41:32,256 WARN
>> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi already
>> exists!
>> 2012-05-17 21:41:32,365 INFO org.mortbay.log: Logging to
>> org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via
>> org.mortbay.log.Slf4jLog
>> 2012-05-17 21:41:32,412 INFO org.apache.hadoop.http.HttpServer: Added
>> global filtersafety
>> (class=org.apache.hadoop.http.HttpServer$QuotingInputFilter)
>> 2012-05-17 21:41:32,428 INFO org.apache.hadoop.mapred.TaskLogsTruncater:
>> Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
>> 2012-05-17 21:41:32,444 INFO org.apache.hadoop.mapred.TaskTracker:
>> Starting tasktracker with owner as SYSTEM
>> 2012-05-17 21:41:32,444 INFO org.apache.hadoop.mapred.TaskTracker: Good
>> mapred local directories are: /tmp/hadoop-SYSTEM/mapred/local
>> 2012-05-17 21:41:32,459 WARN org.apache.hadoop.util.NativeCodeLoader:
>> Unable to load native-hadoop library for your platform... using
>> builtin-java classes where applicable
>> 2012-05-17 21:41:32,459 ERROR org.apache.hadoop.mapred.TaskTracker: Can
>> not start task tracker because java.io.IOException: Failed to set
>> permissions of path: \tmp\hadoop-SYSTEM\mapred\local\ttprivate to 0700
>> at org.apache.hadoop.fs.FileUtil.checkReturnValue(FileUtil.java:689)
>> at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:662)
>> at
>> org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:509)
>> at
>> org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:344)
>> at
>> org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:189)
>> at
>> org.apache.hadoop.mapred.TaskTracker.initialize(TaskTracker.java:728)
>> at org.apache.hadoop.mapred.TaskTracker.<init>(TaskTracker.java:1459)
>> at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:3742)
>>
>> 2012-05-17 21:41:32,459 INFO org.apache.hadoop.mapred.TaskTracker:
>> SHUTDOWN_MSG:
>> /
>> SHUTDOWN_MSG: Shutting down TaskTracker at WSUSJXLHRN13067/192.168.0.16
>> /
>>
>> Any clue? Thanks
>> Regards,
>> ravi
>>
>>
>>
>> On Fri, May 18, 2012 at 12:01 AM, Ravi Prakash wrote:
>>
>>> Ravishankar,
>>>
>>> If you run $ jps, do you see a TaskTracker process running? Can you
>>> please
>>> post the tasktracker logs as well?
>>>
>>> On Thu, May 17, 2012 at 8:49 PM, Ravishankar Nair <
>>> ravishankar.n...@gmail.com> wrote:
>>>
>>> > Dear experts,
>>> >
>>> > Today is my tenth day working with Hadoop on installing on my windows
>>> > machine. I am trying again an

Re: Splunk + Hadoop

2012-05-18 Thread Russell Jurney
I'm playing with using Hadoop and Pig to load MongoDB with data for Cube to
consume. Cube is a real-time tool, but we'll be replaying events from the
past. Does that count? It is nice to batch-backfill metrics into 'real-time'
systems in bulk.

On Fri, May 18, 2012 at 12:11 PM,  wrote:

> Hi ,
>
> Has anyone used Hadoop and splunk, or any other real-time processing tool
> over Hadoop?
>
> Regards,
> Shreya
>
>
>

Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com


Splunk + Hadoop

2012-05-18 Thread Shreya.Pal
Hi,

Has anyone used Hadoop and Splunk, or any other real-time processing tool over
Hadoop?

Regards,
Shreya



This e-mail and any files transmitted with it are for the sole use of the 
intended recipient(s) and may contain confidential and privileged information. 
If you are not the intended recipient(s), please reply to the sender and 
destroy all copies of the original message. Any unauthorized review, use, 
disclosure, dissemination, forwarding, printing or copying of this email, 
and/or any action taken in reliance on the contents of this e-mail is strictly 
prohibited and may be unlawful.


Need Urgent help

2012-05-18 Thread samir das mohapatra
Hi

   I want to implement a workflow within the mapper. I am sharing my concept
through the architecture diagram; please correct me if I am wrong, and suggest
a good approach for it.

   Many thanks in advance.

  Thanks


Re: Hadoop-on-demand and torque

2012-05-18 Thread Pierre Antoine DuBoDeNa
I am also interested in learning about myHadoop, as I use a shared storage
system and everything runs on VMs rather than actual dedicated servers.

In an Amazon EC2-like environment where you just have VMs and a huge central
store, is it at all helpful to use Hadoop to distribute jobs and maybe
parallelize algorithms, or is it better to go with other technologies?

2012/5/18 Manu S 

> Hi All,
>
> Guess HOD could be useful existing HPC cluster with Torque scheduler which
> needs to run map-reduce jobs.
>
> Also read about *myHadoop- Hadoop on demand on traditional HPC
> resources*will support many HPC schedulers like SGE, PBS etc to over
> come the
> integration of shared-architecture(HPC) & shared-nothing
> architecture(Hadoop).
>
> Any real use case scenarios for integrating hadoop map/reduce in existing
> HPC cluster and what are the advantages of using hadoop features in HPC
> cluster?
>
> Appreciate your comments on the same.
>
> Thanks,
> Manu S
>
>
>
> On Fri, May 18, 2012 at 12:41 AM, Merto Mertek 
> wrote:
>
> > If I understand it right HOD is mentioned mainly for merging existing HPC
> > clusters with hadoop and for testing purposes..
> >
> > I cannot find what is the role of Torque here (just initial nodes
> > allocation?) and which is the default scheduler of HOD ?  Probably the
> > scheduler from the hadoop distribution?
> >
> > In the doc is mentioned a MAUI scheduler, but probably if there would be
> an
> > integration with hadoop there will be any document on it..
> >
> > thanks..
> >
>


Re: how to rebalance individual data node?

2012-05-18 Thread Harsh J
Jim,

The HDFS balancer presently does not look at the individual disks of a DN;
it only views each DN as a whole (the sum of all its disks' usage). The
improvement to balance the disks within a single DN is tracked at
https://issues.apache.org/jira/browse/HDFS-1312

You may balance your disks out manually, however:
http://wiki.apache.org/hadoop/FAQ#On_an_individual_data_node.2C_how_do_you_balance_the_blocks_on_the_disk.3F
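Until HDFS-1312 is in, if you just want to see how far apart the disks on one
DN are, a trivial check of each configured dfs.data.dir is enough. A rough
sketch, assuming you substitute your own dfs.data.dir paths (the ones below
are only illustrative):

import java.io.File;

// Rough per-disk usage report for a DataNode's configured dfs.data.dir paths.
public class DiskUsageCheck {
  public static void main(String[] args) {
    // Illustrative paths; replace with the directories from your dfs.data.dir.
    String[] dataDirs = {"/data/1/dfs/dn", "/data/2/dfs/dn"};
    for (String dir : dataDirs) {
      File f = new File(dir);
      long total = f.getTotalSpace();
      long free = f.getUsableSpace();
      double usedPct = total == 0 ? 0.0 : 100.0 * (total - free) / total;
      System.out.printf("%s: %.1f%% used%n", dir, usedPct);
    }
  }
}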

On Fri, May 18, 2012 at 5:54 PM, Jim Donofrio  wrote:
> Lets say that every node in your cluster has 2 same sized disks and one is
> 50% full and the other is 100% full. According to my understanding of the
> balancer documentation, all data nodes will be at the average utilization of
> 75% so no balancing will occur yet one hard drive in each node is struggling
> at capacity. Is there any way to run the balancer just on a datanode to
> force each disk to be 75% full?
>
> Thanks



-- 
Harsh J


how to rebalance individual data node?

2012-05-18 Thread Jim Donofrio
Let's say that every node in your cluster has 2 same-sized disks, and one
is 50% full while the other is 100% full. According to my understanding of
the balancer documentation, all data nodes will be at the average utilization
of 75%, so no balancing will occur, yet one hard drive in each node is
struggling at capacity. Is there any way to run the balancer on just one
datanode, to force each disk to be 75% full?


Thanks


Re: is hadoop suitable for us?

2012-05-18 Thread Luca Pireddu
We're using a multi-user Hadoop MapReduce installation with up to 100 
computing nodes, without HDFS.  Since we have a shared cluster and not 
all apps use Hadoop, we grow/shrink the Hadoop cluster as the load 
changes.  It's working, and because of our hardware setup performance is 
quite close to what we had with HDFS.  We're storing everything directly 
on the SAN.
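In case it is useful to others, the Hadoop-specific part of running this way
is small: the job driver just points the default filesystem at the shared
mount instead of HDFS. A minimal sketch (the paths and property value are
illustrative, not our exact configuration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Minimal driver that runs MapReduce over a shared POSIX mount (e.g. a SAN)
// instead of HDFS, by using the local filesystem as the default FS.
public class SharedStorageJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "file:///");   // no HDFS; read/write the shared mount
    Job job = new Job(conf, "shared-storage-example");
    job.setJarByClass(SharedStorageJob.class);
    // Mapper/Reducer classes would be set here as usual.
    FileInputFormat.addInputPath(job, new Path("/san/input"));    // illustrative paths
    FileOutputFormat.setOutputPath(job, new Path("/san/output"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}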


The only problem so far has been trying to get the system to work 
without running the JT as root (I posted yesterday about that problem).



Luca




On 05/18/2012 06:10 AM, Pierre Antoine DuBoDeNa wrote:

Did you use HDFS too, or did you store everything on the SAN directly?

I don't have a number in GB/TB (it might be about 2TB, so not really that
"huge"), but there are more than 100 million documents to be processed. On a
single machine we can currently process about 200,000 docs/day (several
parsing, indexing, and metadata-extraction steps have to be done). So in the
worst case we want to use the 50 VMs to distribute the processing.

2012/5/17 Sagar Shukla


Hi PA,
 In my environment, we had a SAN storage and I/O was pretty good. So if
you have similar environment then I don't see any performance issues.

Just out of curiosity - what amount of data are you looking forward to
process ?

Regards,
Sagar

-Original Message-
From: Pierre Antoine Du Bois De Naurois [mailto:pad...@gmail.com]
Sent: Thursday, May 17, 2012 8:29 PM
To: common-user@hadoop.apache.org
Subject: Re: is hadoop suitable for us?

Thanks Sagar, Mathias and Michael for your replies.

It seems we will have to go with hadoop even if I/O will be slow due to
our configuration.

I will try to update on how it worked for our case.

Best,
PA



2012/5/17 Michael Segel


The short answer is yes.
The longer answer is that you will have to account for the latencies.

There is more but you get the idea..

Sent from my iPhone

On May 17, 2012, at 5:33 PM, "Pierre Antoine Du Bois De Naurois"<
pad...@gmail.com>  wrote:


We have a large amount of text files that we want to process and index
(plus applying other algorithms).

The problem is that our configuration is share-everything while Hadoop
has a share-nothing configuration.

We have 50 VMs and not actual servers, and these share a huge
central storage. So using HDFS might not be really useful, as
replication will not help and distribution of files has no meaning when
all files will again be located on the same HDD. I am afraid that
I/O will be very slow with or without HDFS. So I am wondering if it
will really help us to use Hadoop/HBase/Pig etc. to distribute and
do several parallel tasks, or is it "better" to install something
different (which I am not sure what). We heard myHadoop is better
for such kinds of configurations; have any clue about it?

For example, we now have a central MySQL to check if we have already
processed a document, and we keep several metadata fields there. Soon we
will have to distribute it, as there is not enough space in one VM. But
will Hadoop/HBase be useful? We don't want to do any complex
join/sort of the data; we just want to run queries to check whether a
document was already processed and, if not, add it with several of
its metadata fields.


We heard sungrid, for example, is another way to go, but it's
commercial. We are somewhat lost, so any help/ideas/suggestions are
appreciated.


Best,
PA



2012/5/17 Abhishek Pratap Singh


Hi,

For your question of whether Hadoop can be used without HDFS, the answer
is yes. Hadoop can be used with any kind of distributed file system.
But I am not able to understand the problem statement clearly enough to
advise from my point of view.
Are you processing text files and saving them in a distributed database?

Regards,
Abhishek

On Thu, May 17, 2012 at 1:46 PM, Pierre Antoine Du Bois De Naurois
<  pad...@gmail.com>  wrote:


We want to distribute the processing of text files: processing of
large machine-learning tasks, having a distributed database since we
have a big amount of data, etc.

The problem is that each VM can hold up to 2TB of data (a limitation
of the VM), and we have 20TB of data. So we have to distribute the
processing, the database, etc. But all that data will be in a shared,
huge central file system.

We heard about myHadoop, but we are not sure how it differs
from Hadoop.

If we run Hadoop/MapReduce without using HDFS, is that an option?

best,
PA


2012/5/17 Mathias Herberts


Hadoop does not perform well with shared storage and VMs.

The question should be asked first regarding what you're trying to
achieve, not about your infra.
On May 17, 2012 10:39 PM, "Pierre Antoine Du Bois De Naurois"<
pad...@gmail.com>  wrote:


Hello,

We have about 50 VMs and we want to distribute processing across them.
However, these VMs share a huge data storage system, and thus their
"virtual" HDDs are all located on the same computer. Would Hadoop be
useful for such a configuration? Could we use Hadoop without HDFS, so
that we can retrieve and store everything in the same storage?

Thanks,
PA







--
Luca Pireddu
CRS4

Re: custom FileInputFormat class

2012-05-18 Thread John Hancock
Devaraj,

Thanks for the pointer.

I ended up extending FileInputFormat.

I made some notes about the program I wrote to use the custom
FileInputFormat here:

https://cakephp.rootser.com/posts/view/64

I think it may be because I'm using 1.0.1, but I did not need to write a
getSplits() method.  However, I did need to write an isSplitable() method,
where I just went with the default implementation.  Is assigning a codec in
the job configuration the way one makes one's input splittable?

Also, I think that if I now take my FileInputFormat subclass
(RootserFileInputFormat on the page I link to above), change its
nextKeyValue() method to use an ObjectInputStream, give
RootserFileInputFormat a type parameter, and make the type of object that
nextKeyValue() reads out of the split match that parameter, I will have a
FileInputFormat that can read any kind of (serializable) object out of a
split.  While this is cool, I can't believe I am the first person to think of
something like this.  Do you know if there is already a way to do this using
the Hadoop framework?

Thanks for the pointer on how to get started.
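In case it helps anyone else reading this thread, below is roughly what such a
format can look like in the new (org.apache.hadoop.mapreduce) API. It is only a
sketch: the class names are made up, and it assumes fixed-size binary records
of one int followed by one long, with splitting disabled so records never
straddle a split boundary.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Sketch of a FileInputFormat for binary files of fixed-size (int, long) records.
public class IntLongInputFormat extends FileInputFormat<IntWritable, LongWritable> {

  private static final int RECORD_SIZE = 4 + 8;   // one int key + one long value

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;   // one mapper per file, so records never straddle splits
  }

  @Override
  public RecordReader<IntWritable, LongWritable> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    return new IntLongRecordReader();
  }

  static class IntLongRecordReader extends RecordReader<IntWritable, LongWritable> {
    private FSDataInputStream in;
    private long length;
    private long pos;
    private final IntWritable key = new IntWritable();
    private final LongWritable value = new LongWritable();

    @Override
    public void initialize(InputSplit genericSplit, TaskAttemptContext context)
        throws IOException {
      FileSplit split = (FileSplit) genericSplit;
      Configuration conf = context.getConfiguration();
      Path path = split.getPath();
      length = split.getLength();
      in = path.getFileSystem(conf).open(path);
      in.seek(split.getStart());
    }

    @Override
    public boolean nextKeyValue() throws IOException {
      if (pos + RECORD_SIZE > length) {
        return false;               // no complete record left
      }
      key.set(in.readInt());        // FSDataInputStream is a DataInputStream
      value.set(in.readLong());
      pos += RECORD_SIZE;
      return true;
    }

    @Override public IntWritable getCurrentKey() { return key; }
    @Override public LongWritable getCurrentValue() { return value; }
    @Override public float getProgress() { return length == 0 ? 1.0f : (float) pos / length; }
    @Override public void close() throws IOException { if (in != null) in.close(); }
  }
}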

On Thu, May 17, 2012 at 6:32 AM, Devaraj k  wrote:

> Hi John,
>
>
> You can extend FileInputFormat (or implement InputFormat), and then you
> need to implement the methods below.
>
> 1. InputSplit[] getSplits(JobConf job, int numSplits): for splitting the
> input files logically for the job. If FileInputFormat.getSplits(JobConf
> job, int numSplits) suits your requirement, you can make use of it.
> Otherwise you can implement it based on your need.
>
> 2. RecordReader getRecordReader(InputSplit split, JobConf job, Reporter
> reporter): for reading the input split.
>
>
> Thanks
> Devaraj
>
> 
> From: John Hancock [jhancock1...@gmail.com]
> Sent: Thursday, May 17, 2012 3:40 PM
> To: common-user@hadoop.apache.org
> Subject: custom FileInputFormat class
>
> All,
>
> Can anyone on the list point me in the right direction as to how to write
> my own FileInputFormat class?
>
> Perhaps this is not even the way I should go, but my goal is to write a
> MapReduce job that gets its input from a binary file of integers and longs.
>
> -John
>


Re: Why this problem is not solved yet ?

2012-05-18 Thread Ravishankar Nair
Hi Ravi,

Yes, it is running. Here is the output:
rn13067@WSUSJXLHRN13067 /home/hadoop-1.0.3
$ jps
5068 NameNode
5836 Jps
3516 JobTracker


Here are the TaskTracker logs:

2012-05-17 21:41:31,772 INFO org.apache.hadoop.mapred.TaskTracker:
STARTUP_MSG:
/
STARTUP_MSG: Starting TaskTracker
STARTUP_MSG:   host = WSUSJXLHRN13067/192.168.0.16
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 1.0.3
STARTUP_MSG:   build =
https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r
1335192; compiled by 'hortonfo' on Tue May  8 20:31:25 UTC 2012
/
2012-05-17 21:41:31,944 INFO org.apache.hadoop.metrics2.impl.MetricsConfig:
loaded properties from hadoop-metrics2.properties
2012-05-17 21:41:31,990 INFO
org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source
MetricsSystem,sub=Stats registered.
2012-05-17 21:41:31,990 INFO
org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot
period at 10 second(s).
2012-05-17 21:41:31,990 INFO
org.apache.hadoop.metrics2.impl.MetricsSystemImpl: TaskTracker metrics
system started
2012-05-17 21:41:32,256 INFO
org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source ugi
registered.
2012-05-17 21:41:32,256 WARN
org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi already
exists!
2012-05-17 21:41:32,365 INFO org.mortbay.log: Logging to
org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via
org.mortbay.log.Slf4jLog
2012-05-17 21:41:32,412 INFO org.apache.hadoop.http.HttpServer: Added
global filtersafety
(class=org.apache.hadoop.http.HttpServer$QuotingInputFilter)
2012-05-17 21:41:32,428 INFO org.apache.hadoop.mapred.TaskLogsTruncater:
Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
2012-05-17 21:41:32,444 INFO org.apache.hadoop.mapred.TaskTracker: Starting
tasktracker with owner as SYSTEM
2012-05-17 21:41:32,444 INFO org.apache.hadoop.mapred.TaskTracker: Good
mapred local directories are: /tmp/hadoop-SYSTEM/mapred/local
2012-05-17 21:41:32,459 WARN org.apache.hadoop.util.NativeCodeLoader:
Unable to load native-hadoop library for your platform... using
builtin-java classes where applicable
2012-05-17 21:41:32,459 ERROR org.apache.hadoop.mapred.TaskTracker: Can not
start task tracker because java.io.IOException: Failed to set permissions
of path: \tmp\hadoop-SYSTEM\mapred\local\ttprivate to 0700
at org.apache.hadoop.fs.FileUtil.checkReturnValue(FileUtil.java:689)
at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:662)
at
org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:509)
at
org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:344)
at
org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:189)
at org.apache.hadoop.mapred.TaskTracker.initialize(TaskTracker.java:728)
at org.apache.hadoop.mapred.TaskTracker.<init>(TaskTracker.java:1459)
at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:3742)

2012-05-17 21:41:32,459 INFO org.apache.hadoop.mapred.TaskTracker:
SHUTDOWN_MSG:
/
SHUTDOWN_MSG: Shutting down TaskTracker at WSUSJXLHRN13067/192.168.0.16
/

Any clue? Thanks
Regards,
ravi
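For anyone who wants to poke at this outside of the TaskTracker, the call that
blows up in the trace above is RawLocalFileSystem.setPermission. A tiny
standalone sketch (assuming hadoop-core 1.0.3 on the classpath; the path is
illustrative) reproduces just the mkdir plus chmod 700 step that
TaskTracker.initialize() performs on its ttprivate directory:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RawLocalFileSystem;
import org.apache.hadoop.fs.permission.FsPermission;

// Reproduces in isolation the failing mkdir + setPermission(0700) call chain.
public class PermissionCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    RawLocalFileSystem fs = new RawLocalFileSystem();
    fs.initialize(URI.create("file:///"), conf);
    Path dir = new Path("/tmp/ttprivate-test");   // illustrative path
    fs.mkdirs(dir);
    fs.setPermission(dir, new FsPermission((short) 0700));   // the call that fails on Windows
    System.out.println("setPermission succeeded for " + dir);
  }
}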


On Fri, May 18, 2012 at 12:01 AM, Ravi Prakash  wrote:

> Ravishankar,
>
> If you run $ jps, do you see a TaskTracker process running? Can you please
> post the tasktracker logs as well?
>
> On Thu, May 17, 2012 at 8:49 PM, Ravishankar Nair <
> ravishankar.n...@gmail.com> wrote:
>
> > Dear experts,
> >
> > Today is my tenth day working on installing Hadoop on my Windows
> > machine. I am trying again and again because, somewhere, someone has
> > written that it works on Windows with Cygwin (and no one has written that
> > Hadoop won't work on Windows). I am attaching my config files.
> >
> > Kindly help me, if anything can make this work. A feeble and humble
> > request to all experts out there.
> >
> > Here is the error; if you search, you can see thousands have reported
> > this, and there is no solution yet, though I have tried all ways
> > possible. I am using Windows XP SP3 and Hadoop (tried with five versions
> > so far, including 1.0.3). I am running on a single node (machine
> > WSUSJXLHRN13067, IP 192.168.0.16).
> > When I start Hadoop, there are no issues in any of the versions.
> >
> > rn13067@WSUSJXLHRN13067 /home/hadoop-1.0.3
> > $ bin/start-all.sh
> > starting namenode, logging to
> >
> /home/hadoop-1.0.3/libexec/../logs/hadoop-SUNDOOP-namenode-WSUSJXLHRN13067.out
> > localhost: starting datanode, logging to
> >
> /home/hadoop-1.0.3/libexec/../logs/hadoop-SUNDOOP-datanode-WSUSJXLHRN13067.out
> > localhost: starting secondarynamenode, logging to
> >
> /home/hadoop-1.0.3/libexec/../logs/hadoop-SUNDOOP-secondarynamenode-WSUSJXLHRN13067.out
> > starting jobtracker, logging to
> 

Re: Hadoop-on-demand and torque

2012-05-18 Thread Manu S
Hi All,

I guess HOD could be useful for an existing HPC cluster with a Torque
scheduler that needs to run map-reduce jobs.

I also read about myHadoop ("Hadoop on demand on traditional HPC resources"),
which will support many HPC schedulers like SGE, PBS etc. to overcome the gap
between the shared architecture (HPC) and the shared-nothing architecture
(Hadoop).

Are there any real use-case scenarios for integrating Hadoop map/reduce in an
existing HPC cluster, and what are the advantages of using Hadoop features in
an HPC cluster?

Appreciate your comments on the same.

Thanks,
Manu S



On Fri, May 18, 2012 at 12:41 AM, Merto Mertek  wrote:

> If I understand it right, HOD is intended mainly for merging existing HPC
> clusters with Hadoop and for testing purposes.
>
> I cannot find what the role of Torque is here (just initial node
> allocation?) nor which scheduler is the default for HOD. Probably the
> scheduler from the Hadoop distribution?
>
> The doc mentions a MAUI scheduler, but presumably if there were an
> integration with Hadoop there would be a document on it.
>
> thanks..
>


Re: is hadoop suitable for us?

2012-05-18 Thread Michael Segel
You are going to have to put HDFS on top of your SAN. 

The issue is that you introduce overhead and latencies by having attached 
storage rather than the drives physically on the bus within the case. 

Also I'm going to assume that your SAN is using RAID. 
One of the side effects of using a SAN is that you could reduce your 
replication factor from 3 to 2. 
(The SAN already protects you from disk failures if you're using RAID)
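If you do drop the replication factor, it is a one-line setting on the client
side (or dfs.replication in hdfs-site.xml). A minimal sketch, with an
illustrative path and not a recommendation for any particular cluster:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Lowers the replication factor for new files written by this client, and
// changes it for one existing file; dfs.replication defaults to 3.
public class ReplicationExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setInt("dfs.replication", 2);   // client-side default for new files
    FileSystem fs = FileSystem.get(conf);
    // Replication can also be changed per path for data that is already there:
    fs.setReplication(new Path("/data/example.txt"), (short) 2);   // illustrative path
  }
}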


On May 17, 2012, at 11:10 PM, Pierre Antoine DuBoDeNa wrote:

> You used HDFS too? or storing everything on SAN immediately?
> 
> I don't have number of GB/TB (it might be about 2TB so not really that
> "huge") but they are more than 100 million documents to be processed. In a
> single machine currently we can process about 200.000 docs/day (several
> parsing, indexing, metadata extraction has to be done). So in the worst
> case we want to use the 50 VMs to distribute the processing..
> 
> 2012/5/17 Sagar Shukla 
> 
>> Hi PA,
>>In my environment, we had a SAN storage and I/O was pretty good. So if
>> you have similar environment then I don't see any performance issues.
>> 
>> Just out of curiosity - what amount of data are you looking forward to
>> process ?
>> 
>> Regards,
>> Sagar
>> 
>> -Original Message-
>> From: Pierre Antoine Du Bois De Naurois [mailto:pad...@gmail.com]
>> Sent: Thursday, May 17, 2012 8:29 PM
>> To: common-user@hadoop.apache.org
>> Subject: Re: is hadoop suitable for us?
>> 
>> Thanks Sagar, Mathias and Michael for your replies.
>> 
>> It seems we will have to go with hadoop even if I/O will be slow due to
>> our configuration.
>> 
>> I will try to update on how it worked for our case.
>> 
>> Best,
>> PA
>> 
>> 
>> 
>> 2012/5/17 Michael Segel 
>> 
>>> The short answer is yes.
>>> The longer answer is that you will have to account for the latencies.
>>> 
>>> There is more but you get the idea..
>>> 
>>> Sent from my iPhone
>>> 
>>> On May 17, 2012, at 5:33 PM, "Pierre Antoine Du Bois De Naurois" <
>>> pad...@gmail.com> wrote:
>>> 
We have large amount of text files that we want to process and index
(plus applying other algorithms).

The problem is that our configuration is share-everything while hadoop
has a share-nothing configuration.

We have 50 VMs and not actual servers, and these share a huge
central storage. So using HDFS might not be really useful as
replication will not help, distribution of files have no meaning as
all files will be again located in the same HDD. I am afraid that
I/O will be very slow with or without HDFS. So i am wondering if it
will really help us to use hadoop/hbase/pig etc. to distribute and
do several parallel tasks.. or is "better" to install something
different (which i am not sure what). We heard myHadoop is better
for such kind of configurations, have any clue about it?

For example we now have a central mySQL to check if we have already
processed a document and keeping there several metadata. Soon we will
have to distribute it as there is not enough space in one VM, But
Hadoop/HBase will be useful? we don't want to do any complex
join/sort of the data, we just want to do queries to check if
already processed a document, and if not to add it with several of
it's metadata.

We heard sungrid for example is another way to go but it's
commercial. We are somewhat lost.. so any help/ideas/suggestions are
appreciated.

Best,
PA
 
 
 
 2012/5/17 Abhishek Pratap Singh 
 
> Hi,
> 
> For your question if HADOOP can be used without HDFS, the answer is
>> Yes.
> Hadoop can be used with any kind of distributed file system.
> But I m not able to understand the problem statement clearly to
> advice
>>> my
> point of view.
> Are you processing text file and saving in distributed database??
> 
> Regards,
> Abhishek
> 
> On Thu, May 17, 2012 at 1:46 PM, Pierre Antoine Du Bois De Naurois
> < pad...@gmail.com> wrote:
> 
>> We want to distribute processing of text files.. processing of
>> large machine learning tasks, have a distributed database as we
>> have big
>>> amount
>> of data etc.
>> 
>> The problem is that each VM can have up to 2TB of data (limitation
>> of
> VM),
>> and we have 20TB of data. So we have to distribute the processing,
>> the database etc. But all those data will be in a shared huge
>> central file system.
>> 
>> We heard about myHadoop, but we are not sure why is that any
>> different
> from
>> Hadoop.
>> 
>> If we run hadoop/mapreduce without using HDFS? is that an option?
>> 
>> best,
>> PA
>> 
>> 
>> 2012/5/17 Mathias Herberts 
>> 
>>> Hadoop does not perform well with shared storage and vms.
>>> 
>>> The question should be asked first regarding what y

Re: Append supported in hadoop 1.0.x branch?

2012-05-18 Thread Rodney O'Donnell
Perfect, thanks for the clarification.


On Fri, May 18, 2012 at 5:58 PM, Harsh J  wrote:

> Rodney,
>
> There are two things that comprised the 0.20-append branch which added
> "append" features, which to break down simply for 1.x:
>
> append() - Available: Yes. Supported/Recommended: No.
> sync() - Available: Yes. Supported/Recommended: Yes.
>
> Please also see these links for further info/conversations on this
> topic thats happened several times before:
>
> https://issues.apache.org/jira/browse/HADOOP-8230
> http://search-hadoop.com/m/638TD3bAXB1
> http://search-hadoop.com/m/hBPRp1EWELS1
>
> Let us know if you have further questions.
>
> On Fri, May 18, 2012 at 12:12 PM, Rodney O'Donnell 
> wrote:
> > Hi,
> >
> > Is FileSystem.append supported on hadoop 1.0.x?  (1.0.3 in particular).
> >
> > Reading this list I thought it was back in for 1.0, but it's disabled by
> > default so I'm not 100% sure.
> > It would be great to get a definitive answer.
> >
> > Cheers,
> >
> > Rod.
>
>
>
> --
> Harsh J
>


Re: Append supported in hadoop 1.0.x branch?

2012-05-18 Thread Harsh J
Rodney,

There are two things that comprised the 0.20-append branch's "append"
features; to break them down simply for 1.x:

append() - Available: Yes. Supported/Recommended: No.
sync() - Available: Yes. Supported/Recommended: Yes.
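For completeness, a minimal sketch of how the two calls look from a 1.0.x
client (the path is illustrative; append() additionally needs
dfs.support.append=true on the cluster, and as noted above it is not
recommended):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// sync(): flush written bytes to the datanodes so new readers can see them.
// append(): reopen an existing file for appending (available, not recommended).
public class AppendSyncExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path p = new Path("/tmp/append-sync-demo");   // illustrative path

    FSDataOutputStream out = fs.create(p);
    out.writeBytes("first batch of records\n");
    out.sync();                                   // visible to new readers from here on
    out.close();

    FSDataOutputStream more = fs.append(p);       // only if the cluster allows append
    more.writeBytes("a later batch\n");
    more.close();
  }
}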

Please also see these links for further info from conversations on this
topic that have happened several times before:

https://issues.apache.org/jira/browse/HADOOP-8230
http://search-hadoop.com/m/638TD3bAXB1
http://search-hadoop.com/m/hBPRp1EWELS1

Let us know if you have further questions.

On Fri, May 18, 2012 at 12:12 PM, Rodney O'Donnell  wrote:
> Hi,
>
> Is FileSystem.append supported on hadoop 1.0.x?  (1.0.3 in particular).
>
> Reading this list I thought it was back in for 1.0, but it's disabled by
> default so I'm not 100% sure.
> It would be great to get a definitive answer.
>
> Cheers,
>
> Rod.



-- 
Harsh J