Re: tutorial on Hadoop/Hbase utility classes

2011-09-01 Thread Arun C Murthy
Thanks for putting this up, it's very useful.

I'd encourage you to contribute this as a documentation patch so that you help 
everyone who comes to hadoop.apache.org, plus you can be a part of the project 
and a contributor.

I can help with the mechanics - here is a link to help you get started:
http://wiki.apache.org/hadoop/HowToContribute

Arun

On Aug 31, 2011, at 4:57 PM, Sujee Maniyam wrote:

 Here is a tutorial on some handy Hadoop classes - with sample source code.
 
 http://sujee.net/tech/articles/hadoop-useful-classes/
 
 Would appreciate any feedback / suggestions.
 
 thanks  all
 Sujee Maniyam
 http://sujee.net



Re: Binary content

2011-09-01 Thread Dieter Plaetinck
On Wed, 31 Aug 2011 08:44:42 -0700
Mohit Anchlia mohitanch...@gmail.com wrote:

 Does map-reduce work well with binary content in the files? This
 binary content is basically some CAD files, and the map-reduce program needs
 to read these files using some proprietary tool, extract values and do
 some processing. Wondering if there are others doing a similar type of
 processing. Best practices etc.

Yes, it works. You just need to select the right input format.
Personally I store all my binary files in a SequenceFile (because my binary 
files are small).
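
As a rough sketch of that approach (assuming the 0.20-era SequenceFile API used elsewhere in this thread; the paths, the Text/BytesWritable key/value choice and the class name are only illustrative), small local binary files can be packed into one SequenceFile keyed by file name, and a job can then read it back with SequenceFileInputFormat:

import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackBinaries {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path(args[0]);                       // e.g. /data/cad.seq
        SequenceFile.Writer writer =
            SequenceFile.createWriter(fs, conf, out, Text.class, BytesWritable.class);
        try {
            for (int i = 1; i < args.length; i++) {         // local files to pack
                File f = new File(args[i]);
                byte[] bytes = new byte[(int) f.length()];
                DataInputStream in = new DataInputStream(new FileInputStream(f));
                try {
                    in.readFully(bytes);
                } finally {
                    in.close();
                }
                // key = original file name, value = the raw file contents
                writer.append(new Text(f.getName()), new BytesWritable(bytes));
            }
        } finally {
            writer.close();
        }
    }
}

Each map task then sees whole files as BytesWritable values, which avoids turning many small CAD files into many small HDFS files.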

Dieter


Timer jobs

2011-09-01 Thread Per Steffensen

Hi

I use Hadoop for a MapReduce job in my system. I would like to have the 
job run every 5th minute. Is there any distributed timer job stuff in 
Hadoop? Of course I could set up a timer in an external timer framework 
(CRON or something like that) that invokes the MapReduce job. But CRON 
is only running on one particular machine, so if that machine goes down 
my job will not be triggered. Then I could set up the timer on all or 
many machines, but I would not like the job to be run in more than one 
instance every 5th minute, so then the timer jobs would need to 
coordinate who is actually starting the job this time and all the rest 
would just have to do nothing. I guess I could come up with a solution to 
that - e.g. writing some lock stuff using HDFS files or by using 
ZooKeeper. But I would really like it if someone had already solved the 
problem, and provided some kind of a distributed timer framework 
running in a cluster, so that I could just register a timer job with 
the cluster, and then be sure that it is invoked every 5th minute, no 
matter if one or two particular machines in the cluster are down.


Any suggestions are very welcome.

Regards, Per Steffensen
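
As a rough illustration of the ZooKeeper locking idea above (the /timer-locks path and the slot naming are assumptions, the parent znode must already exist, and old slot nodes need separate cleanup): every machine runs its own local 5-minute timer, and at each tick they all race to create the same znode, so exactly one of them ends up submitting the job.

import java.util.Date;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class IntervalLock {
    private final ZooKeeper zk;

    public IntervalLock(ZooKeeper zk) {
        this.zk = zk;
    }

    /** Returns true on exactly one machine per 5-minute interval. */
    public boolean tryAcquire(Date now) throws KeeperException, InterruptedException {
        // Round down to the 5-minute boundary and use the slot number as the node
        // name, so every machine competes for the same znode in a given interval.
        long slot = now.getTime() / (5 * 60 * 1000L);
        String path = "/timer-locks/" + slot;
        try {
            zk.create(path, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            return true;   // won the race: this machine submits the MapReduce job
        } catch (KeeperException.NodeExistsException e) {
            return false;  // another machine already triggered this interval
        }
    }
}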


Re: Timer jobs

2011-09-01 Thread Ronen Itkin
Hi

Try to use Oozie for job coordination and workflows.



On Thu, Sep 1, 2011 at 12:30 PM, Per Steffensen st...@designware.dk wrote:

 Hi

 I use hadoop for a MapReduce job in my system. I would like to have the job
 run very 5th minute. Are there any distributed timer job stuff in hadoop?
 Of course I could setup a timer in an external timer framework (CRON or
 something like that) that invokes the MapReduce job. But CRON is only
 running on one particular machine, so if that machine goes down my job will
 not be triggered. Then I could setup the timer on all or many machines, but
 I would not like the job to be run in more than one instance every 5th
 minute, so then the timer jobs would need to coordinate who is actually
 starting the job this time and all the rest would just have to do nothing.
 Guess I could come up with a solution to that - e.g. writing some lock
 stuff using HDFS files or by using ZooKeeper. But I would really like if
 someone had already solved the problem, and provided some kind of a
 distributed timer framework running in a cluster, so that I could just
 register a timer job with the cluster, and then be sure that it is invoked
 every 5th minute, no matter if one or two particular machines in the cluster
 is down.

 Any suggestions are very welcome.

 Regards, Per Steffensen




-- 
*
Ronen Itkin*
Taykey | www.taykey.com


Re: Hadoop with Netapp

2011-09-01 Thread Steve Loughran

On 25/08/11 08:20, Sagar Shukla wrote:

Hi Hakan,

 Please find my comments inline in blue :



-Original Message-
From: Hakan (c)lter [mailto:hakanil...@gmail.com]
Sent: Thursday, August 25, 2011 12:28 PM
To: common-user@hadoop.apache.org
Subject: Hadoop with Netapp



Hi everyone,



We are going to create a new Hadoop cluster in our company; I have to get some 
advice from you:



1. Has anyone stored the whole of their Hadoop data not on local disks but on a NetApp 
or other storage system? Do we have to store data on local disks, and if so, is it 
because of performance issues?



sagar: Yes, we were using SAN LUNs for storing Hadoop data. SAN works faster 
than NAS in terms of performance while writing the data to the storage. Also SAN LUNs 
can be auto-mounted while booting up the system.


Silly question: why? SANs are SPOFs (Gray & van Ingen, MS, 2005; SAN 
responsible for 11% of TerraServer downtime).


Was it because you had the rack and wanted to run Hadoop, or did you 
want a more agile cluster? Because it's going to increase your cost of 
storage dramatically, which means you pay more per TB, or end up with 
less TB of storage. I wouldn't go this way for a dedicated Hadoop 
cluster. For a multi-use cluster, it's a different story.







2. What do you think about running Hadoop nodes in virtual (VMware) servers?



sagar: If high-speed computing is not a requirement for you, then Hadoop nodes 
in a VM environment could be a good option, but one other slight drawback is that when the 
VM crashes the in-memory data would be gone. Hadoop takes care of some 
amount of failover, but there is some amount of risk involved and it requires good HA 
building capabilities.



I do it for dev and test work, and for isolated clusters in a shared 
environment.


-for CPU bound stuff, it actually works quite well, as there's no 
significant overhead


-for HDD access, reading from the FS, writing to the FS and to store 
transient spill data you take a tangible performance hit. That's OK if 
you can afford to wait or rent a few extra CPUs -and your block size is 
such that those extra servers can help out -which may be in the map 
phase more than the reduce phase



Some Hadoop-ish projects -Stratosphere from TU Berlin in particular- are 
designed for VM infrastructure, so they come up with execution plans to use 
VMs efficiently.


-steve


Re: Turn off all Hadoop logs?

2011-09-01 Thread Steve Loughran

On 29/08/11 20:31, Frank Astier wrote:

Is it possible to turn off all the Hadoop logs simultaneously? In my unit 
tests, I don’t want to see the myriad “INFO” logs spewed out by various Hadoop 
components. I’m using:

    ((Log4JLogger) DataNode.LOG).getLogger().setLevel(Level.OFF);
    ((Log4JLogger) LeaseManager.LOG).getLogger().setLevel(Level.OFF);
    ((Log4JLogger) FSNamesystem.LOG).getLogger().setLevel(Level.OFF);
    ((Log4JLogger) DFSClient.LOG).getLogger().setLevel(Level.OFF);
    ((Log4JLogger) Storage.LOG).getLogger().setLevel(Level.OFF);

But I’m still missing some loggers...



You need a log4j.properties file on the CP that doesn't log so much. I 
do this by


 -removing /log4j.properties from the Hadoop jars in our (private) jar 
repository

 -having custom log4j.properties files in the test/ source trees

You could also start JUnit with the right log4j property to point it 
at a custom log4j file. I forget what that property is.
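
Another option, sketched here only as an illustration (it assumes log4j 1.x on the classpath, which Hadoop of this era uses; the class and method names are made up), is to switch whole logger hierarchies off programmatically from the test setup, which also catches loggers the per-component list in the question misses:

import org.apache.log4j.Level;
import org.apache.log4j.Logger;

public final class QuietHadoopLogs {
    /** Call once from a test setup / @BeforeClass method. */
    public static void silence() {
        // Everything under the Hadoop package hierarchy inherits this level,
        // so there is no need to hunt down each component's LOG field.
        Logger.getLogger("org.apache.hadoop").setLevel(Level.OFF);
        // Or, more drastically, turn the root logger off entirely:
        Logger.getRootLogger().setLevel(Level.OFF);
    }
}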




Re: Timer jobs

2011-09-01 Thread Per Steffensen

Hi

Thanks a lot for pointing me to Oozie. I have looked a little bit into 
Oozie and it seems like the component triggering jobs is called 
Coordinator Application. But nowhere can I see whether this 
Coordinator Application just runs on a single machine, and would 
therefore not trigger anything if that machine is down. Can you 
confirm that the Coordinator Application role is distributed in a 
distributed Oozie setup, so that jobs get triggered even if one or two 
machines are down?


Regards, Per Steffensen

Ronen Itkin wrote:

Hi

Try to use Oozie for job coordination and work flows.



On Thu, Sep 1, 2011 at 12:30 PM, Per Steffensen st...@designware.dk wrote:

  

Hi

I use hadoop for a MapReduce job in my system. I would like to have the job
run very 5th minute. Are there any distributed timer job stuff in hadoop?
Of course I could setup a timer in an external timer framework (CRON or
something like that) that invokes the MapReduce job. But CRON is only
running on one particular machine, so if that machine goes down my job will
not be triggered. Then I could setup the timer on all or many machines, but
I would not like the job to be run in more than one instance every 5th
minute, so then the timer jobs would need to coordinate who is actually
starting the job this time and all the rest would just have to do nothing.
Guess I could come up with a solution to that - e.g. writing some lock
stuff using HDFS files or by using ZooKeeper. But I would really like if
someone had already solved the problem, and provided some kind of a
distributed timer framework running in a cluster, so that I could just
register a timer job with the cluster, and then be sure that it is invoked
every 5th minute, no matter if one or two particular machines in the cluster
is down.

Any suggestions are very welcome.

Regards, Per Steffensen






  




Re: Timer jobs

2011-09-01 Thread Ronen Itkin
If I get you right, you are asking about installing Oozie as a distributed
and/or HA cluster?!
In that case I am not familiar with an out-of-the-box solution by Oozie.
But I think you can make up a solution of your own, for example:
installing Oozie on two servers on the same partition, which will be
synchronized by DRBD.
You can trigger a failover using Linux Heartbeat and that way maintain a
virtual IP.





On Thu, Sep 1, 2011 at 1:59 PM, Per Steffensen st...@designware.dk wrote:

 Hi

 Thanks a lot for pointing me to Oozie. I have looked a little bit into
 Oozie and it seems like the component triggering jobs is called
 Coordinator Application. But I really see nowhere that this Coordinator
 Application doesnt just run on a single machine, and that it will therefore
 not trigger anything if this machine is down. Can you confirm that the
 Coordinator Application-role is distributed in a distribued Oozie setup,
 so that jobs gets triggered even if one or two machines are down?

 Regards, Per Steffensen

 Ronen Itkin wrote:

  Hi

 Try to use Oozie for job coordination and work flows.



 On Thu, Sep 1, 2011 at 12:30 PM, Per Steffensen st...@designware.dk
 wrote:



 Hi

 I use hadoop for a MapReduce job in my system. I would like to have the
 job
 run very 5th minute. Are there any distributed timer job stuff in
 hadoop?
 Of course I could setup a timer in an external timer framework (CRON or
 something like that) that invokes the MapReduce job. But CRON is only
 running on one particular machine, so if that machine goes down my job
 will
 not be triggered. Then I could setup the timer on all or many machines,
 but
 I would not like the job to be run in more than one instance every 5th
 minute, so then the timer jobs would need to coordinate who is actually
 starting the job this time and all the rest would just have to do
 nothing.
 Guess I could come up with a solution to that - e.g. writing some lock
 stuff using HDFS files or by using ZooKeeper. But I would really like if
 someone had already solved the problem, and provided some kind of a
 distributed timer framework running in a cluster, so that I could
 just
 register a timer job with the cluster, and then be sure that it is
 invoked
 every 5th minute, no matter if one or two particular machines in the
 cluster
 is down.

 Any suggestions are very welcome.

 Regards, Per Steffensen












-- 
*
Ronen Itkin*
Taykey | www.taykey.com


I got the problem from Map output lost

2011-09-01 Thread Tu Tu
Since this week, my Hadoop cluster has been hitting this problem, with the following information:

Lost task tracker: tracker_rsync.host01:localhost/127.0.0.1:40759
Map output lost, rescheduling:
getMapOutput(attempt_201108021855_6734_m_97_1,2002) failed :
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
taskTracker/jobcache/job_201108021855_6734/attempt_201108021855_6734_m_97_1/output/file.out.index
in any of the configured local directories
at 
org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:389)
at 
org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:138)
at 
org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:2887)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
at 
org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502)
at 
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:363)
at 
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at 
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at 
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
at 
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at 
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:324)
at 
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
at 
org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:533)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:207)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:403)
at 
org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
at 
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)


In my application there are 2 mappers and 2 reducers, and there may
be 2000 of these map output losses, so the whole Hadoop job has been delayed by them.


Problem with Python + Hadoop: how to link .so outside Python?

2011-09-01 Thread Xiong Deng
Hi,

I have successfully installed scipy on my Python 2.7 on my local Linux, and
I want to pack my Python2.7 (with scipy) onto Hadoop and run my Python
MapReduce scripts,  like this:

 ${HADOOP_HOME}/bin/hadoop streaming \
     -input ${input} \
     -output ${output} \
     -mapper "python27/bin/python27.sh rp_extractMap.py" \
     -reducer "python27/bin/python27.sh rp_extractReduce.py" \
     -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
     -file rp_extractMap.py \
     -file rp_extractReduce.py \
     -file shitu_conf.py \
     -cacheArchive /share/python27.tar.gz#python27 \
     -outputformat org.apache.hadoop.mapred.TextOutputFormat \
     -inputformat org.apache.hadoop.mapred.CombineTextInputFormat \
     -jobconf mapred.max.split.size=51200 \
     -jobconf mapred.job.name=[reserve_price][rp_extract] \
     -jobconf mapred.job.priority=HIGH \
     -jobconf mapred.job.map.capacity=1000 \
     -jobconf mapred.job.reduce.capacity=200 \
     -jobconf mapred.reduce.tasks=200 \
     -jobconf num.key.fields.for.partition=2

I have to do this because the Hadoop server has installed its own Python of a
very low version which may not support some of my Python scripts, and I do
not have the privilege to install the scipy lib on that server. So I have to use the
-cacheArchive option to include my own Python 2.7 with scipy.

But I find that some of the .so files in scipy are linked to other dynamic
libs outside Python 2.7. For example:

$ ldd
~/local/python-2.7.2/lib/python2.7/site-packages/scipy/linalg/flapack.so
liblapack.so => /usr/local/atlas/lib/liblapack.so (0x002a956fd000)
libatlas.so => /usr/local/atlas/lib/libatlas.so (0x002a95df3000)
libgfortran.so.3 => /home/work/local/gcc-4.6.1/lib64/libgfortran.so.3 (0x002a9668d000)
libm.so.6 => /lib64/tls/libm.so.6 (0x002a968b6000)
libgcc_s.so.1 => /home/work/local/gcc-4.6.1/lib64/libgcc_s.so.1 (0x002a96a3c000)
libquadmath.so.0 => /home/work/local/gcc-4.6.1/lib64/libquadmath.so.0 (0x002a96b51000)
libc.so.6 => /lib64/tls/libc.so.6 (0x002a96c87000)
libpthread.so.0 => /lib64/tls/libpthread.so.0 (0x002a96ebb000)
/lib64/ld-linux-x86-64.so.2 (0x00552000)


So, my question is: how can I include these libs? Should I search for all the
linked .so and .a files on my local Linux box and pack them together with
Python 2.7? If yes, how can I get a full list of the libs needed, and how
can I make the packed Python 2.7 know where to find the new libs?

Thanks
Xiong


Re: Timer jobs

2011-09-01 Thread Alejandro Abdelnur
[moving common-user@ to BCC]

Oozie is not HA yet. But it would be relatively easy to make it so. It was
designed with that in mind; we even did a prototype.

Oozie consists of 2 services: a SQL database to store the Oozie jobs' state
and a servlet container where the Oozie app proper runs.

The solution for HA for the database, well, it is left to the database. This
means, you'll have to get an HA DB.

The solution for HA for the Oozie app is deploying the servlet container
with the Oozie app in more than one box (2 or 3); and front them by a HTTP
load-balancer.

The missing part is that the current Oozie lock-service is currently an
in-memory implementation. This should be replaced with a Zookeeper
implementation. Zookeeper could run externally or internally in all Oozie
servers. This is what was prototyped long ago.

Thanks.

Alejandro


On Thu, Sep 1, 2011 at 4:14 AM, Ronen Itkin ro...@taykey.com wrote:

 If I get you right you are asking about Installing Oozie as Distributed
 and/or HA cluster?!
 In that case I am not familiar with an out of the box solution by Oozie.
 But, I think you can made up a solution of your own, for example:
 Installing Oozie on two servers on the same partition which will be
 synchronized by DRBD.
 You can trigger a failover using linux Heartbeat and that way maintain a
 virtual IP.





 On Thu, Sep 1, 2011 at 1:59 PM, Per Steffensen st...@designware.dk
 wrote:

  Hi
 
  Thanks a lot for pointing me to Oozie. I have looked a little bit into
  Oozie and it seems like the component triggering jobs is called
  Coordinator Application. But I really see nowhere that this Coordinator
  Application doesnt just run on a single machine, and that it will
 therefore
  not trigger anything if this machine is down. Can you confirm that the
  Coordinator Application-role is distributed in a distribued Oozie
 setup,
  so that jobs gets triggered even if one or two machines are down?
 
  Regards, Per Steffensen
 
  Ronen Itkin wrote:
 
   Hi
 
  Try to use Oozie for job coordination and work flows.
 
 
 
  On Thu, Sep 1, 2011 at 12:30 PM, Per Steffensen st...@designware.dk
  wrote:
 
 
 
  Hi
 
  I use hadoop for a MapReduce job in my system. I would like to have the
  job
  run very 5th minute. Are there any distributed timer job stuff in
  hadoop?
  Of course I could setup a timer in an external timer framework (CRON or
  something like that) that invokes the MapReduce job. But CRON is only
  running on one particular machine, so if that machine goes down my job
  will
  not be triggered. Then I could setup the timer on all or many machines,
  but
  I would not like the job to be run in more than one instance every 5th
  minute, so then the timer jobs would need to coordinate who is actually
  starting the job this time and all the rest would just have to do
  nothing.
  Guess I could come up with a solution to that - e.g. writing some
 lock
  stuff using HDFS files or by using ZooKeeper. But I would really like
 if
  someone had already solved the problem, and provided some kind of a
  distributed timer framework running in a cluster, so that I could
  just
  register a timer job with the cluster, and then be sure that it is
  invoked
  every 5th minute, no matter if one or two particular machines in the
  cluster
  is down.
 
  Any suggestions are very welcome.
 
  Regards, Per Steffensen
 
 
 
 
 
 
 
 
 
 


 --
 *
 Ronen Itkin*
 Taykey | www.taykey.com



Re: Timer jobs

2011-09-01 Thread Per Steffensen

Thanks for your response. See comments below.

Regards, Per Steffensen

Alejandro Abdelnur wrote:

[moving common-user@ to BCC]

Oozie is not HA yet. But it would be relatively easy to make it. It was
designed with that in mind, we even did a prototype.
  
Ok, so if it isn't HA out-of-the-box I believe Oozie is too big a 
framework for my needs - I don't need all this workflow stuff - just a 
plain simple job trigger that fires every 5th minute. I guess I will 
try out something smaller like Quartz Scheduler. It also only has 
HA/cluster support through JDBC (JobStore), but I guess I could fairly 
easily make an HDFSFilesJobStore which still holds the properties needed so that 
Quartz clustering works.
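
For reference, the clustered-Quartz direction looks roughly like this (a sketch assuming Quartz 2.x with the stock JDBC JobStore rather than a hypothetical HDFS-backed one; the names, the data source and the 5-minute cron are illustrative). With isClustered=true every node runs the same code against a shared database, and Quartz ensures each firing is executed by exactly one node:

import java.util.Properties;
import org.quartz.CronScheduleBuilder;
import org.quartz.Job;
import org.quartz.JobBuilder;
import org.quartz.JobDetail;
import org.quartz.JobExecutionContext;
import org.quartz.JobKey;
import org.quartz.Scheduler;
import org.quartz.Trigger;
import org.quartz.TriggerBuilder;
import org.quartz.impl.StdSchedulerFactory;

public class ClusteredTimer {

    /** The work to run every 5 minutes, e.g. submitting the MapReduce job. */
    public static class TriggerMapReduceJob implements Job {
        @Override
        public void execute(JobExecutionContext context) {
            // submit the Hadoop job here (JobClient / ToolRunner / Job.submit())
        }
    }

    public static void main(String[] args) throws Exception {
        Properties p = new Properties();
        p.put("org.quartz.scheduler.instanceName", "timerCluster");
        p.put("org.quartz.scheduler.instanceId", "AUTO");
        p.put("org.quartz.threadPool.class", "org.quartz.simpl.SimpleThreadPool");
        p.put("org.quartz.threadPool.threadCount", "3");
        // Clustering is driven by the JDBC JobStore; all nodes point at the same DB.
        p.put("org.quartz.jobStore.class", "org.quartz.impl.jdbcjobstore.JobStoreTX");
        p.put("org.quartz.jobStore.driverDelegateClass", "org.quartz.impl.jdbcjobstore.StdJDBCDelegate");
        p.put("org.quartz.jobStore.isClustered", "true");
        p.put("org.quartz.jobStore.dataSource", "myDS");
        // ... plus org.quartz.dataSource.myDS.driver/URL/user/password for the shared DB

        Scheduler scheduler = new StdSchedulerFactory(p).getScheduler();

        // Register the job once; other cluster nodes will find it in the shared store.
        if (!scheduler.checkExists(JobKey.jobKey("mr-trigger", "timers"))) {
            JobDetail job = JobBuilder.newJob(TriggerMapReduceJob.class)
                    .withIdentity("mr-trigger", "timers").build();
            Trigger trigger = TriggerBuilder.newTrigger()
                    .withIdentity("every-5-min", "timers")
                    .withSchedule(CronScheduleBuilder.cronSchedule("0 0/5 * * * ?"))
                    .build();
            scheduler.scheduleJob(job, trigger);
        }
        scheduler.start();
    }
}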


But what I would really like to have is a scheduling framework that is 
HA out-of-the-box. I guess Oozie is not the solution for me. Does anyone know 
about other frameworks?

Oozie consists of 2 services, a SQL database to store the Oozie jobs state
and a servlet container where Oozie app proper runs.

The solution for HA for the database, well, it is left to the database. This
means, you'll have to get an HA DB.
  
I would really like to avoid having to run a relational database. 
Couldn't I just persist the Oozie job state in files on HDFS?

The solution for HA for the Oozie app is deploying the servlet container
with the Oozie app in more than one box (2 or 3); and front them by a HTTP
load-balancer.

The missing part is that the current Oozie lock-service is currently an
in-memory implementation. This should be replaced with a Zookeeper
implementation. Zookeeper could run externally or internally in all Oozie
servers. This is what was prototyped long ago.
  
Yes, but if I have to do ZooKeeper stuff I could just do the scheduler 
myself and make it run on all/many boxes. The only hard part about it is 
the locking thing that makes sure only one job-triggering happens in 
the entire cluster when only one job-triggering is supposed to happen, 
and that the job-triggering happens no matter how many machines might be 
down.

Thanks.

Alejandro


On Thu, Sep 1, 2011 at 4:14 AM, Ronen Itkin ro...@taykey.com wrote:

  

If I get you right you are asking about Installing Oozie as Distributed
and/or HA cluster?!
In that case I am not familiar with an out of the box solution by Oozie.
But, I think you can made up a solution of your own, for example:
Installing Oozie on two servers on the same partition which will be
synchronized by DRBD.
You can trigger a failover using linux Heartbeat and that way maintain a
virtual IP.





On Thu, Sep 1, 2011 at 1:59 PM, Per Steffensen st...@designware.dk
wrote:



Hi

Thanks a lot for pointing me to Oozie. I have looked a little bit into
Oozie and it seems like the component triggering jobs is called
Coordinator Application. But I really see nowhere that this Coordinator
Application doesnt just run on a single machine, and that it will
  

therefore


not trigger anything if this machine is down. Can you confirm that the
Coordinator Application-role is distributed in a distribued Oozie
  

setup,


so that jobs gets triggered even if one or two machines are down?

Regards, Per Steffensen

Ronen Itkin wrote:

 Hi
  

Try to use Oozie for job coordination and work flows.



On Thu, Sep 1, 2011 at 12:30 PM, Per Steffensen st...@designware.dk
wrote:





Hi

I use hadoop for a MapReduce job in my system. I would like to have the
job
run very 5th minute. Are there any distributed timer job stuff in
hadoop?
Of course I could setup a timer in an external timer framework (CRON or
something like that) that invokes the MapReduce job. But CRON is only
running on one particular machine, so if that machine goes down my job
will
not be triggered. Then I could setup the timer on all or many machines,
but
I would not like the job to be run in more than one instance every 5th
minute, so then the timer jobs would need to coordinate who is actually
starting the job this time and all the rest would just have to do
nothing.
Guess I could come up with a solution to that - e.g. writing some
  

lock


stuff using HDFS files or by using ZooKeeper. But I would really like
  

if


someone had already solved the problem, and provided some kind of a
distributed timer framework running in a cluster, so that I could
just
register a timer job with the cluster, and then be sure that it is
invoked
every 5th minute, no matter if one or two particular machines in the
cluster
is down.

Any suggestions are very welcome.

Regards, Per Steffensen



  





  

--
*
Ronen Itkin*
Taykey | www.taykey.com




  




Re: Creating a hive table for a custom log

2011-09-01 Thread Brock Noland
Hi,

On Thu, Sep 1, 2011 at 9:08 AM, Raimon Bosch raimon.bo...@gmail.com wrote:

 Hi,

 I'm trying to create a table similar to apache_log, but I'm trying to avoid
 writing my own map-reduce task because I don't want to have my HDFS files
 stored twice.

 So if you're working with log lines like this:

 186.92.134.151 [31/Aug/2011:00:10:41 +0000] "GET
 /client/action1/?transaction_id=8002&user_id=87179311248&ts=1314749223525&item1=271&item2=6045&environment=2
 HTTP/1.1"

 112.201.65.238 [31/Aug/2011:00:10:41 +0000] "GET
 /client/action1/?transaction_id=9002&ts=1314749223525&user_id=9048871793100&item2=6045&item1=271&environment=2
 HTTP/1.1"

 90.45.198.251 [31/Aug/2011:00:10:41 +0000] "GET
 /client/action2/?transaction_id=9022&ts=1314749223525&user_id=9048871793100&item2=6045&item1=271&environment=2
 HTTP/1.1"

 And bearing in mind that the parameters could be in different orders, which
 would be the best strategy to create this table? Writing my own
 org.apache.hadoop.hive.contrib.serde2 SerDe? Is there anything already
 implemented that I could use to perform this task?

I would use the regex serde to parse them:

CREATE EXTERNAL
TABLE access_log
(ip STRING,
dt STRING,
request STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "([\\d.]+)
\\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(.+?)\"")
LOCATION '/path/to/file';

That will parse the three fields out and could be modified to separate
out the action. Then I think you will need to parse the query string
in Hive itself.
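
If the built-in string functions don't cover that last step, one illustrative option (the class and function names here are made up, and it assumes the old-style org.apache.hadoop.hive.ql.exec.UDF base class) is a small UDF that pulls a single parameter out of the request regardless of the order the parameters appear in:

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

/**
 * query_param(request, name) returns the value of one query-string parameter, e.g.
 *   query_param('GET /client/action1/?a=1&b=2 HTTP/1.1', 'b') -> '2'
 * Register with: ADD JAR ...; CREATE TEMPORARY FUNCTION query_param AS '...QueryParam';
 */
public final class QueryParam extends UDF {
    public Text evaluate(Text request, Text name) {
        if (request == null || name == null) {
            return null;
        }
        // Pick out the URL token of the request line, then everything after '?'.
        for (String token : request.toString().split("\\s+")) {
            int q = token.indexOf('?');
            if (q < 0) {
                continue;
            }
            for (String pair : token.substring(q + 1).split("&")) {
                String[] kv = pair.split("=", 2);
                if (kv[0].equals(name.toString())) {
                    return new Text(kv.length > 1 ? kv[1] : "");
                }
            }
        }
        return null;
    }
}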


 In the end the objective is to convert all the parameters into fields and use
 the action as the type. With this big table I will be able to perform my queries,
 my joins or my views.

 Any ideas?

 Thanks in Advance,
 Raimon Bosch.
 --
 View this message in context: 
 http://old.nabble.com/Creating-a-hive-table-for-a-custom-log-tp32379849p32379849.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.




Re: Timer jobs

2011-09-01 Thread Tharindu Mathew
On Thu, Sep 1, 2011 at 7:58 PM, Per Steffensen st...@designware.dk wrote:

 Thanks for your response. See comments below.

 Regards, Per Steffensen

 Alejandro Abdelnur wrote:

  [moving common-user@ to BCC]

 Oozie is not HA yet. But it would be relatively easy to make it. It was
 designed with that in mind, we even did a prototype.


 Ok, so if it isnt HA out-of-the-box I believe Oozie is too big a framework
 for my needs - I dont need all this workflow stuff - just a plain simple job
 trigger that triggers every 5th minute. I guess I will try out something
 smaller like Quartz Scheduler. It also only have HA/cluster support through
 JDBC (JobStore) but I guess I could fairly easy make a HDFSFilesJobStore
 which still hold the properties so that Quartz clustering works.

 But what I would really like to have is a scheduling framework that is HA
 out-of-the-box. Guess Oozie is not the solution for me. Anyone knows about
 other frameworks?

This is similar to my requirement. Only that I already have Quartz
scheduling my jobs and haven't started using Hadoop yet. I plan to wrap
Quartz jobs to internally call Hadoop jobs. I'm still in the design phase
though. Hopefully, it will be successful.


  Oozie consists of 2 services, a SQL database to store the Oozie jobs state
 and a servlet container where Oozie app proper runs.

 The solution for HA for the database, well, it is left to the database.
 This
 means, you'll have to get an HA DB.


 I would really like to avoid having to run a relational database. Couldnt I
 just do the persistence of Oozie jobs state in files on HDFS?

  The solution for HA for the Oozie app is deploying the servlet container
 with the Oozie app in more than one box (2 or 3); and front them by a HTTP
 load-balancer.

 The missing part is that the current Oozie lock-service is currently an
 in-memory implementation. This should be replaced with a Zookeeper
 implementation. Zookeeper could run externally or internally in all Oozie
 servers. This is what was prototyped long ago.


 Yes but if I have to do ZooKeeper stuff I could just do the scheduler
 myself and make run no all/many boxes. The only hard part about it is the
 locking thing that makes sure only one job-triggering happens in the
 entire cluster when only one job-triggering is supposed to happen, and that
 the job-triggering happens no matter how many machines might be down.

  Thanks.

 Alejandro


 On Thu, Sep 1, 2011 at 4:14 AM, Ronen Itkin ro...@taykey.com wrote:



 If I get you right you are asking about Installing Oozie as Distributed
 and/or HA cluster?!
 In that case I am not familiar with an out of the box solution by Oozie.
 But, I think you can made up a solution of your own, for example:
 Installing Oozie on two servers on the same partition which will be
 synchronized by DRBD.
 You can trigger a failover using linux Heartbeat and that way maintain
 a
 virtual IP.





 On Thu, Sep 1, 2011 at 1:59 PM, Per Steffensen st...@designware.dk
 wrote:



 Hi

 Thanks a lot for pointing me to Oozie. I have looked a little bit into
 Oozie and it seems like the component triggering jobs is called
 Coordinator Application. But I really see nowhere that this
 Coordinator
 Application doesnt just run on a single machine, and that it will


 therefore


 not trigger anything if this machine is down. Can you confirm that the
 Coordinator Application-role is distributed in a distribued Oozie


 setup,


 so that jobs gets triggered even if one or two machines are down?

 Regards, Per Steffensen

 Ronen Itkin wrote:

  Hi


 Try to use Oozie for job coordination and work flows.



 On Thu, Sep 1, 2011 at 12:30 PM, Per Steffensen st...@designware.dk
 wrote:





 Hi

 I use hadoop for a MapReduce job in my system. I would like to have
 the
 job
 run very 5th minute. Are there any distributed timer job stuff in
 hadoop?
 Of course I could setup a timer in an external timer framework (CRON
 or
 something like that) that invokes the MapReduce job. But CRON is only
 running on one particular machine, so if that machine goes down my job
 will
 not be triggered. Then I could setup the timer on all or many
 machines,
 but
 I would not like the job to be run in more than one instance every 5th
 minute, so then the timer jobs would need to coordinate who is
 actually
 starting the job this time and all the rest would just have to do
 nothing.
 Guess I could come up with a solution to that - e.g. writing some


 lock


 stuff using HDFS files or by using ZooKeeper. But I would really like


 if


 someone had already solved the problem, and provided some kind of a
 distributed timer framework running in a cluster, so that I could
 just
 register a timer job with the cluster, and then be sure that it is
 invoked
 every 5th minute, no matter if one or two particular machines in the
 cluster
 is down.

 Any suggestions are very welcome.

 Regards, Per Steffensen












 --
 *
 Ronen Itkin*
 Taykey | www.taykey.com










-- 

Re: Timer jobs

2011-09-01 Thread Per Steffensen
Well I am not sure I get you right, but anyway, basically I want a timer 
framework that triggers my jobs. And the triggering of the jobs needs to 
work even though one or two particular machines go down. So the timer 
triggering mechanism has to live in the cluster, so to speak. What I 
don't want is that the timer framework is driven from one particular 
machine, so that the triggering of jobs will not happen if this 
particular machine goes down. Basically, if I have e.g. 10 machines in a 
Hadoop cluster I will be able to run e.g. MapReduce jobs even if 3 of 
the 10 machines are down. I want my timer framework to also be 
clustered, distributed and coordinated, so that I will also have my 
timer jobs triggered even though 3 out of 10 machines are down.


Regards, Per Steffensen

Ronen Itkin wrote:

If I get you right you are asking about Installing Oozie as Distributed
and/or HA cluster?!
In that case I am not familiar with an out of the box solution by Oozie.
But, I think you can made up a solution of your own, for example:
Installing Oozie on two servers on the same partition which will be
synchronized by DRBD.
You can trigger a failover using linux Heartbeat and that way maintain a
virtual IP.





On Thu, Sep 1, 2011 at 1:59 PM, Per Steffensen st...@designware.dk wrote:

  

Hi

Thanks a lot for pointing me to Oozie. I have looked a little bit into
Oozie and it seems like the component triggering jobs is called
Coordinator Application. But I really see nowhere that this Coordinator
Application doesnt just run on a single machine, and that it will therefore
not trigger anything if this machine is down. Can you confirm that the
Coordinator Application-role is distributed in a distribued Oozie setup,
so that jobs gets triggered even if one or two machines are down?

Regards, Per Steffensen

Ronen Itkin wrote:

 Hi


Try to use Oozie for job coordination and work flows.



On Thu, Sep 1, 2011 at 12:30 PM, Per Steffensen st...@designware.dk
wrote:



  

Hi

I use hadoop for a MapReduce job in my system. I would like to have the
job
run very 5th minute. Are there any distributed timer job stuff in
hadoop?
Of course I could setup a timer in an external timer framework (CRON or
something like that) that invokes the MapReduce job. But CRON is only
running on one particular machine, so if that machine goes down my job
will
not be triggered. Then I could setup the timer on all or many machines,
but
I would not like the job to be run in more than one instance every 5th
minute, so then the timer jobs would need to coordinate who is actually
starting the job this time and all the rest would just have to do
nothing.
Guess I could come up with a solution to that - e.g. writing some lock
stuff using HDFS files or by using ZooKeeper. But I would really like if
someone had already solved the problem, and provided some kind of a
distributed timer framework running in a cluster, so that I could
just
register a timer job with the cluster, and then be sure that it is
invoked
every 5th minute, no matter if one or two particular machines in the
cluster
is down.

Any suggestions are very welcome.

Regards, Per Steffensen








  




  




Re: Binary content

2011-09-01 Thread Mohit Anchlia
On Thu, Sep 1, 2011 at 1:25 AM, Dieter Plaetinck
dieter.plaeti...@intec.ugent.be wrote:
 On Wed, 31 Aug 2011 08:44:42 -0700
 Mohit Anchlia mohitanch...@gmail.com wrote:

 Does map-reduce work well with binary content in the files? This
 binary content is basically some CAD files, and the map-reduce program needs
 to read these files using some proprietary tool, extract values and do
 some processing. Wondering if there are others doing a similar type of
 processing. Best practices etc.

 Yes, it works. You just need to select the right input format.
 Personally I store all my binary files in a SequenceFile (because my binary 
 files are small).

Thanks! Is there a specific tutorial I can focus on to see how it could be done?

 Dieter



Re: Timer jobs

2011-09-01 Thread Tharindu Mathew
In Hadoop, if the client that triggers the job fails, is there a way to
recover and have another client submit the job?

On Thu, Sep 1, 2011 at 8:44 PM, Per Steffensen st...@designware.dk wrote:

 Well I am not sure I get you right, but anyway, basically I want a timer
 framework that triggers my jobs. And the triggering of the jobs need to work
 even though one or two particular machines goes down. So the timer
 triggering mechanism has to live in the cluster, so to speak. What I dont
 want is that the timer framework are driven from one particular machine, so
 that the triggering of jobs will not happen if this particular machine goes
 down. Basically if I have e.g. 10 machines in a Hadoop cluster I will be
 able to run e.g. MapReduce jobs even if 3 of the 10 machines are down. I
 want my timer framework to also be clustered, distributed and coordinated,
 so that I will also have my timer jobs triggered even though 3 out of 10
 machines are down.


 Regards, Per Steffensen

 Ronen Itkin wrote:

 If I get you right you are asking about Installing Oozie as Distributed
 and/or HA cluster?!
 In that case I am not familiar with an out of the box solution by Oozie.
 But, I think you can made up a solution of your own, for example:
 Installing Oozie on two servers on the same partition which will be
 synchronized by DRBD.
 You can trigger a failover using linux Heartbeat and that way maintain a
 virtual IP.





 On Thu, Sep 1, 2011 at 1:59 PM, Per Steffensen st...@designware.dk
 wrote:



 Hi

 Thanks a lot for pointing me to Oozie. I have looked a little bit into
 Oozie and it seems like the component triggering jobs is called
 Coordinator Application. But I really see nowhere that this Coordinator
 Application doesnt just run on a single machine, and that it will
 therefore
 not trigger anything if this machine is down. Can you confirm that the
 Coordinator Application-role is distributed in a distribued Oozie
 setup,
 so that jobs gets triggered even if one or two machines are down?

 Regards, Per Steffensen

 Ronen Itkin wrote:

  Hi


 Try to use Oozie for job coordination and work flows.



 On Thu, Sep 1, 2011 at 12:30 PM, Per Steffensen st...@designware.dk
 wrote:





 Hi

 I use hadoop for a MapReduce job in my system. I would like to have the
 job
 run very 5th minute. Are there any distributed timer job stuff in
 hadoop?
 Of course I could setup a timer in an external timer framework (CRON or
 something like that) that invokes the MapReduce job. But CRON is only
 running on one particular machine, so if that machine goes down my job
 will
 not be triggered. Then I could setup the timer on all or many machines,
 but
 I would not like the job to be run in more than one instance every 5th
 minute, so then the timer jobs would need to coordinate who is actually
 starting the job this time and all the rest would just have to do
 nothing.
 Guess I could come up with a solution to that - e.g. writing some
 lock
 stuff using HDFS files or by using ZooKeeper. But I would really like
 if
 someone had already solved the problem, and provided some kind of a
 distributed timer framework running in a cluster, so that I could
 just
 register a timer job with the cluster, and then be sure that it is
 invoked
 every 5th minute, no matter if one or two particular machines in the
 cluster
 is down.

 Any suggestions are very welcome.

 Regards, Per Steffensen




















-- 
Regards,

Tharindu


Re: Binary content

2011-09-01 Thread Owen O'Malley
On Thu, Sep 1, 2011 at 8:37 AM, Mohit Anchlia mohitanch...@gmail.comwrote:

Thanks! Is there a specific tutorial I can focus on to see how it could be
 done?


Take the word count example and change its output format to be
SequenceFileOutputFormat.

job.setOutputFormatClass(SequenceFileOutputFormat.class);

and it will generate SequenceFiles instead of text. There is
SequenceFileInputFormat for reading.
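
For reference, a rough sketch of that wiring with the new (org.apache.hadoop.mapreduce) API; the paths and the separate reading job are illustrative, and the word count mapper and reducer are omitted since they are unchanged:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class SequenceFileJobSetup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Producing job: same word count mapper/reducer, but binary output.
        Job producer = new Job(conf, "wordcount-seq");
        producer.setJarByClass(SequenceFileJobSetup.class);
        // ... setMapperClass / setReducerClass as in the word count example ...
        producer.setOutputKeyClass(Text.class);
        producer.setOutputValueClass(IntWritable.class);
        producer.setOutputFormatClass(SequenceFileOutputFormat.class);
        SequenceFileOutputFormat.setOutputPath(producer, new Path("/out/counts"));

        // A downstream job reads the same data back as binary key/value pairs.
        Job consumer = new Job(conf, "read-seq");
        consumer.setInputFormatClass(SequenceFileInputFormat.class);
        SequenceFileInputFormat.addInputPath(consumer, new Path("/out/counts"));
    }
}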

-- Owen


Re: Timer jobs

2011-09-01 Thread Vitalii Tymchyshyn

01.09.11 18:14, Per Steffensen wrote:
Well I am not sure I get you right, but anyway, basically I want a 
timer framework that triggers my jobs. And the triggering of the jobs 
need to work even though one or two particular machines goes down. So 
the timer triggering mechanism has to live in the cluster, so to 
speak. What I dont want is that the timer framework are driven from 
one particular machine, so that the triggering of jobs will not happen 
if this particular machine goes down. Basically if I have e.g. 10 
machines in a Hadoop cluster I will be able to run e.g. MapReduce jobs 
even if 3 of the 10 machines are down. I want my timer framework to 
also be clustered, distributed and coordinated, so that I will also 
have my timer jobs triggered even though 3 out of 10 machines are down.

Hello.

AFAIK now you still have the HDFS NameNode, and as soon as the NameNode is down, 
your cluster is down. So putting scheduling on the same machine as the 
NameNode won't make your cluster worse in terms of SPOF (at least for HW 
failures).


Best regards, Vitalii Tymchyshyn


cross product of 2 data sets

2011-09-01 Thread Marc Sturlese
Hey there,
I would like to do the cross product of two data sets, neither of which fits in
memory. I've seen Pig has the CROSS operation. Can someone please explain to me
how it implements it?

--
View this message in context: 
http://lucene.472066.n3.nabble.com/cross-product-of-2-data-sets-tp3302160p3302160.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.


Re: cross product of 2 data sets

2011-09-01 Thread Alan Gates
http://ofps.oreilly.com/titles/9781449302641/advanced_pig_latin.html
search on cross matches

Alan.

On Sep 1, 2011, at 11:44 AM, Marc Sturlese wrote:

 Hey there,
 I would like to do the cross product of two data sets, any of them feeds in
 memory. I've seen pig has the cross operation. Can someone please explain me
 how it implements it?
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/cross-product-of-2-data-sets-tp3302160p3302160.html
 Sent from the Hadoop lucene-users mailing list archive at Nabble.com.



Re: Timer jobs

2011-09-01 Thread Per Steffensen

Vitalii Tymchyshyn wrote:

01.09.11 18:14, Per Steffensen wrote:
Well I am not sure I get you right, but anyway, basically I want a 
timer framework that triggers my jobs. And the triggering of the jobs 
need to work even though one or two particular machines goes down. So 
the timer triggering mechanism has to live in the cluster, so to 
speak. What I dont want is that the timer framework are driven from 
one particular machine, so that the triggering of jobs will not 
happen if this particular machine goes down. Basically if I have e.g. 
10 machines in a Hadoop cluster I will be able to run e.g. MapReduce 
jobs even if 3 of the 10 machines are down. I want my timer framework 
to also be clustered, distributed and coordinated, so that I will 
also have my timer jobs triggered even though 3 out of 10 machines 
are down.

Hello.

AFAIK now you still have HDFS NameNode and as soon as NameNode is down 
- your cluster is down. So, putting scheduling on the same machine as 
NameNode won't make you cluster worse in terms of SPOF (at least for 
HW failures).


Best regards, Vitalii Tymchyshyn


I believe this is why there is also a secondary namenode. But with two 
namenodes it is still too centralized in my opinion, but I guess Hadoop 
people know that, and that the namenode role will be even more 
distributed in the future. But that does not change the fact that I 
would like to have a real distributed clustered scheduler.


MultipleOutputs - Create multiple files during output

2011-09-01 Thread modemide
Hi all,
I was wondering if anyone was familiar with this class.  I want to
create multiple output files during my reduce.

My input files will consist of
name1  action1  date1
name1  action2  date2
name1  action3  date3

name2  action1  date1
name2  action2  date2
name2  action3  date3


My goal is to create files with the following format
Filename:
name_Date:CCYYMM

File Contents:
action1
action2
action3


I.e. This will store all the actions of one person for any given month
in one file.

I just don't know how I will decide the file name at run time.  Can anyone help?

Thanks,
Tim


Namenode not starting

2011-09-01 Thread abhishek sharma
Hi all,

I am trying to install Hadoop (release 0.20.203) on a machine with CentOS.

When I try to start HDFS, I get the following error.

machine-name: Unrecognized option: -jvm
machine-name: Could not create the Java virtual machine.

Any idea what might be the problem?

Thanks,
Abhishek


Re: Namenode not starting

2011-09-01 Thread abhishek sharma
Hi Hailong,

I have installed JDK and set JAVA_HOME correctly (as far as I know).

Output of java -version is:
java version "1.6.0_04"
Java(TM) SE Runtime Environment (build 1.6.0_04-b12)
Java HotSpot(TM) Server VM (build 10.0-b19, mixed mode)

I also have another version installed 1.6.0_27 but get same error with it.

Abhishek

On Thu, Sep 1, 2011 at 4:00 PM, hailong.yang1115
hailong.yang1...@gmail.com wrote:
 Hi abhishek,

 Have you successfully installed a Java virtual machine like the Sun JDK before 
 running Hadoop? Or maybe you forgot to configure the environment variable 
 JAVA_HOME? What is the output of the command 'java -version'?

 Regards

 Hailong




 ***
 * Hailong Yang, PhD. Candidate
 * Sino-German Joint Software Institute,
 * School of Computer Science & Engineering, Beihang University
 * Phone: (86-010)82315908
 * Email: hailong.yang1...@gmail.com
 * Address: G413, New Main Building in Beihang University,
 *              No.37 XueYuan Road,HaiDian District,
 *              Beijing,P.R.China,100191
 ***

 From: abhishek sharma
 Date: 2011-09-02 03:51
 To: common-user; common-dev
 Subject: Namenode not starting
 Hi all,

 I am trying to install Hadoop (release 0.20.203) on a machine with CentOS.

 When I try to start HDFS, I get the following error.

 machine-name: Unrecognized option: -jvm
 machine-name: Could not create the Java virtual machine.

 Any idea what might be the problem?

 Thanks,
 Abhishek


Re: Namenode not starting

2011-09-01 Thread abhishek sharma
Actually, I found the reason. I am running HDFS as root and there is
a bug that has recently been fixed.

https://issues.apache.org/jira/browse/HDFS-1943

Thanks,
Abhishek

On Thu, Sep 1, 2011 at 6:25 PM, Ravi Prakash ravihad...@gmail.com wrote:
 Hi Abhishek,

 Try reading through the shell scripts before posting. They are short and
 simple enough and you should be able to debug them quite easily. I've seen
 the same error many times.

 Do you see JAVA_HOME set when you $ssh localhost?

 Also which command are you using to start the daemons?

 Fight on,
 Ravi

 On Thu, Sep 1, 2011 at 4:35 PM, abhishek sharma absha...@usc.edu wrote:

 Hi Hailong,

 I have installed JDK and set JAVA_HOME correctly (as far as I know).

 Output of java -version is:
 java version 1.6.0_04
 Java(TM) SE Runtime Environment (build 1.6.0_04-b12)
 Java HotSpot(TM) Server VM (build 10.0-b19, mixed mode)

 I also have another version installed 1.6.0_27 but get same error with
 it.

 Abhishek

 On Thu, Sep 1, 2011 at 4:00 PM, hailong.yang1115
 hailong.yang1...@gmail.com wrote:
  Hi abhishek,
 
  Have you successfully installed java virtual machine like sun JDK before
 running Hadoop? Or maybe you forget to configure the environment variable
 JAVA_HOME? What is the output of the command 'java -version'?
 
  Regards
 
  Hailong
 
 
 
 
  ***
  * Hailong Yang, PhD. Candidate
  * Sino-German Joint Software Institute,
   * School of Computer Science & Engineering, Beihang University
  * Phone: (86-010)82315908
  * Email: hailong.yang1...@gmail.com
  * Address: G413, New Main Building in Beihang University,
  *              No.37 XueYuan Road,HaiDian District,
  *              Beijing,P.R.China,100191
  ***
 
  From: abhishek sharma
  Date: 2011-09-02 03:51
  To: common-user; common-dev
  Subject: Namenode not starting
  Hi all,
 
  I am trying to install Hadoop (release 0.20.203) on a machine with
 CentOS.
 
  When I try to start HDFS, I get the following error.
 
  machine-name: Unrecognized option: -jvm
  machine-name: Could not create the Java virtual machine.
 
  Any idea what might be the problem?
 
  Thanks,
  Abhishek




Re: TestDFSIO failure

2011-09-01 Thread Ken Krugler
Hi Matt,

On Jun 20, 2011, at 1:46pm, GOEKE, MATTHEW (AG/1000) wrote:

 Has anyone else run into issues using output compression (in our case lzo) on 
 TestDFSIO and it failing to be able to read the metrics file? I just assumed 
 that it would use the correct decompression codec after it finishes but it 
 always returns with a 'File not found' exception.

Yes, I've run into the same issue on 0.20.2 and CDH3u0.

I don't see any Jira issue that covers this problem, so unless I hear otherwise 
I'll file one.

The problem is that the post-job code doesn't handle getting the <path>.deflate 
or <path>.lzo (for you) file from HDFS, and then decompressing it.

 Is there a simple way around this without spending the time to recompile a 
 cluster/codec specific version?


You can use hadoop fs -text <path reported in exception>.lzo

This will dump out the file, which looks like:

f:rate  171455.11
f:sqrate2981174.8
l:size  1048576
l:tasks 10
l:time  590537

If you take f:rate/1000/l:tasks, that should give you the average MB/sec.

E.g. for the example above, that would be 171455/1000/10 = 17MB/sec.

-- Ken

--
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr





Re: MultipleOutputs - Create multiple files during output

2011-09-01 Thread Stan Rosenberg
Hi Tim,

You could create a custom HashPartitioner so that all key,value pairs
denoting the actions of the same user end up in the same reducer; then you
need only one output file per reducer.  Btw, how large are the output files?
Make sure you don't end up creating a lot of small files, i.e., < 64MB.
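
To complement that, here is a rough sketch of how the file name itself can be decided at run time with the new-API MultipleOutputs (org.apache.hadoop.mapreduce.lib.output), which takes an arbitrary base output path per write. It assumes a release that ships that class (older ones only have the mapred.lib variant with pre-registered named outputs), a tab-separated "action<TAB>timestampMillis" value layout, and a name_CCYYMM naming since ':' is awkward in HDFS paths:

import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

/** Key = name, value = "action<TAB>timestampMillis" (an assumed layout). */
public class ActionByMonthReducer extends Reducer<Text, Text, NullWritable, Text> {
    private MultipleOutputs<NullWritable, Text> mos;
    private final SimpleDateFormat month = new SimpleDateFormat("yyyyMM");

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<NullWritable, Text>(context);
    }

    @Override
    protected void reduce(Text name, Iterable<Text> actions, Context context)
            throws IOException, InterruptedException {
        for (Text value : actions) {
            String[] parts = value.toString().split("\t");
            String ccyymm = month.format(new Date(Long.parseLong(parts[1])));
            // The third argument is the base output path, chosen per record at run
            // time, which yields files such as name1_201108-r-00000.
            mos.write(NullWritable.get(), new Text(parts[0]), name + "_" + ccyymm);
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}

Combined with the partitioner suggested above, all of one user's monthly files then come from a single reducer.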

Best,

stan

On Thu, Sep 1, 2011 at 3:47 PM, modemide modem...@gmail.com wrote:

 Hi all,
 I was wondering if anyone was familiar with this class.  I want to
 create multiple output files during my reduce.

 My input files will consist of
 name1action1date1
 name1action2date2
 name1action3date3

 name2action1date1
 name2action2date2
 name2action3date3


 My goal is to create files with the following format
 Filename:
 name_Date:CCYYMM

 File Contents:
 action1
 action2
 action3


 I.e. This will store all the actions of one person for any given month
 in one file.

 I just don't know how I will decide the file name at run time.  Can anyone
 help?

 Thanks,
 Tim