Long running Yarn Applications on a secured HA cluster?

2016-01-28 Thread Niels Basjes
Hi,

I'm working on a project that uses Apache Flink (stream processing) on top
of a secured HA Yarn cluster.
The test application I've been testing with just uses HBase (it writes the
current time in a column every minute).

The problem I have is that after 173.5 hours (exactly) my application dies.
The best assessment we have right now is that the Hadoop Delegation Tokens
are expiring.
I know for sure the Kerberos tickets are correctly renewed/recreated in the
cluster using my keytab file because I had our IT-ops guys drop the max
ticket life to 5 minutes and the max renew to 10 minutes.
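
A minimal sketch of the keytab login/relogin mechanism we rely on (the principal and
keytab path below are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class KeytabLogin {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("hadoop.security.authentication", "kerberos");
    UserGroupInformation.setConfiguration(conf);

    // Hypothetical principal and keytab path.
    UserGroupInformation.loginUserFromKeytab(
        "nbasjes@EXAMPLE.COM", "/etc/security/keytabs/nbasjes.keytab");

    // Called periodically: obtains a fresh TGT from the keytab when needed.
    UserGroupInformation.getLoginUser().checkTGTAndReloginFromKeytab();
  }
}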

Following what we found on these two web pages we changed those settings
(the current settings that seem relevant are posted below).

http://www.cloudera.com/documentation/enterprise/5-3-x/topics/cm_sg_yarn_long_jobs.html

https://forge.puppetlabs.com/cesnet/hadoop/2.1.0#long-running-applications

Yet this has not changed the situation: the job still dies after 173.5
hours with this exception.

15:47:55,283 INFO  org.apache.flink.yarn.YarnJobManager
  - Status of job 2e4a3516d8e4876b705eaff4a52fc272 (Long
running Flink application) changed to FAILING.
java.lang.Exception: Serialized representation of
org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException:
Failed 1 action: FailedServerException: 1 time,
at 
org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.makeException(AsyncProcess.java:224)
at 
org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.access$1700(AsyncProcess.java:204)
at 
org.apache.hadoop.hbase.client.AsyncProcess.waitForAllPreviousOpsAndReset(AsyncProcess.java:1597)
at 
org.apache.hadoop.hbase.client.HTable.backgroundFlushCommits(HTable.java:1069)
at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1344)
at org.apache.hadoop.hbase.client.HTable.put(HTable.java:1001)
at 
nl.basjes.flink.experiments.SetHBaseRowSink.invoke(SetHBaseRowSink.java:58)


My colleagues and I did some searching, and these two reports seem to
describe a problem similar to what we see (just about HDFS instead of
HBase):

Failed to Update HDFS Delegation Token for long running application in HA
mode
   https://issues.apache.org/jira/browse/HDFS-9276
HDFS Delegation Token will be expired when calling
"UserGroupInformation.getCurrentUser.addCredentials" in HA mode
   https://issues.apache.org/jira/browse/SPARK-11182


My question to you guys is simply put: How do I fix this problem?
How do I figure out what the problem really is?
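
One thing that has helped me see what is actually in play is dumping the delegation
tokens of the current user, including their maximum lifetime. A minimal sketch (run
from inside the application so it sees the same UGI):

import java.util.Date;

import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.security.token.Token;
import org.apache.hadoop.security.token.TokenIdentifier;
import org.apache.hadoop.security.token.delegation.AbstractDelegationTokenIdentifier;

public class DumpTokens {
  public static void main(String[] args) throws Exception {
    UserGroupInformation ugi = UserGroupInformation.getCurrentUser();
    for (Token<? extends TokenIdentifier> token : ugi.getCredentials().getAllTokens()) {
      System.out.println("Token kind=" + token.getKind() + " service=" + token.getService());
      TokenIdentifier id = token.decodeIdentifier();
      if (id instanceof AbstractDelegationTokenIdentifier) {
        AbstractDelegationTokenIdentifier dt = (AbstractDelegationTokenIdentifier) id;
        // maxDate is the point in time after which the token can no longer be renewed.
        System.out.println("  issued=" + new Date(dt.getIssueDate())
            + " maxDate=" + new Date(dt.getMaxDate()));
      }
    }
  }
}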

Thanks for any suggestions you have for us.



yarn.resourcemanager.proxy-user-privileges.enabled = true   (yarn-site.xml)

dfs.namenode.delegation.token.renew-interval = 8640   (hdfs-default.xml)

dfs.namenode.delegation.key.update-interval = 8640   (hdfs-default.xml)

dfs.namenode.delegation.token.max-lifetime = 60480   (hdfs-default.xml)

yarn.resourcemanager.webapp.delegation-token-auth-filter.enabled = true   (yarn-default.xml)

yarn.resourcemanager.delayed.delegation-token.removal-interval-ms = 3   (yarn-default.xml)

hadoop.proxyuser.hbase.hosts = *   (core-site.xml)

hadoop.proxyuser.hbase.groups = *   (core-site.xml)

hadoop.proxyuser.yarn.hosts = *   (core-site.xml)

hadoop.proxyuser.yarn.groups = *   (core-site.xml)

-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


Flink job on secure Yarn fails after many hours

2015-12-02 Thread Niels Basjes
Hi,


We have a Kerberos secured Yarn cluster here and I'm experimenting
with Apache Flink on top of that.

A few days ago I started a very simple Flink application (just stream
the time as a String into HBase 10 times per second).

I (deliberately) asked our IT-ops guys to make my account have a max
ticket time of 5 minutes and a max renew time of 10 minutes (yes,
ridiculously low timeout values because I needed to validate this
https://issues.apache.org/jira/browse/FLINK-2977).

This job was started with a keytab file and, after running for 31 hours,
it suddenly failed with the exception you see below.

I had the same job running for almost 400 hours until that failed too
(I was too late to check the logfiles but I suspect the same problem).


So in that time span my tickets have expired and new tickets have been
obtained several hundred times.


The main thing I see is that, in the process of a ticket expiring and
being renewed, this message appears:

 *Not retrying because the invoked method is not idempotent, and
unable to determine whether it was invoked*


Yarn on the cluster is 2.6.0 ( HDP 2.6.0.2.2.4.2-2 )

Flink is version 0.10.1


How do I fix this?
Is this a bug (in either Hadoop or Flink) or am I doing something wrong?
Would upgrading Yarn to 2.7.1 (i.e. HDP 2.3) fix this?


Niels Basjes



21:30:27,821 WARN  org.apache.hadoop.security.UserGroupInformation
  - PriviledgedActionException as:nbasjes (auth:SIMPLE)
cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
Invalid AMRMToken from appattempt_1443166961758_163901_01
21:30:27,861 WARN  org.apache.hadoop.ipc.Client
  - Exception encountered while connecting to the server :
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
Invalid AMRMToken from appattempt_1443166961758_163901_01
21:30:27,861 WARN  org.apache.hadoop.security.UserGroupInformation
  - PriviledgedActionException as:nbasjes (auth:SIMPLE)
cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
Invalid AMRMToken from appattempt_1443166961758_163901_01
21:30:27,891 WARN  org.apache.hadoop.io.retry.RetryInvocationHandler -
Exception while invoking class
org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate.
Not retrying because the invoked method is not idempotent, and unable
to determine whether it was invoked
org.apache.hadoop.security.token.SecretManager$InvalidToken: Invalid
AMRMToken from appattempt_1443166961758_163901_01
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at 
org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
at 
org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:104)
at 
org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:79)
at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy14.allocate(Unknown Source)
at 
org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:245)
at 
org.apache.flink.yarn.YarnJobManager$$anonfun$handleYarnMessage$1.applyOrElse(YarnJobManager.scala:259)
at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162)
at 
org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)
at 
scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
at 
scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
at 
scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
at 
org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
at 
org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
at 
org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465

Re: Not able to run more than one map task

2015-04-10 Thread Niels Basjes
Just curious: what is the input for your job ? If it is a single gzipped
file then that is the cause of getting exactly 1 mapper.

Niels

On Fri, Apr 10, 2015, 09:21 Amit Kumar amiti...@msn.com wrote:

 Thanks a lot Harsha for replying

 This problem has wasted at least the last week for us.

 We tried what you suggested. Could you please take a look at the
 configuration and suggest if we missed anything?

 System RAM : 8GB
 CPU : 4 threads each with 2 cores.
 # Disks : 1

 MR2:

 mapreduce.map.memory.mb : 512
 mapreduce.tasktracker.map.tasks.maximum : 4

 Yarn:

 yarn.app.mapreduce.am.resource.mb : 512
 yarn.nodemanager.resource.cpu-vcores : 4
 yarn.scheduler.minimum-allocation-mb : 512
 yarn.nodemanager.resource.memory-mb : 5080


 Regards,
 Amit



  From: ha...@cloudera.com
  Date: Fri, 10 Apr 2015 10:20:24 +0530
  Subject: Re: Not able to run more than one map task
  To: user@hadoop.apache.org

 
  You are likely memory/vcore starved in the NM's configs. Increase your
  yarn.nodemanager.resource.memory-mb and
  yarn.nodemanager.resource.cpu-vcores configs, or consider lowering the
  MR job memory request values to gain more parallelism.
 
  On Thu, Apr 9, 2015 at 5:05 PM, Amit Kumar amiti...@msn.com wrote:
   Hi All,
  
   We recently started working on Hadoop. We have set up Hadoop in
 pseudo-distributed mode along with Oozie.
  
   Every developer has set it up on his laptop. The problem is that we
 are not
   able to run more than one map task concurrently on our laptops.
 Resource
   manager is not allowing more than one task on our machine.
  
   My task gets completed if I submit it without Oozie. Oozie requires
 one map
   task for its own functioning. Actual task that oozie submit does not
 start.
  
   Here is my configuration
  
   -- Hadoop setup in Pseudo distribution mode
   -- Hadoop Version - 2.6
   -- Oozie Version - 4.0.1
  
   Regards,
   Amit
 
 
 
  --
  Harsh J



Re: way to add custom udf jar in hadoop 2.x version

2015-01-04 Thread Niels Basjes
Hi,

These options:
- HIVE_HOME/auxlib
- http://stackoverflow.com/questions/14032924/how-to-add-serde-jar
- ADD JAR commands in your $HOME/.hiverc file

either require IT operations to put my JAR on all nodes, or mean I cannot share
it: it only works on the commandline and it won't work in HUE/Beeswax.

Now Permanent Functions:
- https://issues.apache.org/jira/browse/HIVE-6047
-
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-PermanentFunctions

What these Permanent Functions do is:
1) put the jar on the cluster without IT operations putting the jar on all
nodes
2) the jar is used transparently by everyone who wants to use the function.
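
For comparison, registering such a permanent function from a JDBC client looks
roughly like this (a sketch; the HiveServer2 host, database, class name and jar
location are all made up):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreatePermanentFunction {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    Connection conn =
        DriverManager.getConnection("jdbc:hive2://hiveserver:10000/default", "niels", "");
    Statement stmt = conn.createStatement();
    // The jar lives on HDFS, so no action by IT operations on the individual nodes
    // is needed, and everyone who calls the function gets the jar transparently.
    stmt.execute("CREATE FUNCTION mydb.parse_useragent "
        + "AS 'nl.basjes.example.ParseUserAgentUDF' "          // hypothetical class name
        + "USING JAR 'hdfs:///apps/hive/udfs/logparser.jar'");  // hypothetical jar location
    stmt.close();
    conn.close();
  }
}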

I am writing a deserializer [1] (Not finished yet:
https://github.com/nielsbasjes/logparser/blob/master/README-Hive.md) that
should make existing files query-able as an external table in Hive.

Question is: Is there something similar for CREATE EXTERNAL TABLE ??

Something like

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
...
STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]
[USING JAR|FILE|ARCHIVE 'file_uri' [, JAR|FILE|ARCHIVE 'file_uri'] ];


Is this something for which there is already a JIRA (couldn't find it)?
If not; Should I create one? (I.e. do you think this would make sense for
others?)

Niels Basjes


On Fri, Jan 2, 2015 at 9:00 PM, Yakubovich, Alexey 
alexey.yakubov...@searshc.com wrote:

  Try to look here:
 http://stackoverflow.com/questions/14032924/how-to-add-serde-jar

  Another advice: insert your ADD JAR commands in your $HOME/.hiverc file
 and start hive. (
 http://mail-archives.apache.org/mod_mbox/hive-user/201303.mbox/%3ccamgr+0h3smdw4zhtpyo5b1b4iob05bpw8ls+daeh595qzid...@mail.gmail.com%3E
 )



   From: Ted Yu yuzhih...@gmail.com
 Reply-To: user@hadoop.apache.org user@hadoop.apache.org
 Date: Wednesday, December 31, 2014 at 8:25 AM
 To: d...@hive.apache.org d...@hive.apache.org
 Subject: Fwd: way to add custom udf jar in hadoop 2.x version

   Forwarding Niels' question to hive mailing list.

 On Wed, Dec 31, 2014 at 1:24 AM, Niels Basjes ni...@basjes.nl wrote:

 Thanks for the pointer.
 This seems to work for functions. Is there something similar for CREATE
 EXTERNAL TABLE ??

 Niels
   On Dec 31, 2014 8:13 AM, Ted Yu yuzhih...@gmail.com wrote:

  Have you seen this thread ?

 http://search-hadoop.com/m/8er9TcALc/Hive+udf+custom+jarsubj=Best+way+to+add+custom+UDF+jar+in+HiveServer2

 On Dec 30, 2014, at 10:56 PM, reena upadhyay reena2...@gmail.com
 wrote:

   Hi,

  I am using hadoop 2.4.0 version. I have created custom udf jar. I am
 trying to execute a simple select udf query using java hive jdbc client
 program. When hive execute the query using map reduce job, then the query
 execution get fails because the mapper is not able to locate the udf class.
 So I wanted to add the udf jar in hadoop environment permanently. Please
 suggest me a way to add this external jar for single node and multi node
 hadoop cluster.

  PS: I am using hive 0.13.1 version and I already have this custom udf
 jar added in HIVE_HOME/lib directory.


  Thanks






-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


Re: way to add custom udf jar in hadoop 2.x version

2015-01-04 Thread Niels Basjes
I created https://issues.apache.org/jira/browse/HIVE-9252 for this
improvement.

On Sun, Jan 4, 2015 at 5:16 PM, Niels Basjes ni...@basjes.nl wrote:

 Hi,

 These options:
 - HIVE_HOME/auxlib
 - http://stackoverflow.com/questions/14032924/how-to-add-serde-jar
 - ADD JAR commands in your $HOME/.hiverc file

 either require IT operations to put my JAR on all nodes, or mean I cannot share
 it: it only works on the commandline and it won't work in HUE/Beeswax.

 Now Permanent Functions:
 - https://issues.apache.org/jira/browse/HIVE-6047
 -
 https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-PermanentFunctions

 What these Permanent Functions do is:
 1) put the jar on the cluster without IT operations putting the jar on all
 nodes
 2) the jar is used transparently by everyone who wants to use the function.

 I am writing a deserializer [1] (Not finished yet:
 https://github.com/nielsbasjes/logparser/blob/master/README-Hive.md) that
 should make existing files query-able as an external table in Hive.

 Question is: Is there something similar for CREATE EXTERNAL TABLE ??

 Something like

 CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
 ...
 STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]
 [USING JAR|FILE|ARCHIVE 'file_uri' [, JAR|FILE|ARCHIVE 'file_uri'] ];


 Is this something for which there is already a JIRA (couldn't find it)?
 If not; Should I create one? (I.e. do you think this would make sense for
 others?)

 Niels Basjes


 On Fri, Jan 2, 2015 at 9:00 PM, Yakubovich, Alexey 
 alexey.yakubov...@searshc.com wrote:

  Try to look here:
 http://stackoverflow.com/questions/14032924/how-to-add-serde-jar

  Another advice: insert your ADD JAR commands in your $HOME/.hiverc file
 and start hive. (
 http://mail-archives.apache.org/mod_mbox/hive-user/201303.mbox/%3ccamgr+0h3smdw4zhtpyo5b1b4iob05bpw8ls+daeh595qzid...@mail.gmail.com%3E
 )



   From: Ted Yu yuzhih...@gmail.com
 Reply-To: user@hadoop.apache.org user@hadoop.apache.org
 Date: Wednesday, December 31, 2014 at 8:25 AM
 To: d...@hive.apache.org d...@hive.apache.org
 Subject: Fwd: way to add custom udf jar in hadoop 2.x version

   Forwarding Niels' question to hive mailing list.

 On Wed, Dec 31, 2014 at 1:24 AM, Niels Basjes ni...@basjes.nl wrote:

 Thanks for the pointer.
 This seems to work for functions. Is there something similar for CREATE
 EXTERNAL TABLE ??

 Niels
   On Dec 31, 2014 8:13 AM, Ted Yu yuzhih...@gmail.com wrote:

  Have you seen this thread ?

 http://search-hadoop.com/m/8er9TcALc/Hive+udf+custom+jarsubj=Best+way+to+add+custom+UDF+jar+in+HiveServer2

 On Dec 30, 2014, at 10:56 PM, reena upadhyay reena2...@gmail.com
 wrote:

   Hi,

  I am using hadoop 2.4.0 version. I have created custom udf jar. I am
 trying to execute a simple select udf query using java hive jdbc client
 program. When hive execute the query using map reduce job, then the query
 execution get fails because the mapper is not able to locate the udf class.
 So I wanted to add the udf jar in hadoop environment permanently.
 Please suggest me a way to add this external jar for single node and multi
 node hadoop cluster.

  PS: I am using hive 0.13.1 version and I already have this custom udf
 jar added in HIVE_HOME/lib directory.


  Thanks






 --
 Best regards / Met vriendelijke groeten,

 Niels Basjes




-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


Re: way to add custom udf jar in hadoop 2.x version

2014-12-31 Thread Niels Basjes
Thanks for the pointer.
This seems to work for functions. Is there something similar for CREATE
EXTERNAL TABLE ??

Niels
On Dec 31, 2014 8:13 AM, Ted Yu yuzhih...@gmail.com wrote:

 Have you seen this thread ?

 http://search-hadoop.com/m/8er9TcALc/Hive+udf+custom+jarsubj=Best+way+to+add+custom+UDF+jar+in+HiveServer2

 On Dec 30, 2014, at 10:56 PM, reena upadhyay reena2...@gmail.com wrote:

 Hi,

 I am using hadoop 2.4.0 version. I have created custom udf jar. I am
 trying to execute a simple select udf query using java hive jdbc client
 program. When hive execute the query using map reduce job, then the query
 execution get fails because the mapper is not able to locate the udf class.
 So I wanted to add the udf jar in hadoop environment permanently. Please
 suggest me a way to add this external jar for single node and multi node
 hadoop cluster.

 PS: I am using hive 0.13.1 version and I already have this custom udf jar
 added in HIVE_HOME/lib directory.


 Thanks




Re: to all this unsubscribe sender

2014-12-05 Thread Niels Basjes
Yes, I agree. We should accept people as they are.
So perhaps we should increase the hurdle to subscribe in the first place?
Something like adding a question like "What do you do if you want to
unsubscribe from a mailing list?"

That way the people who are lazy or need hand holding are simply unable to
subscribe, thus preventing the people who send these "unsubscribe me"
messages from ever entering the list.


On Fri, Dec 5, 2014 at 2:49 PM, mark charts mcha...@yahoo.com wrote:

 Trained as an electrical engineer I learned again and again from my
 instructors: man is inherently lazy. Maybe that's the case. Also, some
 people have complicated lives and need hand holding. It is best to accept
 people as they are.


   On Friday, December 5, 2014 8:40 AM, Chana chana.ole...@gmail.com
 wrote:


 Actually, it's capisce.

 And, obviously, we are all here because we are learning. But, now that
 instructions have been posted again - I would expect that people would not
 expect to have their
 hands held to do something so simple as unsubscribe. They are supposedly
 technologists. I would think they would be able to do whatever research is
 necessary to
 unsubscribe without flooding the list with requests. It's a fairly common
 operation for anyone who has been on the web for a while.



 On Fri, Dec 5, 2014 at 7:35 AM, mark charts mcha...@yahoo.com wrote:

 Not rude at all. We were not born with any knowledge. We learn as we go.
 We learn by asking. Capiche?


 Mark Charts


   On Friday, December 5, 2014 8:13 AM, Chana chana.ole...@gmail.com
 wrote:


 THANK YOU for posting this Very rude indeed to not learn *for oneself*
 - as opposed to expecting to be spoon fed information like a small child.

 On Fri, Dec 5, 2014 at 7:00 AM, Aleks Laz al-userhad...@none.at wrote:

 Dear wished unsubscriber

 Is it really that hard to start brain vX.X?

 You or anyone which have used your email account, have subscribed to this
 list, as described here.

 http://hadoop.apache.org/mailing_lists.html

 You or this Person must unsubscribe at the same way as the subscription
 was, as described here.

 http://hadoop.apache.org/mailing_lists.html

 In the case that YOU don't know how a mailinglist works, please take a
 look here.

 http://en.wikipedia.org/wiki/Electronic_mailing_list

 Even the month is young there are a lot of

 unsubscribe

 in the archive.

 http://mail-archives.apache.org/mod_mbox/hadoop-user/201412.mbox/thread

 I'm new to this list but from my point of view it is very disrespectful to
 the list members and developers that YOU don't invest a little bit of time
 by your self to search how you can unsubscribe from a list on which YOU
 have subscribed or anyone which have used your email account.

 cheers Aleks










-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


Re: to all this unsubscribe sender

2014-12-05 Thread Niels Basjes
+1

On Fri, Dec 5, 2014 at 4:28 PM, Ted Yu yuzhih...@gmail.com wrote:

 +1

 On Fri, Dec 5, 2014 at 7:22 AM, Aleks Laz al-userhad...@none.at wrote:

 +1

 Am 05-12-2014 16:12, schrieb mark charts:

  I concur. Good idea.


   On Friday, December 5, 2014 10:10 AM, Amjad Syed amjad...@gmail.com
 wrote:


   My take on this is simple. The owner/maintainer / administrator of the
 list can implement a filter where if the subject of the email has word
 unsubscribe . That email gets blocked and sender gets automated reply
 stating that if you want to unsubscribe use the unsubscribe  list.
  On 5 Dec 2014 18:05, Niels Basjes ni...@basjes.nl wrote:

  Yes, I agree. We should accept people as they are.
 So perhaps we should increase the hurdle to subscribe in the first place?
 Something like adding a question like "What do you do if you want to
 unsubscribe from a mailing list?"

 That way the people who are lazy or need hand holding are simply unable
 to subscribe, thus preventing the people who send these "unsubscribe me"
 messages from ever entering the list.


 On Fri, Dec 5, 2014 at 2:49 PM, mark charts mcha...@yahoo.com wrote:

  Trained as an electrical engineer I learned again and again from my
 instructors: man is inherently lazy. Maybe that's the case. Also, some
 people have complicated lives and need hand holding. It is best to accept
 people as they are.


   On Friday, December 5, 2014 8:40 AM, Chana chana.ole...@gmail.com
 wrote:


Actually, it's capisce.

 And, obviously, we are all here because we are learning. But, now that
 instructions have been posted again - I would expect that people would not
 expect to have their
 hands held to do something so simple as unsubscribe. They are
 supposedly technologists. I would think they would be able to do whatever
 research is necessary to
 unsubscribe without flooding the list with requests. It's a fairly common
 operation for anyone who has been on the web for a while.



 On Fri, Dec 5, 2014 at 7:35 AM, mark charts mcha...@yahoo.com wrote:

  Not rude at all. We were not born with any knowledge. We learn as we
 go. We learn by asking. Capiche?


 Mark Charts


   On Friday, December 5, 2014 8:13 AM, Chana chana.ole...@gmail.com
 wrote:


   THANK YOU for posting this Very rude indeed to not learn *for
 oneself* - as opposed to expecting to be spoon fed information like a
 small child.

 On Fri, Dec 5, 2014 at 7:00 AM, Aleks Laz al-userhad...@none.at wrote:

 Dear wished unsubscriber

 Is it really that hard to start brain vX.X?

 You or anyone which have used your email account, have subscribed to this
 list, as described here.

 http://hadoop.apache.org/mailing_lists.html

 You or this Person must unsubscribe at the same way as the subscription
 was, as described here.

 http://hadoop.apache.org/mailing_lists.html

 In the case that YOU don't know how a mailinglist works, please take a
 look here.

 http://en.wikipedia.org/wiki/Electronic_mailing_list

 Even the month is young there are a lot of

 unsubscribe

 in the archive.

 http://mail-archives.apache.org/mod_mbox/hadoop-user/201412.mbox/thread

 I'm new to this list but from my point of view it is very disrespectful
 to the list members and developers that YOU don't invest a little bit of
 time by your self to search how you can unsubscribe from a list on which
 YOU have subscribed or anyone which have used your email account.

 cheers Aleks








 --
 Best regards / Met vriendelijke groeten,

 Niels Basjes







-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


Re: Are these configuration parameters deprecated?

2014-11-14 Thread Niels Basjes
A while ago I found a similar problem;
I was wondering why a lot of tools like the hdfs and hbase shells, pig and
many others complain at startup about deprecated parameters.
It turns out that these deprecated names are still in the *-default.xml files
and in various other places in the code base.
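
A quick way to see that mechanism in action (a small sketch; it assumes the
mapred.temp.dir -> mapreduce.cluster.temp.dir mapping shown in the DeprecationDelta
quoted below):

import org.apache.hadoop.conf.Configuration;

public class DeprecationDemo {
  public static void main(String[] args) {
    Configuration conf = new Configuration();

    // Touching the old key triggers the "... is deprecated. Instead, use ..." warning,
    // because the deprecation tables (and several *-default.xml files) still carry the old names.
    conf.set("mapred.temp.dir", "/tmp/example");

    System.out.println("deprecated? " + Configuration.isDeprecated("mapred.temp.dir"));
    // Assumption: MRConfig.TEMP_DIR is "mapreduce.cluster.temp.dir".
    System.out.println("mapreduce.cluster.temp.dir = " + conf.get("mapreduce.cluster.temp.dir"));
  }
}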

Perhaps an issue indicating that the use of the deprecated parameters
should be removed from the main code base is in order here.

Niels Basjes

On Fri, Nov 14, 2014 at 9:22 PM, Tianyin Xu t...@cs.ucsd.edu wrote:

 Hi,

 I'm very confused by some of the MapReduce configuration parameters
 which appear in the latest version of mapred-default.xml.

 http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml

 Take mapreduce.task.tmp.dir as an example, I fail to find its usage
 in code but

 /* mapreduce/util/ConfigUtil.java */
  new DeprecationDelta("mapred.temp.dir",
    MRConfig.TEMP_DIR),

 My interpretation is that it's renamed into mapred.temp.dir.
 However, when I grep the new name, I still cannot find any code except
 some testing ones in

 ./hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/BenchmarkThroughput.java

 From the semantics, this should be a must-to-have parameter for MR
 jobs...

 Also, many parameters are like this. So I'm really confused.

 Am I missing something? Are these parameters deprecated?

 Thanks a lot!
 Tianyin




-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


Re: Spark vs Tez

2014-10-19 Thread Niels Basjes
Very interesting!
What makes Tez more scalable than Spark?
What architectural thing makes the difference?

Niels Basjes
On Oct 19, 2014 3:07 AM, Jeff Zhang zjf...@gmail.com wrote:

 Tez has a feature called pre-warm which will launch JVM before you use it
 and you can reuse the container afterwards. So it is also suitable for
 interactive queries and is more stable and scalable than spark IMO.

 On Sat, Oct 18, 2014 at 4:22 PM, Niels Basjes ni...@basjes.nl wrote:

 It is my understanding that one of the big differences between Tez and
 Spark is that a Tez based query still has the startup overhead of
 starting JVMs on the Yarn cluster. Spark based queries are immediately
 executed on already running JVMs.

 So for interactive dashboards Spark seems more suitable.

 Did I understand correctly?

 Niels Basjes
 On Oct 17, 2014 8:30 PM, Gavin Yue yue.yuany...@gmail.com wrote:

 Spark and tez both make MR faster, this has no doubt.

 They also provide new features like DAG, which is quite important for
 interactive query processing.  From this perspective, you could view them
 as a wrapper around MR and try to handle the intermediary buffer(files)
 more efficiently.  It is a big pain in MR.

 Also they both try to use Memory as the buffer instead of only
 filesystems.   Spark has a concept RDD, which is quite interesting and also
 limited.



 On Fri, Oct 17, 2014 at 11:23 AM, Adaryl Bob Wakefield, MBA 
 adaryl.wakefi...@hotmail.com wrote:

   It was my understanding that Spark is faster batch processing. Tez
 is the new execution engine that replaces MapReduce and is also supposed to
 speed up batch processing. Is that not correct?
 B.



  *From:* Shahab Yunus shahab.yu...@gmail.com
 *Sent:* Friday, October 17, 2014 1:12 PM
 *To:* user@hadoop.apache.org
 *Subject:* Re: Spark vs Tez

  What aspects of Tez and Spark are you comparing? They have different
 purposes and thus not directly comparable, as far as I understand.

 Regards,
 Shahab

 On Fri, Oct 17, 2014 at 2:06 PM, Adaryl Bob Wakefield, MBA 
 adaryl.wakefi...@hotmail.com wrote:

   Does anybody have any performance figures on how Spark stacks up
 against Tez? If you don’t have figures, does anybody have an opinion? 
 Spark
 seems so popular but I’m not really seeing why.
 B.







 --
 Best Regards

 Jeff Zhang



Re: Spark vs Tez

2014-10-18 Thread Niels Basjes
It is my understanding that one of the big differences between Tez and
Spark is that a Tez based query still has the startup overhead of
starting JVMs on the Yarn cluster. Spark based queries are immediately
executed on already running JVMs.

So for interactive dashboards Spark seems more suitable.

Did I understand correctly?

Niels Basjes
On Oct 17, 2014 8:30 PM, Gavin Yue yue.yuany...@gmail.com wrote:

 Spark and tez both make MR faster, this has no doubt.

 They also provide new features like DAG, which is quite important for
 interactive query processing.  From this perspective, you could view them
 as a wrapper around MR and try to handle the intermediary buffer(files)
 more efficiently.  It is a big pain in MR.

 Also they both try to use Memory as the buffer instead of only
 filesystems.   Spark has a concept RDD, which is quite interesting and also
 limited.



 On Fri, Oct 17, 2014 at 11:23 AM, Adaryl Bob Wakefield, MBA 
 adaryl.wakefi...@hotmail.com wrote:

   It was my understanding that Spark is faster batch processing. Tez is
 the new execution engine that replaces MapReduce and is also supposed to
 speed up batch processing. Is that not correct?
 B.



  *From:* Shahab Yunus shahab.yu...@gmail.com
 *Sent:* Friday, October 17, 2014 1:12 PM
 *To:* user@hadoop.apache.org
 *Subject:* Re: Spark vs Tez

  What aspects of Tez and Spark are you comparing? They have different
 purposes and thus not directly comparable, as far as I understand.

 Regards,
 Shahab

 On Fri, Oct 17, 2014 at 2:06 PM, Adaryl Bob Wakefield, MBA 
 adaryl.wakefi...@hotmail.com wrote:

   Does anybody have any performance figures on how Spark stacks up
 against Tez? If you don’t have figures, does anybody have an opinion? Spark
 seems so popular but I’m not really seeing why.
 B.







Re: Bzip2 files as an input to MR job

2014-09-22 Thread Niels Basjes
Hi,

You can use the GZip compression inside the Avro files and still have
splittable Avro files.
This has to do with the fact that there is a block structure inside the
Avro files and these blocks are compressed individually.
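
For illustration, this is roughly how such a file is written with the plain Avro API;
the codec compresses each block, not the file as a whole, which is why the result
stays splittable (the schema and output path are made up):

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class WriteCompressedAvro {
  public static void main(String[] args) throws Exception {
    // Hypothetical one-field schema.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Event\","
        + "\"fields\":[{\"name\":\"line\",\"type\":\"string\"}]}");

    DataFileWriter<GenericRecord> writer =
        new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
    writer.setCodec(CodecFactory.deflateCodec(6)); // gzip-style compression, per Avro block
    writer.create(schema, new File("/tmp/events.avro"));

    GenericRecord record = new GenericData.Record(schema);
    record.put("line", "example payload");
    writer.append(record);
    writer.close();
  }
}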

I suggest you simply try it.

Niels


On Mon, Sep 22, 2014 at 4:40 PM, Georgi Ivanov iva...@vesseltracker.com
wrote:

 Hi guys,
 I would like to compress the files on HDFS to save some storage.

 As far as i see bzip2 is the only format which is splitable (and slow).

 The actual files are Avro.

 So in my driver class i have :

 job.setInputFormatClass(AvroKeyInputFormat.class);

 I have number of jobs running processing Avro files so i would like to
 keep the code change to a minimum.

 Is it possible to comrpess these avro files with bzip2 and keep the code
 of MR jobs the same (or with little change)
 If it is , please give me some hints as so far i don't seem to find any
 good resources on the Internet.


 Georgi




-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


Generating mysql or sqlite datafiles from Hadoop (Java)?

2013-09-17 Thread Niels Basjes
Hi,

I remember hearing a while ago that (if I remember correctly) Facebook
had an outputformat that wrote the underlying MySQL database files
directly from a MapReduce job.

For my purpose an sqlite datafile would be good enough too.

I've been unable to find any of those two solutions by simply googling.
Does anyone know where I can find such a thing?

-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


Re: Why LineRecordWriter.write(..) is synchronized

2013-08-11 Thread Niels Basjes
I expect the impact on the IO speed to be almost 0 because waiting for a
single disk seek is longer than many thousands of calls to a synchronized
method.

Niels
On Aug 11, 2013 3:00 PM, Harsh J ha...@cloudera.com wrote:

 Yes, I feel we could discuss this over a JIRA to remove it if it hurts
 perf. too much, but it would have to be a marked incompatible change,
 and we have to add a note about the lack of thread safety in the
 javadoc of base Mapper/Reducer classes.

 On Sun, Aug 11, 2013 at 1:26 PM, Sathwik B P sathwik...@gmail.com wrote:
  Hi Harsh,
 
  Does it make any sense to keep the method in LRW still synchronized.
 Isn't
  it creating unnecessary overhead for non multi threaded implementations.
 
  regards,
  sathwik
 
 
  On Fri, Aug 9, 2013 at 7:16 AM, Harsh J ha...@cloudera.com wrote:
 
  I suppose I should have been clearer. There's no problem out of box if
  people stick to the libraries we offer :)
 
  Yes the LRW was marked synchronized at some point over 8 years ago [1]
  in support for multi-threaded maps, but the framework has changed much
  since then. The MultithreadedMapper/etc. API we offer now
  automatically shields the devs away from having to think of output
  thread safety [2].
 
  I can imagine there can only be a problem if a user writes their own
  unsafe multi threaded task. I suppose we could document that in the
  Mapper/MapRunner and Reducer APIs.
 
  [1] - http://svn.apache.org/viewvc?view=revision&revision=171186 -
  Commit added a synchronized to the write call.
  [2] - MultiThreadedMapper/etc. synchronize over the collector -
 
 
 http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/map/MultithreadedMapper.java?view=markup
 
  On Thu, Aug 8, 2013 at 7:52 PM, Azuryy Yu azury...@gmail.com wrote:
   sequence writer is also synchronized, I dont think this is bad.
  
   if you call HDFS api to write concurrently, then its necessary.
  
   On Aug 8, 2013 7:53 PM, Jay Vyas jayunit...@gmail.com wrote:
  
   Then is this a bug?  Synchronization in absence of any race condition
   is
   normally considered bad.
  
   In any case id like to know why this writer is synchronized whereas
 the
   other one are not.. That is, I think, then point at issue: either
 other
   writers should be synchronized or else this one shouldn't be -
   consistency
   across the write implementations is probably desirable so that
 changes
   to
   output formats or record writers don't lead to bugs in multithreaded
   environments .
  
   On Aug 8, 2013, at 6:50 AM, Harsh J ha...@cloudera.com wrote:
  
   While we don't fork by default, we do provide a MultithreadedMapper
   implementation that would require such synchronization. But if you
 are
   asking is it necessary, then perhaps the answer is no.
  
   On Aug 8, 2013 3:43 PM, Azuryy Yu azury...@gmail.com wrote:
  
   its not hadoop forked threads, we may create a line record writer,
   then
   call this writer concurrently.
  
   On Aug 8, 2013 4:00 PM, Sathwik B P sathwik...@gmail.com wrote:
  
   Hi,
   Thanks for your reply.
   May I know where does hadoop fork multiple threads to use a single
   RecordWriter.
  
   regards,
   sathwik
  
   On Thu, Aug 8, 2013 at 7:06 AM, Azuryy Yu azury...@gmail.com
 wrote:
  
   because we may use multi-threads to write a single file.
  
   On Aug 8, 2013 2:54 PM, Sathwik B P sath...@apache.org wrote:
  
   Hi,
  
   LineRecordWriter.write(..) is synchronized. I did not find any
   other
   RecordWriter implementations define the write as synchronized.
   Any specific reason for this.
  
   regards,
   sathwik
  
  
  
 
 
 
  --
  Harsh J
 
 



 --
 Harsh J



Re: Why LineRecordWriter.write(..) is synchronized

2013-08-08 Thread Niels Basjes
I may be nitpicking here, but if "perhaps the answer is no" then I conclude:
perhaps the other implementations of RecordWriter are a race condition / file
corruption waiting to happen.


On Thu, Aug 8, 2013 at 12:50 PM, Harsh J ha...@cloudera.com wrote:

 While we don't fork by default, we do provide a MultithreadedMapper
 implementation that would require such synchronization. But if you are
 asking is it necessary, then perhaps the answer is no.
 On Aug 8, 2013 3:43 PM, Azuryy Yu azury...@gmail.com wrote:

 its not hadoop forked threads, we may create a line record writer, then
 call this writer concurrently.
 On Aug 8, 2013 4:00 PM, Sathwik B P sathwik...@gmail.com wrote:

 Hi,
 Thanks for your reply.
 May I know where does hadoop fork multiple threads to use a single
 RecordWriter.

 regards,
 sathwik

 On Thu, Aug 8, 2013 at 7:06 AM, Azuryy Yu azury...@gmail.com wrote:

 because we may use multi-threads to write a single file.
 On Aug 8, 2013 2:54 PM, Sathwik B P sath...@apache.org wrote:

 Hi,

 LineRecordWriter.write(..) is synchronized. I did not find any other
 RecordWriter implementations define the write as synchronized.
 Any specific reason for this.

 regards,
 sathwik





-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


Re: Why LineRecordWriter.write(..) is synchronized

2013-08-08 Thread Niels Basjes
I would say yes make this a Jira.
The actual change can fall (as proposed by Jay) in two directions: Put
in synchronization
in all implementations OR take it out of all implementations.

I think the first thing to determine is why the synchronization was put
into the  LineRecordWriter in the first place.

https://github.com/apache/hadoop-common/blame/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/TextOutputFormat.java

The oldest I have been able to find is a commit on 2009-05-18 for
HADOOP-4687 that is about moving stuff around (i.e. this code is even older
than that).
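
As a reminder of the situation that synchronization protects against (several map
threads sharing one writer instance), here is a toy illustration in plain Java; it is
not the actual Hadoop code:

import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

public class SharedWriterDemo {
  // Mimics the shape of LineRecordWriter: one synchronized call per record.
  static class LineWriter {
    private final Writer out;
    LineWriter(Writer out) { this.out = out; }

    public synchronized void write(String key, String value) throws IOException {
      // Without 'synchronized', two threads could interleave these writes
      // and produce corrupted (mixed-up) lines.
      out.write(key);
      out.write('\t');
      out.write(value);
      out.write('\n');
    }
  }

  public static void main(String[] args) throws InterruptedException {
    final LineWriter writer = new LineWriter(new StringWriter());
    Runnable task = new Runnable() {
      public void run() {
        try {
          for (int i = 0; i < 1000; i++) {
            writer.write("key", "value");
          }
        } catch (IOException e) {
          throw new RuntimeException(e);
        }
      }
    };
    Thread a = new Thread(task);
    Thread b = new Thread(task);
    a.start(); b.start();
    a.join(); b.join();
    System.out.println("No mixed-up lines, thanks to the synchronized write.");
  }
}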

Niels



On Thu, Aug 8, 2013 at 2:21 PM, Sathwik B P sath...@apache.org wrote:

 Hi Harsh,

 Do you want me to raise a Jira on this.

 regards,
 sathwik


 On Thu, Aug 8, 2013 at 5:23 PM, Jay Vyas jayunit...@gmail.com wrote:

 Then is this a bug?  Synchronization in absence of any race condition is
 normally considered bad.

 In any case id like to know why this writer is synchronized whereas the
 other one are not.. That is, I think, then point at issue: either other
 writers should be synchronized or else this one shouldn't be - consistency
 across the write implementations is probably desirable so that changes to
 output formats or record writers don't lead to bugs in multithreaded
 environments .

 On Aug 8, 2013, at 6:50 AM, Harsh J ha...@cloudera.com wrote:

 While we don't fork by default, we do provide a MultithreadedMapper
 implementation that would require such synchronization. But if you are
 asking is it necessary, then perhaps the answer is no.
 On Aug 8, 2013 3:43 PM, Azuryy Yu azury...@gmail.com wrote:

 its not hadoop forked threads, we may create a line record writer, then
 call this writer concurrently.
 On Aug 8, 2013 4:00 PM, Sathwik B P sathwik...@gmail.com wrote:

 Hi,
 Thanks for your reply.
 May I know where does hadoop fork multiple threads to use a single
 RecordWriter.

 regards,
 sathwik

 On Thu, Aug 8, 2013 at 7:06 AM, Azuryy Yu azury...@gmail.com wrote:

 because we may use multi-threads to write a single file.
 On Aug 8, 2013 2:54 PM, Sathwik B P sath...@apache.org wrote:

 Hi,

 LineRecordWriter.write(..) is synchronized. I did not find any other
 RecordWriter implementations define the write as synchronized.
 Any specific reason for this.

 regards,
 sathwik






-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


Re: Is there any way to use a hdfs file as a Circular buffer?

2013-07-24 Thread Niels Basjes
A circular file on hdfs is not possible.

Some of the ways around this limitation:
- Create a series of files and delete the oldest file when you have too
many (a minimal sketch of this option follows below).
- Put the data into an HBase table and do something similar.
- Use a completely different technology like MongoDB which has built-in
support for a circular buffer (a capped collection).
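
A minimal sketch of the first option (the directory, the retention count and the
run-it-periodically part are up to you):

import java.util.Arrays;
import java.util.Comparator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RollingHdfsFiles {
  public static void main(String[] args) throws Exception {
    int maxFiles = 24;                          // hypothetical retention
    Path dir = new Path("/data/rolling");       // hypothetical directory
    FileSystem fs = FileSystem.get(new Configuration());

    FileStatus[] files = fs.listStatus(dir);
    Arrays.sort(files, new Comparator<FileStatus>() {
      public int compare(FileStatus a, FileStatus b) {
        return Long.compare(a.getModificationTime(), b.getModificationTime());
      }
    });

    // Delete the oldest files until we are back under the limit.
    for (int i = 0; i < files.length - maxFiles; i++) {
      fs.delete(files[i].getPath(), false);
    }
  }
}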

Niels

Hi all,
   Is there any way to use a hdfs file as a Circular buffer? I mean,
if I set a quotas to a directory on hdfs, and writting data to a file
in that directory continuously. Once the quotas exceeded, I can
redirect the writter and write the data from the beginning of the file
automatically .


Use a URL for the HADOOP_CONF_DIR?

2013-07-15 Thread Niels Basjes
Hi,

When giving users access to an Hadoop cluster they need a few XML config
files (like the hadoop-site.xml ).
They put these somewhere on they PC and start running their jobs on the
cluster.

Now when you're changing the settings you want those users to use (for
example you changed some tcpport) you need them all to update their config
files.

My question is: Can you set the HADOOP_CONF_DIR to be a URL on a webserver?
A while ago I tried this and (back then) it didn't work.

Would this be a useful enhancement?

-- 
Best regards,

Niels Basjes


Running a single cluster in multiple datacenters

2013-07-15 Thread Niels Basjes
Hi,

Last week we had a discussion at work regarding setting up our new Hadoop
cluster(s).
One of the things that has changed is that the importance of the Hadoop
stack is growing so we want to be more available.

One of the points we talked about was setting up the cluster in such a way
that the nodes are physically located in two separate datacenters (on
opposite sides of the same city) with a big network connection in between.
We're currently talking about a cluster in the 50 nodes range, but that
will grow over time.

The advantages I see:
- More CPU power available for jobs.
- The data is automatically copied between the datacenters as long as we
configure them to be different 'racks'.

The disadvantages I see:
- If the network goes out then one half is dead and the other half will
most likely go to safemode because the recovering of the missing replicas
will fill up the disks fast.

What things should we consider also?
Has anyone any experience with such a setup?
Is it a good idea to do this?
What are better options for us to consider?

Thanks for any input.
-- 
Best regards,

Niels Basjes


Re: Inputformat

2013-06-21 Thread Niels Basjes
If you try to hammer in a nail (json file) with a screwdriver (
XMLInputReader) then perhaps the reason it won't work may be that you are
using the wrong tool?
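
If every record really is one line of JSON then the plain TextInputFormat plus a JSON
parser inside the mapper is usually all that is needed. A minimal sketch (it assumes
the org.json JSONObject class; the "id" field is made up):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.json.JSONObject;

// Parses one JSON object per input line; no custom InputFormat is needed for that case.
public class JsonLineMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    try {
      JSONObject json = new JSONObject(value.toString());
      // "id" is a hypothetical field in the input records.
      context.write(new Text(json.optString("id")), value);
    } catch (org.json.JSONException e) {
      // Skip malformed lines in this sketch.
    }
  }
}
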
 On Jun 21, 2013 11:38 PM, jamal sasha jamalsha...@gmail.com wrote:

 Hi,

   I am using one of the libraries which rely on InputFormat.
 Right now, it is reading xml files spanning across mutiple lines.
 So currently the input format is like:

 public class XMLInputReader extends FileInputFormat<LongWritable, Text> {

   public static final String START_TAG = "<page>";
   public static final String END_TAG = "</page>";

   @Override
   public RecordReader<LongWritable, Text> getRecordReader(InputSplit split,
   JobConf conf, Reporter reporter) throws IOException {
 conf.set(XMLInputFormat.START_TAG_KEY, START_TAG);
 conf.set(XMLInputFormat.END_TAG_KEY, END_TAG);
 return new XMLRecordReader((FileSplit) split, conf);
   }
 }
 So, in the above, if the data is like:

 <page>

  something \n
 something \n

 </page>

 It processes this sort of data..


 Now, I want to use the same framework but for JSON files spanning just a
 single line..

 So I guess my START_TAG can be "{"

 Will my END_TAG be "}\n" ?

 it can't be "}" as there can be nested JSON in this data?

 Any clues
 Thanks



Re: gz containing null chars?

2013-06-10 Thread Niels Basjes
My best guess is that at a low level a string is often terminated by having
a null byte at the end.
Perhaps that's where the difference lies.
Perhaps the gz decompressor simply stops at the null byte and the basic
record reader that follows simply continues.
In this situation your input file contains bytes that should not occur in
an ASCII file (like the json file you have) and as such you can expect the
unexpected ;)

Niels
On Jun 10, 2013 7:24 PM, William Oberman ober...@civicscience.com wrote:

 I posted this to the pig mailing list, but it might be more related to
 hadoop itself, I'm not sure.

 Quick recap: I had a file of \n separated lines of JSON.  I decided to
 compress it to save on storage costs.  After compression I got a different
 answer for a pig query that basically == count lines.

 After a lot of digging, I found an input file that had a line that is a
 huge block of null characters followed by a \n.  I wrote scripts to
 examine the file directly, and if I stop counting at the weird line, I get
 the same count as what pig claims for that file.   If I count all lines
 (e.g. don't fail at the corrupt line) I get the uncompressed count pig
 claims.

 I don't know how to debug hadoop/pig quite as well, though I'm trying now.
  But, my working theory is that some combination of pig/hadoop aborts
 processing the gz stream on a null character (or something like that), but
 keeps chugging on a non-gz stream.  Does that sound familiar or make sense
 to anyone?

 will



Re: Reducer to output only json

2013-06-04 Thread Niels Basjes
Have you tried something like this (i do not have a pc here to check this
code)

context.write(NullWritable.get(), new Text(jsn.toString()));
On Jun 4, 2013 8:10 PM, Chengi Liu chengi.liu...@gmail.com wrote:

 Hi,

  I have the following redcuer class

 public static class TokenCounterReducer
 extends Reducer<Text, Text, Text, Text> {
 public void reduce(Text key, Iterable<Text> values, Context context)
 throws IOException, InterruptedException {

 //String[] fields = s.split("\t", -1)
 JSONObject jsn = new JSONObject();
 int sum = 0;
 for (Text value : values) {
 String[] vals = value.toString().split("\t");
 String[] targetNodes = vals[0].toString().split(",", -1);
 jsn.put("source", vals[1]);
 jsn.put("target", targetNodes);
 //sum += value.get();
 }
// context.write(key, new Text(sum));
 }
 }

 I want to save that json to hdfs?

 It was very trivial in hadoop streaming.. but how do i do it in hadoop
 java?
 Thanks



Re: Experimental Hadoop Cluster - Linux Windows machines

2013-06-01 Thread Niels Basjes
My first suggestion is to go for CentOS as it is free and almost the same
as RHEL.
Also having a 64 bit OS lets you use a bit more of the installed memory

Then if you can simply install CentOS on these machines and have the
network running you should be fine.

I have been running experiments on something identical to what you are
describing here.

Niels Basjes


On Sat, Jun 1, 2013 at 9:47 PM, Rody BigData rodybigd...@gmail.com wrote:



 I have some old ( not very old - each of 4GB RAM with a decent processor
 etc., and working fine till now ) Dell Windows XP machines and want to
 convert them to a Red Hat Enterprise Linux (RHEL) for a Hadoop cluster for
 my experimental purposes.

 Can you give me some suggestions on how to proceed on this plan? My
 initial plan is to target 4 to 5 machines.

 Does the hardware on which Windows XP machine is based supports RHEL?

 Are there any other drivers or software should we install before
 installing RHEL on this machines?

 How easy/difficult is this, if some one like a Linux Systems Administrator
 is involved in this?





-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


Re: Experimental Hadoop Cluster - Linux Windows machines

2013-06-01 Thread Niels Basjes
I've installed CentOS on several different types of old (originally Windows
XP)  Dell desktops for the last 4 years (i.e. desktops as old as 7 years
ago) and so far installing CentOS was as easy as booting from the
installation CD/DVD and doing next, next, finish.

The only thing that you may run into is that the specific cpu really
doesn't support 64bit.
You'll get an error very early in the installation process.

Niels
On Jun 1, 2013 10:10 PM, Rody BigData rodybigd...@gmail.com wrote:

 So what is the procedure for installing CentOS on a Windows XP machines (
 want to format XP ) . Is it a complicated procedure? Some kind of drivers
 and issues, is it common to expect ?


 On Sat, Jun 1, 2013 at 4:05 PM, Marco Shaw marco.s...@gmail.com wrote:

 If you're running XP, be aware of the others suggesting 64-bit.  That
 depends on the exact proc you're running.

 You sort of need to break this down into first determining how to get
 Linux on your systems.

 RHEL is pretty costly for a test, unless you've got that covered.  Go
 with CentOS for a proof of concept.

 Get that covered first, then move on to determining how to get Hadoop on
 them.

 Marco



 On Sat, Jun 1, 2013 at 4:47 PM, Rody BigData rodybigd...@gmail.comwrote:



 I have some old ( not very old - each of 4GB RAM with a decent processor
 etc., and working fine till now ) Dell Windows XP machines and want to
 convert them to a Red Hat Enterprise Linux (RHEL) for a Hadoop cluster for
 my experimental purposes.

 Can you give me some suggestions on how to proceed on this plan? My
 initial plan is to target 4 to 5 machines.

 Does the hardware on which Windows XP machine is based supports RHEL?

 Are there any other drivers or software should we install before
 installing RHEL on this machines?

 How easy/difficult is this, if some one like a Linux Systems
 Administrator is involved in this?







Re: Configuring SSH - is it required? for a psedo distriburted mode?

2013-05-19 Thread Niels Basjes
I never configure the ssh feature.
Not for running on a single node and not for a full size cluster.
I simply start all the required daemons (name/data/job/task) and configure
them on which ports each can be reached.
(SSH is only needed by the start-all.sh / slaves.sh convenience scripts,
which simply ssh into every node to start those daemons for you.)

Niels Basjes
 On May 16, 2013 4:55 PM, Raj Hadoop hadoop...@yahoo.com wrote:

  Hi,

 I have a dedicated user on Linux server for hadoop. I am installing it in
 psedo distributed mode on this box. I want to test my programs on this
 machine. But i see that in installation steps - they were mentioned that
 SSH needs to be configured. If it is single node, I dont require it
 ...right? Please advise.

 I was looking at this site

 http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/

 It mentioned it like this -
 
 Hadoop requires SSH access to manage its nodes, i.e. remote machines plus
 your local machine if you want to use Hadoop on it (which is what we want
 to do in this short tutorial). For our single-node setup of Hadoop, we
 therefore need to configure SSH access to localhost for the hduser user
 we created in the previous section.
 

 Thanks,
 Raj




Re: how to get the time of a hadoop cluster, v0.20.2

2013-05-16 Thread Niels Basjes
If you make sure that everything uses NTP then this becomes an irrelevant
distinction.


On Thu, May 16, 2013 at 4:01 PM, Jane Wayne jane.wayne2...@gmail.comwrote:

 yes, but that gets the current time on the server, not the hadoop cluster.
 i need to be able to probe the date/time of the hadoop cluster.


 On Tue, May 14, 2013 at 5:09 PM, Niels Basjes ni...@basjes.nl wrote:

  I made a typo. I meant API (instead of SPI).
 
  Have a look at this for more information:
 
 
 http://stackoverflow.com/questions/833768/java-code-for-getting-current-time
 
 
  If you have a client that is not under NTP then that should be the way to
  fix your issue.
  Once you  have that getting the current time is easy.
 
  Niels Basjes
 
 
 
  On Tue, May 14, 2013 at 5:46 PM, Jane Wayne jane.wayne2...@gmail.com
  wrote:
 
   niels,
  
   i'm not familiar with the native java spi. spi = service provider
   interface? could you let me know if this spi is part of the hadoop
   api? if so, which package/class?
  
   but yes, all nodes on the cluster are using NTP to synchronize time.
   however, the server (which is not a part of the hadoop cluster)
   accessing/interfacing with the hadoop cluster cannot be assumed to be
   using NTP. will this still make a difference? and actually, this is
   the primary reason why i need to get the date/time of the hadoop
   cluster (need to check if the date/time of the hadooop cluster is in
   sync with the server).
  
  
  
   On Tue, May 14, 2013 at 11:38 AM, Niels Basjes ni...@basjes.nl
 wrote:
If you have all nodes using NTP then you can simply use the native
 Java
   SPI
to get the current system time.
   
   
On Tue, May 14, 2013 at 4:41 PM, Jane Wayne 
 jane.wayne2...@gmail.com
   wrote:
   
hi all,
   
is there a way to get the current time of a hadoop cluster via the
api? in particular, getting the time from the namenode or jobtracker
would suffice.
   
i looked at JobClient but didn't see anything helpful.
   
   
   
   
--
Best regards / Met vriendelijke groeten,
   
Niels Basjes
  
 
 
 
  --
  Best regards / Met vriendelijke groeten,
 
  Niels Basjes
 




-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


Re: how to get the time of a hadoop cluster, v0.20.2

2013-05-14 Thread Niels Basjes
If you have all nodes using NTP then you can simply use the native Java SPI
to get the current system time.


On Tue, May 14, 2013 at 4:41 PM, Jane Wayne jane.wayne2...@gmail.comwrote:

 hi all,

 is there a way to get the current time of a hadoop cluster via the
 api? in particular, getting the time from the namenode or jobtracker
 would suffice.

 i looked at JobClient but didn't see anything helpful.




-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


Re: how to get the time of a hadoop cluster, v0.20.2

2013-05-14 Thread Niels Basjes
I made a typo. I meant API (instead of SPI).

Have a look at this for more information:
http://stackoverflow.com/questions/833768/java-code-for-getting-current-time


If you have a client that is not under NTP then that should be the way to
fix your issue.
Once you  have that getting the current time is easy.

Niels Basjes



On Tue, May 14, 2013 at 5:46 PM, Jane Wayne jane.wayne2...@gmail.comwrote:

 niels,

 i'm not familiar with the native java spi. spi = service provider
 interface? could you let me know if this spi is part of the hadoop
 api? if so, which package/class?

 but yes, all nodes on the cluster are using NTP to synchronize time.
 however, the server (which is not a part of the hadoop cluster)
 accessing/interfacing with the hadoop cluster cannot be assumed to be
 using NTP. will this still make a difference? and actually, this is
 the primary reason why i need to get the date/time of the hadoop
 cluster (need to check if the date/time of the hadooop cluster is in
 sync with the server).



 On Tue, May 14, 2013 at 11:38 AM, Niels Basjes ni...@basjes.nl wrote:
  If you have all nodes using NTP then you can simply use the native Java
 SPI
  to get the current system time.
 
 
  On Tue, May 14, 2013 at 4:41 PM, Jane Wayne jane.wayne2...@gmail.com
 wrote:
 
  hi all,
 
  is there a way to get the current time of a hadoop cluster via the
  api? in particular, getting the time from the namenode or jobtracker
  would suffice.
 
  i looked at JobClient but didn't see anything helpful.
 
 
 
 
  --
  Best regards / Met vriendelijke groeten,
 
  Niels Basjes




-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


Re: How to process only input files containing 100% valid rows

2013-04-19 Thread Niels Basjes
How about a different approach:
If you use the multiple output option you can process the valid lines in a
normal way and put the invalid lines in a special separate output file.
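
A rough sketch of that idea (the named output, the validity check and the output key
are made up):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class ValidatingMapper extends Mapper<LongWritable, Text, Text, Text> {
  private MultipleOutputs<Text, Text> out;

  @Override
  protected void setup(Context context) {
    out = new MultipleOutputs<Text, Text>(context);
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    if (isValid(value.toString())) {
      context.write(new Text("real-key-goes-here"), value); // normal processing path
    } else {
      out.write("invalid", NullWritable.get(), value);      // lands in invalid-m-* files
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    out.close();
  }

  private boolean isValid(String line) {
    return !line.isEmpty(); // hypothetical validity rule
  }

  // In the driver, register the extra output once:
  //   MultipleOutputs.addNamedOutput(job, "invalid",
  //       TextOutputFormat.class, NullWritable.class, Text.class);
}
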
On Apr 18, 2013 9:36 PM, Matthias Scherer matthias.sche...@1und1.de
wrote:

 Hi all,


 In my mapreduce job, I would like to process only whole input files
 containing only valid rows. If one map task processing an input split of a
 file detects an invalid row, the whole file should be “marked” as invalid
 and not processed at all. This input file will then be cleansed by another
 process, and taken again as input to the next run of my mapreduce job.


 My first idea was to set a counter in the mapper after detecting an
 invalid line with the name of the file as the counter name (derived from
 input split). Then additionally put the input filename to the map output
 value (which is already a MapWritable, so adding the filename is no
 problem). And in the reducer I could filter out any rows belonging to the
 counters written in the mapper.


 Each job has some thousand input files. So in the worst case there could
 be as many counters written to mark invalid input files. Is this a feasible
 approach? Does the framework guarantee that all counters written in the
 mappers are synchronized (visible) in the reducers? And could this number
 of counters lead to OOME in the jobtracker?


 Are there better approaches? I could also process the files using a non
 splitable input format. Is there a way to reject the already outputted rows
 of a the map task processing an input split?


 Thanks,

 Matthias




Re: Can I perform a MR on my local filesystem

2013-02-16 Thread Niels Basjes
Have a look at this
http://stackoverflow.com/questions/3546025/is-it-possible-to-run-hadoop-in-pseudo-distributed-operation-without-hdfs
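
The essence of the answer behind that link boils down to two client-side
settings; a minimal sketch with the 2.x property names (on 0.20/1.x the
equivalents are fs.default.name and mapred.job.tracker=local):

import org.apache.hadoop.conf.Configuration;

public class LocalModeConfig {
    public static Configuration create() {
        Configuration conf = new Configuration();
        // Read input and write output on the local filesystem instead of HDFS.
        conf.set("fs.defaultFS", "file:///");
        // Run the MapReduce job in-process instead of submitting it to a cluster.
        conf.set("mapreduce.framework.name", "local");
        return conf;
    }
}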

-- 
Met vriendelijke groet,
Niels Basjes
(Verstuurd vanaf mobiel )
On 17 Feb 2013 07:51, Agarwal, Nikhil nikhil.agar...@netapp.com wrote:

  Hi,

 

 Recently I followed a blog to run Hadoop on a single node cluster.

 I wanted to ask that in a single node set-up of Hadoop is it necessary to
 have the data copied into Hadoop’s HDFS before running a MR on it. Can I
 run MR on my local file system too without copying the data to HDFS? 

 In the Hadoop source code I saw there are implementations of other file
 systems too like S3, KFS, FTP, etc. so how does exactly a MR happen on S3
 data store ? How does JobTracker or Tasktracker run in S3 ? 


 I would be very thankful to get a reply to this.


 Thanks & Regards,

 Nikhil




Re: how to find top N values using map-reduce ?

2013-02-02 Thread Niels Basjes
My suggestion is to use secondary sort with a single reducer. That way you
can easily extract the top N. If you want to get the top N% you'll need an
additional phase to determine how many records this N% really is.
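
A rough sketch of the reducer side of that suggestion, assuming a single
reducer and a sort comparator that delivers the keys in descending order
(the secondary sort part); the key/value types and N are made up:

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Because the keys arrive sorted (descending) and there is only one reducer,
// the first N records it sees are the global top N.
public class TopNReducer extends Reducer<DoubleWritable, Text, DoubleWritable, Text> {
    private static final int N = 10;
    private int emitted = 0;

    @Override
    protected void reduce(DoubleWritable score, Iterable<Text> records, Context context)
            throws IOException, InterruptedException {
        for (Text record : records) {
            if (emitted >= N) {
                return;               // simply stop emitting after N records
            }
            context.write(score, record);
            emitted++;
        }
    }
}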

-- 
Met vriendelijke groet,
Niels Basjes
(Verstuurd vanaf mobiel )
On 2 Feb 2013 12:08, praveenesh kumar praveen...@gmail.com wrote:

 My actual problem is to rank all values and then run logic 1 to top n%
 values and logic 2 to rest values.
 1st  - Ranking ? (need major suggestions here)
 2nd - Find top n% out of them.
 Then rest is  covered.

 Regards
 Praveenesh

 On Sat, Feb 2, 2013 at 1:42 PM, Lake Chang lakech...@gmail.com wrote:
  there's one thing i want to clarify that you can use multi-reducers to
 sort
  the data globally and then cat all the parts to get the top n records.
 The
  data in all parts are globally in order.
  Then you may find the problem is much easier.
 
  On 2013-2-2 3:18 PM, praveenesh kumar praveen...@gmail.com wrote:
 
  Actually what I am trying to find to top n% of the whole data.
  This n could be very large if my data is large.
 
  Assuming I have uniform rows of equal size and if the total data size
  is 10 GB, using the above mentioned approach, if I have to take top
  10% of the whole data set, I need 10% of 10GB which could be rows
  worth of 1 GB (roughly) in my mappers.
  I think that would not be possible given my input splits are of
  64/128/512 MB (based on my block size) or am I making wrong
  assumptions. I can increase the inputsplit size, but is there a better
  way to find top n%.
 
 
  My whole actual problem is to give ranks to some values and then find
  out the top 10 ranks.
 
  I think this context can give more idea about the problem ?
 
  Regards
  Praveenesh
 
  On Sat, Feb 2, 2013 at 11:53 AM, Eugene Kirpichov ekirpic...@gmail.com
 
  wrote:
   Hi,
  
   Can you tell more about:
* How big is N
* How big is the input dataset
* How many mappers you have
* Do input splits correlate with the sorting criterion for top N?
  
   Depending on the answers, very different strategies will be optimal.
  
  
  
   On Fri, Feb 1, 2013 at 9:05 PM, praveenesh kumar
   praveen...@gmail.comwrote:
  
   I am looking for a better solution for this.
  
   1 way to do this would be to find top N values from each mappers and
   then find out the top N out of them in 1 reducer.  I am afraid that
   this won't work effectively if my N is larger than number of values
 in
   my inputsplit (or mapper input).
  
   Otherway is to just sort all of them in 1 reducer and then do the cat
   of
   top-N.
  
   Wondering if there is any better approach to do this ?
  
   Regards
   Praveenesh
  
  
  
  
   --
   Eugene Kirpichov
   http://www.linkedin.com/in/eugenekirpichov
   http://jkff.info/software/timeplotters - my performance visualization
   tools



Re: What is the preferred way to pass a small number of configuration parameters to a mapper or reducer

2012-12-30 Thread Niels Basjes
F. put a mongodb replica set on all hadoop workernodes and let the tasks
query the mongodb at localhost.

(this is what I did recently with a multi GiB dataset)
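
For the small-parameter case the quoted question is about, option A below is
the usual route: a key-value pair in the job configuration. A minimal sketch
(the property name is made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class ConfParamExample {
    // Driver side: store the parameter in the job configuration.
    public static void configure(Job job) {
        job.getConfiguration().set("myapp.threshold", "42");
    }

    // Task side: read it back in the mapper (or reducer) setup.
    public static class MyMapper extends Mapper<Object, Object, Object, Object> {
        private int threshold;

        @Override
        protected void setup(Context context) {
            threshold = context.getConfiguration().getInt("myapp.threshold", 0);
        }
    }
}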

-- 
Met vriendelijke groet,
Niels Basjes
(Verstuurd vanaf mobiel )
Op 30 dec. 2012 20:01 schreef Jonathan Bishop jbishop@gmail.com het
volgende:

 E. Store them in hbase...


 On Sun, Dec 30, 2012 at 12:24 AM, Hemanth Yamijala 
 yhema...@thoughtworks.com wrote:

 If it is a small number, A seems the best way to me.

 On Friday, December 28, 2012, Kshiva Kps wrote:


 Which one is current ..


 What is the preferred way to pass a small number of configuration
 parameters to a mapper or reducer?





 A.  As key-value pairs in the jobconf object.

 B.  As a custom input key-value pair passed to each mapper or reducer.

 C.  Using a plain text file via the DistributedCache, which each mapper or
 reducer reads.

 D.  Through a static variable in the MapReduce driver class (i.e., the class
 that submits the MapReduce job).

 Answer: B







Re: Doubts on compressed file

2012-11-07 Thread Niels Basjes
Hi,

 If a zip file(Gzip) is loaded into HDFS will it get splitted into Blocks and
 store in HDFS?

Yes.

 I understand that a single mapper can work with GZip as it reads the entire
 file from beginning to end... In that case if the GZip file size is larget
 than 128 MB will it get splitted into blocks and stored in HDFS?

Yes, and then the mapper will read the other parts of the file over the network.
So what I do is I upload such files with a bigger HDFS blocksize so
the mapper has the entire file locally.
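
A sketch of that upload trick from the Java API (the same thing can be done
on the command line with hadoop fs -D dfs.block.size=... -put; the paths and
the 2 GB value are just examples):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UploadWithBigBlocks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Make the HDFS block at least as big as the gzipped file,
        // so the single mapper that reads it has the whole file locally.
        conf.setLong("dfs.block.size", 2L * 1024 * 1024 * 1024); // 2 GB
        FileSystem fs = FileSystem.get(conf);
        fs.copyFromLocalFile(new Path("/local/access.log.gz"),
                             new Path("/user/niels/access.log.gz"));
    }
}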

-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


Re: Hadoop Real time help

2012-08-22 Thread Niels Basjes
Thanks for the pointers, I have stuff to read now :)

On Mon, Aug 20, 2012 at 9:37 AM, Bertrand Dechoux decho...@gmail.com wrote:
 The terms are
 * ESP : http://en.wikipedia.org/wiki/Event_stream_processing
 * CEP : http://en.wikipedia.org/wiki/Complex_event_processing

 By the way, processing streams in real time tends toward being a pleonasm.

 MapReduce follows a batch architecture. You keep data until a given time.
 You then process everything. And at the end you provide all the results.
 Stream processing has by definition a more 'smooth' throughput. Each event
 is processed at a time and potentially each processing could lead to a
 result.

 I don't know any complete overview of such tools.
 Esper is well known in that space.
 FlumeBase was an attempt to do something similar (as far as I can tell).
 It shows how an ESP engine fits with log collection using a tool such as
 Flume.

 Then you also have other solutions which will allow you to scale such as
 Storm.
 A few people have already considered using Storm for scalability and Esper
 to do the real computation.

 Regards

 Bertrand


 On Sun, Aug 19, 2012 at 9:44 PM, Niels Basjes ni...@basj.es wrote:

 Is there a complete overview of the tools that allow processing streams
 of data in realtime?

 Or even better; what are the terms to google for?

 --
 Met vriendelijke groet,
 Niels Basjes
 (Verstuurd vanaf mobiel )

  On 19 Aug 2012 18:22, Bertrand Dechoux decho...@gmail.com wrote:

 That's a good question. More and more people are talking about Hadoop
 Real Time.
 One key aspect of this question is whether we are talking about MapReduce
 or not.

 MapReduce greatly improves the response time of any data intensive jobs
 but it is still a batch framework with a noticeable latency.

 There is multiple ways to improve the latency :
 * ESP/CEP solutions (like Esper, FlumeBase, ...)
 * Big Table clones (like HBase ...)
 * YARN with a non MapReduce application
 * ...

 But it will really depend on the context and the definition of 'real
 time'.

 Regards

 Bertrand



 On Sun, Aug 19, 2012 at 5:44 PM, mahout user mahoutu...@gmail.com
 wrote:

 Hello folks,


I am new to hadoop, I just want to get information that how hadoop
 framework is usefull for real time service.?can any one explain me..?

 Thanks.




 --
 Bertrand Dechoux




 --
 Bertrand Dechoux



-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


Re: output/input ratio > 1 for map tasks?

2012-07-30 Thread Niels Basjes
Hi,

On Mon, Jul 30, 2012 at 8:47 PM, brisk mylinq...@gmail.com wrote:
 Does anybody know if there are some cases where the output/input ratio for
 map tasks is larger than 1? I can just think of for the sort, it's 1 and for
 the search job it's usually smaller than 1...

For a simple example: Have a look at the WordCount example.

Input of a single map call is 1 record: "This is a line"
Output is 4 records:
This    1
is      1
a       1
line    1
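
A minimal sketch of the mapper that produces this 1-to-4 expansion:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text line, Context context)
            throws IOException, InterruptedException {
        // One input record ("This is a line") becomes four output records.
        for (String token : line.toString().split("\\s+")) {
            word.set(token);
            context.write(word, ONE);
        }
    }
}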

-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


Making gzip splittable for Hadoop

2012-03-30 Thread Niels Basjes
Hi,

In many Hadoop production environments you get gzipped files as the raw
input. Usually these are Apache HTTPD logfiles. When putting these gzipped
files into Hadoop you are stuck with exactly 1 map task per input file. In
many scenarios this is fine. However when doing a lot of work in this very
first map task it may very well be advantageous to divide the work over
multiple tasks, even if there is a penalty for this scaling out.

I've created an addon for Hadoop that makes this possible.

I've reworked the patch I initially created to be included in hadoop (see
HADOOP-7076).
It can now be used by simply adding a jar file to the classpath of an
existing Hadoop installation.
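
To give an idea of the wiring involved, registering the codec in the job (or
cluster) configuration looks roughly like this; the codec class name is the
one published on the project page and is an assumption here, so verify it
against the documentation:

import org.apache.hadoop.conf.Configuration;

public class EnableSplittableGzip {
    public static void configure(Configuration conf) {
        // Put the splittable codec in front of the standard codecs
        // (assumed class name; see the project documentation).
        conf.set("io.compression.codecs",
            "nl.basjes.hadoop.io.compress.SplittableGzipCodec,"
            + "org.apache.hadoop.io.compress.GzipCodec,"
            + "org.apache.hadoop.io.compress.DefaultCodec");
    }
}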

I put the code on github ( https://github.com/nielsbasjes/splittablegzip )
and (for now) the description on my homepage:
http://niels.basjes.nl/splittable-gzip

This feature only works with Hadoop 0.21 and up (I tested it with Cloudera
CDH4b1).
So for now Hadoop 1.x is not yet supported (waiting for HADOOP-7823).

Running mvn package automatically generates an RPM on my CentOS system.

Have fun with it and let me know what you think.

-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


Re: Merge sorting reduce output files

2012-03-01 Thread Niels Basjes
Hi,

On Thu, Mar 1, 2012 at 00:07, Robert Evans ev...@yahoo-inc.com wrote:

  Sorry it has taken me so long to respond.  Today has been a very crazy
 day.


No worries.


 I am just guessing what your algorithm is for auto-complete.


What we have has a lot more features. Yet the basic idea of what we have is
similar enough to what you describe for this discussion.


  If we want the keys to come out in sorted order, we need to have a
 sequence file with the partition keys for the total order partitioner.
 TeraSort generates a partition file by getting 
 This only really works for Terasort because it assumes that all of the
 partitions are more or less random already.


And that is something I don't have.


 This is the case for the output of a typical map/reduce job where the
 reduce does not change the keys passed in and the output of the reducer is
 less then a block in size.  That sure sounds like what wordcount does to
 me.  The only real way to get around that is to do it as part of a
 map/reduce job, and do some random sampling instead of reading the first N.
  It should be a map/reduce job because it is going to be reading a lot more
 data then TeraSort’s partition generation code.  In this case you would
 have a second M/R job that runs after the first and randomly samples
 words/phrases to work on.  It would then generate the increasing long
 phrases and send them all to a single reducer that would buffer them up,
 and when the Reducer has no more input it would output every Nth key so
 that you get the proper number of partitions for the Reducers.  You could
 sort these keys yourself to be sure, but they should come in in sorted
 order so why bother resorting.

 If my assumptions are totally wrong here please let me know.


I've had a discussion with some coworkers and we came to a possible
solution that is very closely related to your idea.
Because this is a job that runs periodically we think we can assume the
distribution of the dataset will have a similar shape from one run to the
next.
If this assumption holds we can:
1) Create a job that takes the output of run 1 and creates an aggregate that
can be used to partition the dataset
2) Use the partitioning dataset from '1)' to distribute the processing for
the next run.
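
A sketch of how step 2) could be wired up, assuming step 1) wrote a
SequenceFile of split-point keys (the partition file) to a known path; the
key type and the package location of TotalOrderPartitioner differ between
Hadoop versions:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class TotalOrderSetup {
    // Step 2: use the partition file produced by the previous run (step 1)
    // to spread the keys evenly over the reducers while keeping a total order.
    public static void configure(Job job, Path partitionFile, int reducers) {
        job.setNumReduceTasks(reducers);
        job.setMapOutputKeyClass(Text.class);
        job.setPartitionerClass(TotalOrderPartitioner.class);
        TotalOrderPartitioner.setPartitionFile(job.getConfiguration(), partitionFile);
    }
}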

Thanks for your suggestions.

-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


Re: Merge sorting reduce output files

2012-02-29 Thread Niels Basjes
Robert,

On Tue, Feb 28, 2012 at 23:28, Robert Evans ev...@yahoo-inc.com wrote:

  I am not sure I can help with that unless I know better what “a special
 distribution” means.


The thing is that this application is an Auto Complete feature that has a
key that is the letters that have been typed so far.
Now for several reasons we need this to be sorted by length of the input.
So the '1 letter suggestions' first, then the '2 letter suggestions' etc.
I've been trying to come up with an automatic partitioning that would split
the dataset into something like 30 parts that when concatenated do what you
suggest.

Unless you are doing a massive amount of processing in your reducer having
 a partition that is only close to balancing the distribution is a big win
 over all of the other options that put the data on a single machine and
 sort it there.  Even if you are doing a lot of processing in the reducer,
 or you need a special grouping to make the reduce work properly having a
 second map/reduce job to sort the data that is just close to balancing I
 would suspect would beat out all of the other options.


Thanks, this is a useful suggestion. I'll see if there is a pattern in the
data and from there simply manually define the partitions based on the
pattern we find.

-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


Re: Should splittable Gzip be a core hadoop feature?

2012-02-29 Thread Niels Basjes
Hi,

On Wed, Feb 29, 2012 at 13:10, Michel Segel michael_se...@hotmail.comwrote:

 Let's play devil's advocate for a second?


I always like that :)


 Why?


Because then datafiles from other systems (like the Apache HTTP webserver)
can be processed more efficiently, without any preprocessing.

Snappy exists.


Compared to gzip: Snappy is faster, compresses a bit less and is
unfortunately not splittable.

The only advantage is that you don't have to convert from gzip to snappy
 and can process gzip files natively.


Yes, that and the fact that the files are smaller.
Note that I've described some of these considerations in the javadoc.

Next question is how large are the gzip files in the first place?


I work for the biggest webshop in the Netherlands and I'm facing a set of
logfiles that are very often > 1 GB each and are gzipped.
The first thing we do with them is parse and dissect each line in the very
first mapper. Then we store the result in (snappy compressed) avro files.

I don't disagree, I just want to have a solid argument in favor of it...


:)

-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


Re: Should splittable Gzip be a core hadoop feature?

2012-02-29 Thread Niels Basjes
Hi,

On Wed, Feb 29, 2012 at 16:52, Edward Capriolo edlinuxg...@gmail.comwrote:
...

 But being able to generate split info for them and processing them
 would be good as well. I remember that was a hot thing to do with lzo
 back in the day. The pain of once overing the gz files to generate the
 split info is detracting but it is nice to know it is there if you
 want it.


Note that the solution I created (HADOOP-7076) does not require any
preprocessing.
It can split ANY gzipped file as-is.
The downside is that this effectively costs some additional performance
because the task has to decompress the first part of the file that is to be
discarded.

The other two ways of splitting gzipped files require either
- creating some kind of compression index before actually using the file
(HADOOP-6153), or
- creating a file in a format that is generated in such a way that it is
really a set of concatenated gzipped files (HADOOP-7909).

-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


Should splittable Gzip be a core hadoop feature?

2012-02-28 Thread Niels Basjes
Hi,

Some time ago I had an idea and implemented it.

Normally you can only run a single gzipped input file through a single
mapper and thus only on a single CPU core.
What I created makes it possible to process a Gzipped file in such a way
that it can run on several mappers in parallel.

I've put the javadoc I created on my homepage so you can read more about
the details.
http://howto.basjes.nl/hadoop/javadoc-for-skipseeksplittablegzipcodec

Now the question that was raised by one of the people reviewing this code
was: Should this implementation be part of the core Hadoop feature set?
The main reason that was given is that this needs a bit more understanding
of what is happening and as such cannot be enabled by default.

I would like to hear from the Hadoop Core/Map reduce users what you think.

Should this be
- a part of the default Hadoop feature set so that anyone can simply enable
it by setting the right configuration?
- a separate library?
- a nice idea I had fun building but that no one needs?
- ... ?

-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


Merge sorting reduce output files

2012-02-28 Thread Niels Basjes
Hi,

We have a job that outputs a set of files that are several hundred MB of
text each.
Using the comparators and such we can produce output files that are each
sorted by themselves.

What we want is to have one giant outputfile (outside of the cluster) that
is sorted.

Now we see the following options:
1) Run the last job with 1 reducer. This is not really an option because
that would put a significant part of the processing time through 1 cpu
(this would take too long).
2) Create an additional job that sorts the existing files and has 1 reducer.
3) Download all of the files and run the standard commandline tool sort
-m
4) Install HDFS fuse and run the standard commandline tool sort -m
5) Create an hadoop specific tool that can do hadoop fs -text and sort
-m in one go.

During our discussion we were wondering: What is the best way of doing this?
What do you recommend?

-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


Re: Merge sorting reduce output files

2012-02-28 Thread Niels Basjes
Hi Robert,

On Tue, Feb 28, 2012 at 21:41, Robert Evans ev...@yahoo-inc.com wrote:

  I would recommend that you do what terrasort does and use a different
 partitioner, to ensure that all keys within a given range will go to a
 single reducer.  If your partitioner is set up correctly then all you have
 to do is to concatenate the files together, if you even need to do that.

 Look at TotalOrderPartitioner.  It should do what you want.


I know about that partitioner.
The trouble I have is coming up with a partitioning that evenly balances
the data for this specific problem.
Taking a sample and basing the partitioning on that (like the one used in
TeraSort) wouldn't help.
The data has a special distribution...


Niels Basjes




 --Bobby Evans


 On 2/28/12 2:10 PM, Niels Basjes ni...@basjes.nl wrote:

 Hi,

 We have a job that outputs a set of files that are several hundred MB of
 text each.
 Using the comparators and such we can produce output files that are each
 sorted by themselves.

 What we want is to have one giant outputfile (outside of the cluster) that
 is sorted.

 Now we see the following options:
 1) Run the last job with 1 reducer. This is not really an option because
 that would put a significant part of the processing time through 1 cpu
 (this would take too long).
 2) Create an additional job that sorts the existing files and has 1
 reducer.
 3) Download all of the files and run the standard commandline tool sort
 -m
 4) Install HDFS fuse and run the standard commandline tool sort -m
 5) Create an hadoop specific tool that can do hadoop fs -text and sort
 -m in one go.

 During our discussion we were wondering: What is the best way of doing
 this?
 What do you recommend?




-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


Re: How to Create an effective chained MapReduce program.

2011-09-05 Thread Niels Basjes
Hi,

In the past I've had the same situation where I needed the data for
debugging. Back then I chose to create a second job with simply
SequenceFileInputFormat, IdentityMapper, IdentityReducer and finally
TextOutputFormat.

In my situation that worked great for my purpose.
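
With the current (new) API the identity behaviour comes from the base
Mapper/Reducer classes, so such a debug job is little more than wiring up the
formats; a rough sketch with made-up paths and types:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Reads an intermediate SequenceFile and writes the same records out as plain text.
public class DumpSequenceFileAsText {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "dump-sequencefile-as-text");
        job.setJarByClass(DumpSequenceFileAsText.class);
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        // No mapper/reducer set: the default (identity) Mapper and Reducer are used.
        // Set these to the key/value types actually stored in the SequenceFile.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

For a quick look without an extra job, hadoop fs -text on the SequenceFile
prints much the same thing.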

-- 
Met vriendelijke groet,
Niels Basjes

On 6 Sep 2011 01:54, ilyal levin nipponil...@gmail.com wrote:

 o.k , so now i'm using SequenceFileInputFormat
and SequenceFileOutputFormat and it works fine but the output of the reducer
is
 now a binary file (not txt) so i can't understand the data. how can i
solve this? i need the data (in txt form ) of the Intermediate stages in the
chain.

 Thanks


 On Tue, Sep 6, 2011 at 1:33 AM, ilyal levin nipponil...@gmail.com wrote:

 Thanks for the help.


 On Mon, Sep 5, 2011 at 10:50 PM, Roger Chen rogc...@ucdavis.edu wrote:

 The binary file will allow you to pass the output from the first reducer
to the second mapper. For example, if you outputed Text, IntWritable from
the first one in SequenceFileOutputFormat, then you are able to retrieve
Text, IntWritable input at the head of the second mapper. The idea of
chaining is that you know what kind of output the first reducer is going to
give already, and that you want to perform some secondary operation on it.

 One last thing on chaining jobs: it's often worth looking to see if you
can consolidate all of your separate map and reduce tasks into a single
map/reduce operation. There are many situations where it is more intuitive
to write a number of map/reduce operations and chain them together, but more
efficient to have just a single operation.



 On Mon, Sep 5, 2011 at 12:21 PM, ilyal levin nipponil...@gmail.com
wrote:

 Thanks for the reply.
 I tried it but it creates a binary file which i can not understand (i
need the result of the first job).
 The other thing is how can i use this file in the next chained mapper?
i.e how can i retrieve the keys and the values in the map function?


 Ilyal


 On Mon, Sep 5, 2011 at 7:41 PM, Joey Echeverria j...@cloudera.com
wrote:

 Have you tried SequenceFileOutputFormat and SequenceFileInputFormat?

 -Joey

 On Mon, Sep 5, 2011 at 11:49 AM, ilyal levin nipponil...@gmail.com
wrote:
  Hi
  I'm trying to write a chained mapreduce program. i'm doing so with a
simple
  loop where in each iteration i
  create a job ,execute it and every time the current job's output is
the next
  job's input.
  how can i configure the outputFormat of the current job and the
inputFormat
  of the next job so that
  i will not use the TextInputFormat (TextOutputFormat), because if i
do use
  it, i need to parse the input file in the Map function?
  i.e if possible i want the next job to consider the input file as
  key,value and not plain Text.
  Thanks a lot.
 
 
 



 --
 Joseph Echeverria
 Cloudera, Inc.
 443.305.9434





 --
 Roger Chen
 UC Davis Genome Center





Re: Excuting a shell script inside the HDFS

2011-08-16 Thread Niels Basjes
Yes, that way it could work.
I'm just wondering ... Why would you want to have a script like this in
HDFS?

Met vriendelijk groet,

Niels Basjes
On 16 Aug 2011 06:49, Friso van Vollenhoven
fvanvollenho...@xebia.com wrote:

 hadoop fs -cat /path/on/hdfs/script.sh | bash

 Should work (if you use bash).

 Does anyone have access to HDFS or do you use HDFS permissions? Executing
arbitrary stuff that comes out of a network filesystem is not always a good
idea from a security perspective.


 Friso


 On 15 aug. 2011, at 23:51, Kadu canGica Eduardo wrote:

  Hi,
 
  i'm trying without success to execute a shell script that is inside the
HDFS. Is this possible? If it is, how can i do this?
 
  Since ever, thanks.
  Carlos.



Re: How to select random n records using mapreduce ?

2011-06-27 Thread Niels Basjes
The only solution I can think of is by creating a counter in Hadoop
that is incremented each time a mapper lets a record through.
As soon as the value reaches a preselected value the mappers simply
discard the additional input they receive.

Note that this will not at all be random, yet it's the best I can
come up with right now.

HTH

On Mon, Jun 27, 2011 at 09:11, Jeff Zhang zjf...@gmail.com wrote:

 Hi all,
 I'd like to select random N records from a large amount of data using
 hadoop, just wonder how can I archive this ? Currently my idea is that let
 each mapper task select N / mapper_number records. Does anyone has such
 experience ?

 --
 Best Regards

 Jeff Zhang




-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


Re: AW: How to split a big file in HDFS by size

2011-06-21 Thread Niels Basjes
Hi,

On Tue, Jun 21, 2011 at 16:14, Mapred Learn mapred.le...@gmail.com wrote:
 The problem is when 1 text file goes on HDFS as 60 GB file, one mapper takes
 more than an hour to convert it to sequence file and finally fails.

 I was thinking how to split it from the client box before uploading to HDFS.

Have a look at this:

http://stackoverflow.com/questions/3960651/splitting-gzipped-logfiles-without-storing-the-ungzipped-splits-on-disk


 If I read file and split it with filestream. Read() based on size, it takes
 2 hours to process 1 60 gb file and upload on HDFS as 120 500 mb files.
 Sent from my iPhone
 On Jun 21, 2011, at 2:57 AM, Evert Lammerts evert.lamme...@sara.nl wrote:

 What we did was on not-hadoop hardware. We streamed the file from a storage
 cluster to a single machine and cut it up while streaming the pieces back to
 the storage cluster. That will probably not work for you, unless you have
 the hardware for it. But then still its inefficient.



 You should be able to unzip your file in a MR job. If you still want to use
 compression you can install LZO and rezip the file from within the same job.
 (LZO uses block-compression, which allows Hadoop to process all blocks in
 parallel.) Note that you’ll need enough storage capacity. I don’t have
 example code, but I’m guessing Google can help.







 From: Mapred Learn [mailto:mapred.le...@gmail.com]
 Sent: maandag 20 juni 2011 18:09
 To: Niels Basjes; Evert Lammerts
 Subject: Re: AW: How to split a big file in HDFS by size



 Thanks for sharing.



 Could you guys share how are you divinding your 2.7 TB into 10 Gb file each
 on HDFS ? That wud be helpful for me !







 On Mon, Jun 20, 2011 at 8:39 AM, Marcos Ortiz mlor...@uci.cu wrote:

 Evert Lammerts at Sara.nl did something seemed to your problem, spliting a
 big 2.7 TB file to chunks of 10 GB.
 This work was presented on the BioAssist Programmers' Day on January of this
 year and its name was
 Large-Scale Data Storage and Processing for Scientist in The Netherlands

 http://www.slideshare.net/evertlammerts

 P.D: I sent the message with a copy to him

 On 6/20/2011 10:38 AM, Niels Basjes wrote:



 Hi,

 On Mon, Jun 20, 2011 at 16:13, Mapred Learnmapred.le...@gmail.com  wrote:


 But this file is a gzipped text file. In this case, it will only go to 1
 mapper than the case if it was
 split into 60 1 GB files which will make map-red job finish earlier than one
 60 GB file as it will
 Hv 60 mappers running in parallel. Isn't it so ?


 Yes, that is very true.



 --

 Marcos Luís Ortíz Valmaseda
  Software Engineer (UCI)
  http://marcosluis2186.posterous.com
  http://twitter.com/marcosluis2186






-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


Re: AW: How to split a big file in HDFS by size

2011-06-20 Thread Niels Basjes
Hi,

On Mon, Jun 20, 2011 at 16:13, Mapred Learn mapred.le...@gmail.com wrote:
 But this file is a gzipped text file. In this case, it will only go to 1 
 mapper than the case if it was
 split into 60 1 GB files which will make map-red job finish earlier than one 
 60 GB file as it will
 Hv 60 mappers running in parallel. Isn't it so ?

Yes, that is very true.

-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


Re: Using df instead of du to calculate datanode space

2011-05-21 Thread Niels Basjes
Hi,

Although I like the thought of doing things smarter I'm never ever
going to change core Unix/Linux applications for the sake of a
specific application. Linux handles scripts and binaries completely
differently with regards to security. So how do you know for sure (I
mean 100% sure, not just 99.% sure) that you haven't broken
any other functionality needed to keep your system sane?

Why don't you issue a feature request so this needless disk io can
be fixed as part of the base code of Hadoop (instead of breaking the
underlying OS)?

Niels

2011/5/21 Edward Capriolo edlinuxg...@gmail.com:
 Good job. I brought this up an another thread, but was told it was not a
 problem. Good thing I'm not crazy.

 On Sat, May 21, 2011 at 12:42 AM, Joe Stein
 charmal...@allthingshadoop.comwrote:

 I came up with a nice little hack to trick hadoop into calculating disk
 usage with df instead of du


 http://allthingshadoop.com/2011/05/20/faster-datanodes-with-less-wait-io-using-df-instead-of-du/

 I am running this in production, works like a charm and already
 seeing benefit, woot!

 I hope it works well for others too.

 /*
 Joe Stein
 http://www.twitter.com/allthingshadoop
 */





-- 
Met vriendelijke groeten,

Niels Basjes


Including external libraries in my job.

2011-05-03 Thread Niels Basjes
Hi,

I've written my first very simple job that does something with hbase.

Now when I try to submit my jar in my cluster I get this:

[nbasjes@master ~/src/catalogloader/run]$ hadoop jar
catalogloader-1.0-SNAPSHOT.jar nl.basjes.catalogloader.Loader
/user/nbasjes/Minicatalog.xml
Exception in thread main java.lang.NoClassDefFoundError:
org/apache/hadoop/hbase/HBaseConfiguration
at nl.basjes.catalogloader.Loader.main(Loader.java:156)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
...

I've found this blog post that promises help
http://www.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/

Quote:
1. Include the JAR in the “-libjars” command line option of the
`hadoop jar …` command. The jar will be placed in distributed cache
and will be made available to all of the job’s task attempts. 

However one of the comments states:
Unfortunately, method 1 only work before 0.18, it doesn’t work in 0.20.

Indeed, I can't get it to work this way.

I've tried something as simple as:
export 
HADOOP_CLASSPATH=/usr/lib/hbase/hbase-0.90.1-cdh3u0.jar:/usr/lib/zookeeper/zookeeper-3.3.3-cdh3u0.jar
and then run the job but that (as expected) simply means the tasks on
the processing nodes fail with a similar error:
java.lang.RuntimeException: java.lang.ClassNotFoundException:
org.apache.hadoop.hbase.mapreduce.TableOutputFormat
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:996)
at 
org.apache.hadoop.mapreduce.JobContext.getOutputFormatClass(JobContext.java:248)
at org.apache.hadoop.mapred.Task.initialize(Task.java:486)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
...

So what is the correct way of doing this?
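
One detail that often bites here: -libjars is only honoured when the driver
parses the generic options, i.e. when it runs through
ToolRunner/GenericOptionsParser. A minimal driver skeleton (job-building code
omitted):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// With this skeleton, "hadoop jar myjob.jar my.Loader -libjars hbase.jar,zookeeper.jar ..."
// ships the extra jars via the distributed cache and puts them on the task classpath.
public class Loader extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();   // already contains the -libjars handling
        // ... build and submit the job here ...
        return 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new Loader(), args));
    }
}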

-- 
Met vriendelijke groeten,

Niels Basjes


Unsplittable files on HDFS

2011-04-27 Thread Niels Basjes
Hi,

In some scenarios you have gzipped files as input for your map reduce
job (apache logfiles is a common example).
Now some of those files are several hundred megabytes and as such will
be split by HDFS in several blocks.

When looking at a real 116MiB file on HDFS I see this (4 nodes, replication = 2)

Total number of blocks: 2
25063947863662497:   10.10.138.62:50010 10.10.138.61:50010
1014249434553595747:   10.10.138.64:50010   10.10.138.63:50010

As you can see the file has been distributed over all 4 nodes.

When actually reading those files they are unsplittable due to the
nature of the Gzip codec.
So a job will (in the above example) ALWAYS need to pull the other
half of the file over the network, if a file is bigger and the
cluster is bigger then the percentage of the file that goes over the
network will probably increase.

Now if I can tell HDFS that a .gz file should always be 100% local
for the node that will be doing the processing this would reduce the
network IO during the job dramatically.
Especially if you want to run several jobs against the same input.

So my question is: Is there a way to force/tell HDFS to make sure that
a datanode that has blocks of this file must always have ALL blocks of
this file?

-- 
Best regards,

Niels Basjes


Re: Unsplittable files on HDFS

2011-04-27 Thread Niels Basjes
Hi,

I did the following with a 1.6GB file
   hadoop fs -Ddfs.block.size=2147483648 -put
/home/nbasjes/access-2010-11-29.log.gz /user/nbasjes
and I got

Total number of blocks: 1
4189183682512190568:10.10.138.61:50010  
10.10.138.62:50010

Yes, that does the trick. Thank you.

Niels

2011/4/27 Harsh J ha...@cloudera.com:
 Hey Niels,

 The block size is a per-file property. Would putting/creating these
 gzip files on the DFS with a very high block size (such that it
 doesn't split across for such files) be a valid solution to your
 problem here?

 On Wed, Apr 27, 2011 at 1:25 PM, Niels Basjes ni...@basjes.nl wrote:
 Hi,

 In some scenarios you have gzipped files as input for your map reduce
 job (apache logfiles is a common example).
 Now some of those files are several hundred megabytes and as such will
 be split by HDFS in several blocks.

 When looking at a real 116MiB file on HDFS I see this (4 nodes, replication 
 = 2)

 Total number of blocks: 2
 25063947863662497:           10.10.138.62:50010         10.10.138.61:50010
 1014249434553595747:   10.10.138.64:50010               10.10.138.63:50010

 As you can see the file has been distributed over all 4 nodes.

 When actually reading those files they are unsplittable due to the
 nature of the Gzip codec.
 So a job will (in the above example) ALWAYS need to pull the other
 half of the file over the network, if a file is bigger and the
 cluster is bigger then the percentage of the file that goes over the
 network will probably increase.

 Now if I can tell HDFS that a .gz file should always be 100% local
 for the node that will be doing the processing this would reduce the
 network IO during the job dramatically.
 Especially if you want to run several jobs against the same input.

 So my question is: Is there a way to force/tell HDFS to make sure that
 a datanode that has blocks of this file must always have ALL blocks of
 this file?

 --
 Best regards,

 Niels Basjes




 --
 Harsh J




-- 
Met vriendelijke groeten,

Niels Basjes


Re: hadoop mr cluster mode on my laptop?

2011-04-18 Thread Niels Basjes
Hi,

You should be doing the setup for what is called Pseudo-distributed mode.
Have a look at this:
http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html#PseudoDistributed

Niels

2011/4/18 Pedro Costa psdc1...@gmail.com:
 Hi,

 1 - I would like to run Hadoop MR in my laptop, but with the cluster
 mode configuration. I've put a slave file with the following content:
 [code]
 127.0.0.1
 localhost
 192.168.10.1
 mylaptop
 [/code]


 the 192.168.10.1 is the IP of the machine and the mylaptop is the
 logical name of the address. Is this a well configured slaves file to
 do what I want?

 2 - I was also thinking taking advantage of my CPU to put Hadoop MR as
 cluster mode.
 This is my CPU characteristics:

 Essentials
 Processor Number i5-460M
 # of Cores              2
 # of Threads    4
 Clock Speed      2.53 GHz
 Max Turbo Frequency 2.8 GHz
 Intel® Smart Cache 3 MB
 Instruction Set   64-bit

 I was thinking in running a Tasktracker in each Core, although I don't
 know how to do it.

 Any help to install Hadoop MR in cluster mode on my laptop?

 Thanks,
 --
 Pedro




-- 
Met vriendelijke groeten,

Niels Basjes


Re: Small linux distros to run hadoop ?

2011-04-15 Thread Niels Basjes
Hi,

2011/4/15 web service wbs...@gmail.com:
    what is the smallest linux system/distros to run hadoop ? I would want to
 run small linux vms and run jobs on them.

I usually use a fully stripped CentOS 5 to run cluster nodes. Works
perfectly and can be fully automated using the kickstart scripting for
anaconda (= Redhat installer).

-- 
Met vriendelijke groeten,

Niels Basjes


Re: Re-generate datanode storageID?

2011-03-24 Thread Niels Basjes
Hi,

To solve that simply do the following on the problematic nodes:
1) Stop the datanode (probably not running)
2) Remove everything inside the .../cache/hdfs/
3) Start the datanode again.

Note: With Cloudera always use the 'service' command to stop/start the Hadoop software!
service hadoop-0.20-datanode stop

2011/3/24 Marc Leavitt leavitt...@gmail.com:
 I am setting up a (very) small Hadoop/CDH3 beta 4 cluster in virtual machines 
 to do some initial feasibility work.  I proceeded by progressing through the 
 Cloudera documentation standalone - pseudo-cluster - cluster with a single 
 VM and then, when I had it stable(-ish) I copied the VM to a couple of slave 
 images.

 The good news is that all three tasktrackers show up.
 The bad news is that only one datanode shows up.

 And, after after some research, I am pretty sure I know why - each of the 
 datanodes is trying to claim the same storageID (as defined in 
 .../cache/hdfs/dfs/data/current/VERSION.

 So, my question is How do I resolve the collision of the storageIDs?

 Thanks!
 -mgl








-- 
Met vriendelijke groeten,

Niels Basjes


Re: TextInputFormat and Gzip encoding - wordcount displaying binary data

2011-03-21 Thread Niels Basjes
Hi,

2011/3/21 Saptarshi Guha saptarshi.g...@gmail.com:
 It's frustrating to be dealing with these simple problems (and I know
 the fault is mine, i'm missing something).
 I'm running word count (from 0.20-2) on a gzip file (very small), the
 output has binary characters.
 When I run the same on the ungzipped file, the output is correct ascii.

 I'm using the native gzip library. The command is

  hadoop jar /usr/lib/hadoop-0.20/hadoop-examples-0.20.2-CDH3B4.jar
 wordcount /user/sguha/tmp/o.zip /user/sguha/tmp/o.wc.zip

 (zip is gzip)

No, .zip is pkzip and .gz is gzip.

The applicable Hadoop code actually chooses the decompressor based on the
extension of the filename.

-- 
Niels Basjes


Re: File formats in Hadoop

2011-03-20 Thread Niels Basjes
And then there is the matter of how you put the data in the file. I've
heard that some people write the data as protocolbuffers into the
sequence file.

2011/3/19 Harsh J qwertyman...@gmail.com:
 Hello,

 On Sat, Mar 19, 2011 at 9:31 PM, Weishung Chung weish...@gmail.com wrote:
 I am browsing through the hadoop.io package and was wondering what other
 file formats are available in hadoop other than SequenceFile and TFile?

 Additionally, on Hadoop, there're MapFiles/SetFiles (Derivative of
 SequenceFiles, if you need maps/sets), and IFiles (Used by the
 map-output buffers to produce a key-value file for Reducers to use,
 internal use only).

 Apache Hive use RCFiles, which is very interesting too. Apache Avro
 provides Avro-Datafiles that are designed for use with Hadoop
 Map/Reduce + Avro-serialized data.

 I'm not sure of this one, but Pig probably was implementing a
 table-file-like solution of their own a while ago. Howl?

 --
 Harsh J
 http://harshj.com




-- 
Met vriendelijke groeten,

Niels Basjes


Re: Efficiently partition broadly distributed keys

2011-03-10 Thread Niels Basjes
If I understand your problem correctly you actually need some way of
knowing if you need to chop a large set with a specific key into
subsets.
In mapreduce the map only has information about a single key at a
time. So you need something extra.

One way of handling this is to start by doing a statistical run first
and use the output to determine which keys can be chopped.

So first do a MR to determine per key how many records and/or bytes
need to be handled.
In the reducer you have a lower limit that ensures the reducer only
outputs the keys that need to be chopped.
In the second pass you do your actual algorithm with one twist: In the
mapper you use the output of the first run to determine if the key
needs to be rewritten and in how many variants. You then randomly (!)
choose a variant.

Example of what I mean:
Assume you have 100 records with key A and 10000 records with key B.
You determine that you want key groups of an approximate maximum of
1000 records.
The first pass outputs that key B must be split into 10 variants (=10000/1000).
Then the second MAP will transform a key B into one of these randomly:
 B-1, B-2, B-3 ...
A record with key A will remain unchanged.

Disadvantages:
   Each run will show different results.
   It only works if the set of keys that needs to be chopped is small
enough so you can have it in memory in the call to the second map.
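
A rough sketch of the second-pass mapper described above, assuming the first
(statistics) pass produced a small map of "hot" keys to their number of
variants (loading that map from a side file or the distributed cache is
omitted here):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Random;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class KeySaltingMapper extends Mapper<Text, Text, Text, Text> {
    // key -> number of variants, produced by the statistics job (loading omitted).
    private final Map<String, Integer> variantsPerKey = new HashMap<String, Integer>();
    private final Random random = new Random();
    private final Text saltedKey = new Text();

    @Override
    protected void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        Integer variants = variantsPerKey.get(key.toString());
        if (variants == null || variants <= 1) {
            context.write(key, value);                 // normal key: unchanged
        } else {
            // Hot key: append a random variant number so the load is spread.
            saltedKey.set(key.toString() + "-" + (1 + random.nextInt(variants)));
            context.write(saltedKey, value);
        }
    }
}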

HTH

Niels Basjes


2011/3/10 Luca Aiello alu...@yahoo-inc.com:
 Dear users,

 hope this is the right list to submit this one, otherwise I apologize.

 I'd like to have your opinion about a problem that I'm facing on MapReduce 
 framework. I am writing my code in Java and running on a grid.

 I have a textual input structured in key, value pairs. My task is to make 
 the cartesian product of all the values that have the same key.
 I can do so it simply using key as my map key, so that every value with the 
 same key is put in the same reducer, where I can easily process them and 
 obtain the cartesian.

 However, my keys are not uniformly distributed, the distribution is very 
 broad. As a result of this, the output of my reducers will be very unbalanced 
 and I will have many small files (some KB) and a bunch of huge files (tens of 
 GB). A sub-optimal yet acceptable approximation of my task would be to make 
 the cartesian product of smaller chunks of values for very frequent keys, so 
 that the load is distributed evenly among reducers. I am wondering how can I 
 do this in the most efficient/elegant way.

 It appears to me that using a customized Partitioner is not the right way to 
 act, since records with the same key have still to be mapped together (am I 
 right?).
 The only solution that comes into my mind is to split the key space 
 artificially insider the mapper (e.g., for a frequent key ABC, map the 
 values on the reducers using keys like ABC1, ABC2, an so on). This would 
 require an additional post-processing cleanup phase to retrieve the original 
 keys.

 Do you know a better, possibly automatic way to perform this task?

 Thank you!
 Best

 Luca





-- 
Met vriendelijke groeten,

Niels Basjes


Re: Efficiently partition broadly distributed keys

2011-03-10 Thread Niels Basjes
Hi Luca,

2011/3/10 Luca Aiello alu...@yahoo-inc.com:

 thanks for the quick response. So in your opinion there is nothing like a 
 hadoop embedded tool to do this. This is what I suspected indeed.

The mapreduce model simply uses the key as the pivot of the
processing. In your application-specific situation you're looking for a
way to break that model in a smart way. So no, Hadoop doesn't do
that.

 Since the key, value pairs in the initial dataset are randomly partitioned 
 in the input files,
 I suppose that I can avoid the initial statistic step and do statistics on 
 the fly.
 For example, if I have 3 keys A,B,C with probability 0.7, 0.2, and 0.1, every 
 mapper will receive,
 on average, a input composed by 70% of A keys, 20% of Bs and 10% of Cs
 and so it will know that the key,value to be chunked are de As.

If you know this in advance you can surely put that into the
mapper. But that simply means you've done the statistics in advance in
some way.
I fail to see how you could do this on the fly without prior knowledge.

As a slightly smarter way of doing something like this, have a look at
the TeraSort example.
There a sample of the data is used to estimate the distribution.

 This can introduce some errors but it should produce a output which is quite 
 uniformly distributed.

 Thanks again!

You're welcome.

Niels


 On Mar 10, 2011, at 12:23 PM, Niels Basjes wrote:

 If I understand your problem correctly you actually need some way of
 knowing if you need to chop a large set with a specific key in to
 subsets.
 In mapreduce the map only has information about a single key at a
 time. So you need something extra.

 One way of handling this is to start by doing a statistical run first
 and use the output to determine which keys can be chopped.

 So first do a MR to determine per key how many records and/or bytes
 need to be handled.
 In the reducer you have a lower limit that ensures the reducer only
 outputs the keys that need to be chopped.
 In the second pass you do your actual algorithm with one twist: In the
 mapper you use the output of the first run to determine if the key
 needs to be rewritten and in how many variants. You then randomly (!)
 choose a variant.

 Example of what i mean:
 Assume you have 100 records with key A and 1 records with key B.
 You determine that you want key groups of an approximate maximum of
 1000 records.
 The first pass outputs that key B must be split into 10 variants 
 (=1/1000).
 Then the second MAP will transform a key B into one of these randomly:
 B-1, B-2, B-3 ...
 A record with key A will remain unchanged.

 Disadvantage:
   Each run will show different results.
   Only works if the set of keys that needs to be chopped is small
 enough so you can have it in memory in the call to the second map.

 HTH

 Niels Basjes


 2011/3/10 Luca Aiello alu...@yahoo-inc.com:
 Dear users,

 hope this is the right list to submit this one, otherwise I apologize.

 I'd like to have your opinion about a problem that I'm facing on MapReduce 
 framework. I am writing my code in Java and running on a grid.

 I have a textual input structured in key, value pairs. My task is to make 
 the cartesian product of all the values that have the same key.
 I can do so it simply using key as my map key, so that every value with 
 the same key is put in the same reducer, where I can easily process them 
 and obtain the cartesian.

 However, my keys are not uniformly distributed, the distribution is very 
 broad. As a result of this, the output of my reducers will be very 
 unbalanced and I will have many small files (some KB) and a bunch of huge 
 files (tens of GB). A sub-optimal yet acceptable approximation of my task 
 would be to make the cartesian product of smaller chunks of values for very 
 frequent keys, so that the load is distributed evenly among reducers. I am 
 wondering how can I do this in the most efficient/elegant way.

 It appears to me that using a customized Partitioner is not the right way 
 to act, since records with the same key have still to be mapped together 
 (am I right?).
 The only solution that comes into my mind is to split the key space 
 artificially insider the mapper (e.g., for a frequent key ABC, map the 
 values on the reducers using keys like ABC1, ABC2, an so on). This 
 would require an additional post-processing cleanup phase to retrieve the 
 original keys.

 Do you know a better, possibly automatic way to perform this task?

 Thank you!
 Best

 Luca





 --
 Met vriendelijke groeten,

 Niels Basjes





-- 
Met vriendelijke groeten,

Niels Basjes


Re: Comparison between Gzip and LZO

2011-03-02 Thread Niels Basjes
Question: Are you 100% sure that nothing else was running on that
system during the tests?
No cron jobs, no makewhatis or updatedb?

P.S. There is a permission issue with downloading one of the files.

2011/3/2 José Vinícius Pimenta Coletto jvcole...@gmail.com:
 Hi,

 I'm making a comparison between the following compression methods: gzip
 and lzo provided by Hadoop and gzip from package java.util.zip.
 The test consists of compression and decompression of approximately 92,000
 files with an average size of 2kb, however the decompression time of lzo is
 twice the decompression time of gzip provided by Hadoop, it does not seem
 right.
 The results obtained in the test are:

 Method               |   Bytes   | Compression: total time (with i/o) / time / speed | Decompression: total time (with i/o) / time / speed
 Gzip (Hadoop)        | 200876304 | 121.454s / 43.167s /  4,653,424.079 B/s           | 332.305s / 111.806s /  1,796,635.326 B/s
 Lzo                  | 200876304 | 120.564s / 54.072s /  3,714,914.621 B/s           | 509.371s / 184.906s /  1,086,368.904 B/s
 Gzip (java.util.zip) | 200876304 | 148.014s / 63.414s /  3,167,647.371 B/s           | 483.148s /   4.528s / 44,360,682.244 B/s

 You can see the code I'm using to the test here:
 http://www.linux.ime.usp.br/~jvcoletto/compression/

 Can anyone explain me why am I getting these results?
 Thanks.




-- 
Met vriendelijke groeten,

Niels Basjes


Re: How to make a CGI with HBase?

2011-03-01 Thread Niels Basjes
Hi,

Is there a very basic example of such a KISS interface to an HBase table?

Thanks
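
For reference, the "write straight to it" route Michael mentions below looks
roughly like this with the 0.90-era Java client (table, family and column
names are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseKissExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "articles");

        // Write one cell.
        Put put = new Put(Bytes.toBytes("row-1"));
        put.add(Bytes.toBytes("content"), Bytes.toBytes("title"),
                Bytes.toBytes("Hello HBase"));
        table.put(put);

        // Read it back.
        Result result = table.get(new Get(Bytes.toBytes("row-1")));
        System.out.println(Bytes.toString(
            result.getValue(Bytes.toBytes("content"), Bytes.toBytes("title"))));

        table.close();
    }
}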

2011/3/1 Michael Segel michael_se...@hotmail.com:

 I'd say skip trying to shoe horn JDBC interface on HBase, and just write 
 straight to it.
 Remember K.I.S.S.

 -Mike


 Subject: Re: How to make a CGI with HBase?
 From: goks...@gmail.com
 Date: Mon, 28 Feb 2011 21:06:52 -0800
 CC: goks...@gmail.com
 To: common-user@hadoop.apache.org

 Java servlets for web development with databases is a learning curve.  HBase 
 is a learning curve. You might want to learn one at a time :)  If you code 
 the site in Java/JSPs, you want a JDBC driver for HBase. You can code calls 
 to HBase directly in a JSP without writing Java.

 Lance


 On Feb 28, 2011, at 6:19 PM, edward choi wrote:

  Thanks for the reply.
  Didn't know that Thrift was for such purpose.
  Servlet and JSP is totally new to me. I skimmed through the concept on the 
  internet and they look fascinating.
  I think I am gonna give servlet and JSP a try.
 
  On 2011. 3. 1., at 오전 12:51, Usman Waheed usm...@opera.com wrote:
  HI,
 
  I have been using the Thrift Perl API to connect to Hbase for my web app. 
  At the moment i only perform random reads and scans based on date ranges 
  and some other search criteria.
  It works and I am still testing performance.
 
  -Usman
 
  On Mon, Feb 28, 2011 at 2:37 PM, edward choi mp2...@gmail.com wrote:
  Hi,
 
  I am planning to make a search engine for news articles. It will 
  probably
  have over several billions of news articles so I thought HBase is the 
  way to
  go.
 
  However, I am very new to CGI. All I know is that you use php, python or
  java script with HTML to make a web site and communicate with the 
  backend
  database such as MySQL.
 
  But I am going to use HBase, not MySQL, and I can't seem to find a 
  script
  language that provides any form of API to communicate with HBase.
 
  So what do I do?
 
  Do I have to make a web site with pure Java? Is that even possible?
 
  It is possible, if you know things like JSP, Java servlets etc.
 
  For people comfortable with PHP or Python, I think Apache Thrift
  (http://wiki.apache.org/thrift/) is an alternative.
 
  -b
 
 
  Or is using Hbase as the backend Database a bad idea in the first place?
 
 
 
  --
  Using Opera's revolutionary email client: http://www.opera.com/mail/





-- 
Met vriendelijke groeten,

Niels Basjes


Re: When use hadoop mapreduce?

2011-02-18 Thread Niels Basjes
Hi,

2011/2/17 Pedro Costa psdc1...@gmail.com:
 I like to know, depending on my problem, when should I use or not use
 Hadoop MapReduce? Does exist any list that advices me to use or not to
 use MapReduce?

The summary I usually give goes something like this:
IF your computation takes too long on a single system AND you can
split the work up into a lot of smaller pieces AND the work is batch
oriented AND what you are doing involves a lot of data THEN you should
consider trying it out.
There are other systems for scalable computing (for example Gridgain)
that may be better in other situations.

HTH

-- 
Met vriendelijke groeten,

Niels Basjes


Re: Hadoop in Real time applications

2011-02-17 Thread Niels Basjes
2011/2/17 Karthik Kumar karthik84ku...@gmail.com:
 Can Hadoop be used for Real time Applications such as banking solutions...

Hadoop consists of several components.
Components like HDFS and HBase are quite suitable for interactive
solutions (as in: "I usually get an answer within 0.x seconds").
If you really need realtime (as in: "I want a guarantee that I have
an answer within 0.x seconds") then the answer is: no, HDFS/HBase cannot
guarantee that.
Other components like MapReduce (and Hive, which runs on top of
MapReduce) are purely batch oriented.

-- 
Met vriendelijke groeten,

Niels Basjes


Re: Is a Block compressed (GZIP) SequenceFile splittable in MR operation?

2011-01-31 Thread Niels Basjes
Hi,

2011/1/31 Sean Bigdatafun sean.bigdata...@gmail.com:
 GZIP is not splittable.

Correct, gzip is a stream compression format, which effectively means
you can only start decompressing at the beginning of the data.

 Does that mean a GZIP block compressed sequencefile can't take advantage of 
 MR parallelism?

AFAIK a block compressed SequenceFile should be splittable at the boundaries
of the compressed blocks (the sync markers between them), even though gzip
itself is not splittable.

 How to control the size of block to be compressed in SequenceFile?

Can't help you with that one.
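
For what it's worth, below is a minimal, untested sketch of writing a block
compressed SequenceFile with gzip. As far as I know the
io.seqfile.compress.blocksize property is what sets how many bytes get
buffered before each block is compressed, but treat that (and the exact
default) as an assumption to verify.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.compress.CompressionCodec;
  import org.apache.hadoop.io.compress.GzipCodec;
  import org.apache.hadoop.util.ReflectionUtils;

  public class BlockCompressedSeqFile {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // Assumption: this property sets how many bytes are buffered before a
      // block is compressed (the default is in the order of 1 MB).
      conf.setInt("io.seqfile.compress.blocksize", 4 * 1024 * 1024);

      FileSystem fs = FileSystem.get(conf);
      Path out = new Path("/tmp/block-compressed.seq"); // made-up path
      CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);

      SequenceFile.Writer writer = SequenceFile.createWriter(
          fs, conf, out, LongWritable.class, Text.class,
          SequenceFile.CompressionType.BLOCK, codec);
      try {
        // Each buffered batch of records becomes one compressed block,
        // separated from the next by a sync marker.
        for (long i = 0; i < 1000; i++) {
          writer.append(new LongWritable(i), new Text("record " + i));
        }
      } finally {
        writer.close();
      }
    }
  }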

-- 
Met vriendelijke groeten,

Niels Basjes


Re: Restricting number of records from map output

2011-01-14 Thread Niels Basjes
Hi,

 I have a sort job consisting of only the Mapper (no Reducer) task. I want my
 results to contain only the top n records. Is there any way of restricting
 the number of records that are emitted by the Mappers?

 Basically I am looking to see if there is an equivalent of achieving
 the behavior similar to LIMIT in SQL queries.

I think I understand your goal. However, the question is aimed at (what I
think is) the wrong solution.

A mapper gets 1 record as input and only knows about that one record.
There is no way to apply a limit there.

If you implement a simple reducer you can very easily let it stop
reading the input iterator after N records and limit the output in
that way.

Doing it in the reducer also allows you to easily add a concept of
"Top N" by using the secondary sort trick to sort the input before
it arrives at the reducer.
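
To illustrate, a minimal sketch of such a limiting reducer (the
Text/LongWritable types and the limit of 10 are made up; it assumes a single
reducer, or that a per-reducer limit is acceptable):

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Reducer;

  // Emits at most LIMIT records in total and then stops writing output.
  public class LimitReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    private static final int LIMIT = 10; // the "N" in Top N
    private int emitted = 0;

    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
        throws IOException, InterruptedException {
      for (LongWritable value : values) {
        if (emitted >= LIMIT) {
          return; // stop reading and emitting once we have N records
        }
        context.write(key, value);
        emitted++;
      }
    }
  }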

HTH

Niels Basjes


Re: TeraSort question.

2011-01-11 Thread Niels Basjes
Raj,

Have a look at the graph shown here:
http://cs.smith.edu/dftwiki/index.php/Hadoop_Tutorial_1.1_--_Generating_Task_Timelines

It should make clear that the number of tasks varies greatly over the
lifetime of a job.
Depending on the nodes available this may leave nodes idle.

Niels

2011/1/11 Raj V rajv...@yahoo.com:
 Ted


 Thanks. I have all the graphs I need, including the map/reduce timeline and the
 system activity for all the nodes while the sort was running. I will publish
 them once I have them in some presentable format.

 For legal reasons, I really don't want to send the complete job history
 files.

 My question is still this: when running TeraSort, would the CPU, disk and
 network utilization of all the nodes be more or less similar, or completely
 different?

 Sometime during the day, I will post the system data from 5 nodes and that 
 would probably explain my question better.

 Raj
 From: Ted Dunning tdunn...@maprtech.com
 To: common-user@hadoop.apache.org; Raj V rajv...@yahoo.com
 Cc:
 Sent: Tuesday, January 11, 2011 8:22:17 AM
 Subject: Re: TeraSort question.

 Raj,

 Do you have the job history files?  That would be very useful.  I would be
 happy to create some swimlane and related graphs for you if you can send me
 the history files.

 On Mon, Jan 10, 2011 at 9:06 PM, Raj V rajv...@yahoo.com wrote:

 All,

 I have been running TeraSort on a 480 node Hadoop cluster. I have also
 collected CPU, memory, disk and network statistics during this run. The system
 stats are quite interesting. I can post them when I have put them together in
 some presentable format (if there is interest). However, while looking at
 the data, I noticed something interesting.

 I thought, intuitively, that all the systems in the cluster would show
 more or less similar behaviour (time translation was possible) and that the
 overall graphs would look the same.

 Just to confirm this I took 5 random nodes and looked at the CPU, disk,
 network etc. activity while the sort was running. Strangely enough, it was
 not so. Two of the 5 systems were seriously busy, with big I/O: lots of disk
 and network activity. On the other three systems the CPU was more or less 100%
 idle, with only slight network and disk activity.

 Is that normal and/or expected? Shouldn't all the nodes be utilized to more
 or less the same degree over the length of the run?

 I generated the data for the sort using teragen (128 MB block size,
 replication = 3).

 I would also be interested in other people's sort timings. Is there some
 place where people can post sort numbers (not just the record)?

 I will post the actual graphs of the 5 nodes, if there is interest,
 tomorrow. (There are some logistical issues about posting them tonight.)

 I am using CDH3B3, even though I think this is not specific to CDH3B3.

 Sorry for the cross post.

 Raj



-- 
Met vriendelijke groeten,

Niels Basjes


Re: Help: How to increase amont maptasks per job ?

2011-01-07 Thread Niels Basjes
You said you have a large amount of data.
How large is that approximately?
Did you compress the intermediate data (with what codec)?

Niels

2011/1/7 Tali K ncherr...@hotmail.com:

 According to the documentation, that parameter is for the number of
 tasks *per TaskTracker*. I am asking about the number of tasks
 for the entire job and entire cluster. That parameter is already
 set to 3, which is one less than the number of cores on each node's
 CPU, as recommended. In my question I stated that
 82 tasks were run for the first job, yet only 4 for the second -
 both numbers being cluster-wide.



 Date: Fri, 7 Jan 2011 13:19:42 -0800
 Subject: Re: Help: How to increase amont maptasks per job ?
 From: yuzhih...@gmail.com
 To: common-user@hadoop.apache.org

 Set higher values for mapred.tasktracker.map.tasks.maximum (and
 mapred.tasktracker.reduce.tasks.maximum) in mapred-site.xml

 On Fri, Jan 7, 2011 at 12:58 PM, Tali K ncherr...@hotmail.com wrote:

 
 
 
 
  We have a job which runs in several map/reduce stages. In the first job,
  a large number of map tasks (82) are initiated, as expected,
  and that causes all nodes to be used.
  In a later job, where we are still dealing with large amounts of
  data, only 4 map tasks are initiated, and that causes only 4 nodes to be used.
  This stage is actually the workhorse of the job, and requires much more
  processing power than the initial stage.
  We are trying to understand why only a few map tasks are
  being used, as we are not getting the full advantage of our cluster.
 
 
 
 




-- 
Met vriendelijke groeten,

Niels Basjes


Re: FILE_BYTES_WRITTEN and HDFS_BYTES_WRITTEN

2010-11-30 Thread Niels Basjes
For some parts of a task the system stores data on the local
(non-HDFS) file system of the node that is actually running the task;
that is counted in FILE_BYTES_WRITTEN. Data written to HDFS is counted
in HDFS_BYTES_WRITTEN.

HTH
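
If you want the same numbers programmatically after the job has finished,
something along these lines should work. The "FileSystemCounters" group name
and the findCounter(String, String) call are as I recall them; verify both
against your Hadoop version.

  import org.apache.hadoop.mapreduce.Counters;
  import org.apache.hadoop.mapreduce.Job;

  // Hedged sketch: print both counters after job.waitForCompletion(true).
  public class PrintBytesWritten {
    public static void print(Job job) throws Exception {
      Counters counters = job.getCounters();
      // Group/counter names are the ones shown in the job summary output.
      long localBytes =
          counters.findCounter("FileSystemCounters", "FILE_BYTES_WRITTEN").getValue();
      long hdfsBytes =
          counters.findCounter("FileSystemCounters", "HDFS_BYTES_WRITTEN").getValue();
      System.out.println("Local FS bytes written: " + localBytes);
      System.out.println("HDFS bytes written:     " + hdfsBytes);
    }
  }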

2010/11/30  psdc1...@gmail.com:
 When a Hadoop MapReduce example is executed, at the end of the run a table
 is shown with information about the execution, like the number
 of Map and Reduce tasks executed and the number of bytes read and written.

 This information contains 2 fields, FILE_BYTES_WRITTEN and
 HDFS_BYTES_WRITTEN. What's the difference between these 2 fields?

 Thanks,
 Pedro



-- 
Met vriendelijke groeten,

Niels Basjes


Re: Control the number of Mappers

2010-11-25 Thread Niels Basjes
Hi,

2010/11/25 Shai Erera ser...@gmail.com:
 Is there a way to make MapReduce create exactly N Mappers? More
 specifically, if say my data can be split to 200 Mappers, and I have only
 100 cores, how can I ensure only 100 Mappers will be created? The number of
 cores is not something I know in advance, so writing a special InputFormat
 might be tricky, unless I can query Hadoop for the available # of cores (in
 the entire cluster).

You can configure on a node by node basis how many map and reduce
tasks can be started by the task tracker on that node.
This is done via the conf/mapred-site.xml using these two settings:
mapred.tasktracker.{map|reduce}.tasks.maximum

Have a look at this page for more information
http://hadoop.apache.org/common/docs/current/cluster_setup.html

-- 
Met vriendelijke groeten,

Niels Basjes


Re: Control the number of Mappers

2010-11-25 Thread Niels Basjes
Ah,

In that case this should answer your question:
http://wiki.apache.org/hadoop/HowManyMapsAndReduces
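
In short: with FileInputFormat the number of map tasks follows from the number
of input splits, so the closest you can get is making the splits larger; an
exact count of N cannot be forced. A minimal sketch using the new-API
FileInputFormat helpers (the path and sizes are made up):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

  public class FewerMappers {
    public static void main(String[] args) throws Exception {
      Job job = new Job(new Configuration(), "fewer-mappers");
      job.setJarByClass(FewerMappers.class);
      FileInputFormat.addInputPath(job, new Path("/data/input")); // made-up path
      // A larger minimum split size means fewer, bigger splits and thus fewer
      // mappers. This only influences the count; it does not pin it to N.
      FileInputFormat.setMinInputSplitSize(job, 1024L * 1024 * 1024); // 1 GiB
      // ... set mapper/reducer/output as usual, then:
      // System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }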


2010/11/25 Shai Erera ser...@gmail.com:
 I wasn't talking about how to configure the cluster to not invoke more than
 a certain # of Mappers simultaneously. Instead, I'd like to configure a
 (certain) job to invoke exactly N Mappers, where N is the number of cores in
 the cluster, regardless of the size of the data. This is not critical if
 it can't be done, but it can improve the performance of my job if it can be
 done.

 Thanks
 Shai

 On Thu, Nov 25, 2010 at 9:55 PM, Niels Basjes ni...@basjes.nl wrote:

 Hi,

 2010/11/25 Shai Erera ser...@gmail.com:
  Is there a way to make MapReduce create exactly N Mappers? More
  specifically, if say my data can be split to 200 Mappers, and I have
  only
  100 cores, how can I ensure only 100 Mappers will be created? The number
  of
  cores is not something I know in advance, so writing a special
  InputFormat
  might be tricky, unless I can query Hadoop for the available # of cores
  (in
  the entire cluster).

 You can configure on a node by node basis how many map and reduce
 tasks can be started by the task tracker on that node.
 This is done via the conf/mapred-site.xml using these two settings:
 mapred.tasktracker.{map|reduce}.tasks.maximum

 Have a look at this page for more information
 http://hadoop.apache.org/common/docs/current/cluster_setup.html

 --
 Met vriendelijke groeten,

 Niels Basjes





-- 
Met vriendelijke groeten,

Niels Basjes


Re: Predicting how many values will I see in a call to reduce?

2010-11-08 Thread Niels Basjes
Hi,

2010/11/7 Anthony Urso anthony.u...@gmail.com

 Is there any way to know how many values I will see in a call to
 reduce without first counting through them all with the iterator?

 Under 0.21? 0.20? 0.19?


I looked for an answer to the same question a while ago and came to the
conclusion that you can't.
The main limitation is that the Iterator does not have a size or length
method.

-- 
Met vriendelijke groeten,

Niels Basjes


Re: Duplicated entries with map job reading from HBase

2010-11-05 Thread Niels Basjes
Hi,

I don't know the answer (simply not enough information in your email) but
I'm willing to make a guess:
Are you running on a system with two processing nodes?
If so, then try removing the Combiner. The combiner is a performance
optimization and the whole job should work without it.
Sometimes there is a design fault in the processing and the combiner
disrupts it.

HTH

Niels Basjes

2010/11/5 Adam Phelps a...@opendns.com

 I've noticed an odd behavior with a map-reduce job I've written which is
 reading data out of an HBase table.  After a couple days of poking at this I
 haven't been able to figure out the cause of the problem, so I figured I'd
 ask on here.

 (For reference I'm running with the cdh3b2 release)

 The problem is that it seems that every line from the HBase table is passed
 to the mappers twice, thus resulting in counts ending up as exactly double
 what they should be.

 I set up the job like this:

Scan scan = new Scan();
scan.addFamily(Bytes.toBytes(scanFamily));

TableMapReduceUtil.initTableMapperJob(table,
  scan,
  mapper,
  Text.class,
  LongWritable.class,
  job);
job.setCombinerClass(LongSumReducer.class);

job.setReducerClass(reducer);

 I've set up counters in the mapper to verify what is happening, so that I
 know for certain that the mapper is being called twice with the same bit of
 data.  I've also confirmed (using the hbase shell) that each entry appears
 only once in the table.

 Is there a known bug along these lines?  If not, does anyone have any
 thoughts on what might be causing this or where I'd start looking to
 diagnose?

 Thanks
 - Adam




-- 
Met vriendelijke groeten,

Niels Basjes


Understanding FileInputFormat and isSplittable.

2010-09-07 Thread Niels Basjes
Hi,

The last few weeks we have been building an application using Hadoop.
Because we're working with special logfiles (line oriented,
textual and gzipped) and we wanted to extract specific fields from
those files before putting them into our mapper, we chose to implement
our own derivative of the FileInputFormat class to do this.

All went well until we tried it with big (100 MiB and bigger) files.
We then noticed that a lot of the values were doubled, and when we
put in full-size production data for a test run we found single
events being counted 36 times.
It took a while to figure out what went wrong, but essentially the root
cause was the fact that the isSplittable method returned true for
gzipped files (which aren't splittable). The implementation that was
used was the one in FileInputFormat. The documentation for this method
states "Is the given filename splitable? Usually, true, but if the
file is stream compressed, it will not be." This documentation gave
us the illusion that the default FileInputFormat implementation would
handle compression correctly. Because it all worked with small gzipped
files we never expected the real implementation of this method to be
"return true;".

Because of this "true" value the framework decided to read
each input file in full once for every split it created, with
really messy effects in our case.

The derived TextInputFormat class does have a compression aware
implementation of isSplittable.
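
For reference, a compression aware override along those lines looks roughly
like this (note that the method in the API is actually spelled isSplitable,
with one 't'; the class below is a hypothetical custom input format with the
record reader left out):

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.compress.CompressionCodec;
  import org.apache.hadoop.io.compress.CompressionCodecFactory;
  import org.apache.hadoop.mapreduce.JobContext;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

  // Hypothetical custom input format; createRecordReader is left out.
  public abstract class MyLogInputFormat extends FileInputFormat<LongWritable, Text> {
    // Files compressed with a stream codec (such as gzip) must not be split.
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
      CompressionCodec codec =
          new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
      return codec == null;
    }
  }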

Given my current knowledge of Hadoop, I would have chosen to let the
default isSplittable implementation (i.e. the one in FileInputFormat)
be either safe (always return false) or correct (return whatever
is right for the applicable compression). The latter would make the
implementation match what I would expect from the documentation.

I would like to understand the logic behind the current implementation
choice in relation to what I expected (mainly from the documentation).

Thanks for explaining.

-- 
Best regards,

Niels Basjes