Re: Error importing hbase table on new system
But is not that tool for reading regions not export (dumps) ?? -Håvard On Sun, Sep 27, 2015 at 6:23 PM, Ted Yu wrote: > Have you used HFile tool ? > > The tool would print out that information as part of metadata. > > Cheers > > On Sun, Sep 27, 2015 at 9:19 AM, Håvard Wahl Kongsgård > wrote: >> >> Yes, I have tried to read them on another system as well. It worked >> there. But I don't know if they are HFilev1 or HFilev2 format(any way >> to check ?? ) >> >> This is the first lines from one of the files >> >> SEQ >> 1org.apache.hadoop.hbase.io.ImmutableBytesWritable%org.apache.hadoop.hbase.client.Result >> *org.apache.hadoop.io.compress.DefaultCodec���N�� >> $��a&t ��wb%!10107712083-10152358443612846x��� P ]�6:� w |pw >> ��� �$��K 0� Npw�$x�@� ���s�;�Ɦ���V�^��ݽW� �� � >> �қ ��<� /��0/ � ?'/ � >> � /��� �� 7{kG [(��� ���w�� OY^I ���}9 � �l��;�TJ�� �� �J� ‹ >> pu���V� ӡm�\E @ ��V6�oe45U ���,�3 ���Ͻ�w��O���zڼ�/��歇�KȦ/ ?�� Y;� >> / ��� �� �� }� ��룫-�'_�k� ��q� $ ��˨� � ���^ >> ��� i��� tH$/��e.J��{S �\��S >G d���1~ p#�� o �� ��M >> �!٠��;c��I kQ >> �A)|d�i�(Z�f��o Pb �j {� �x��� � `�b���cbb`�"�} � >> HCG��&�JG�%��',*!!�� >> �� � � ��& �_Q��R�2�1��_��~>:� b � ���w @�B� ~Y�H�(�h/FR >> _+��nX `#� >> |D��� �j���� f ��ƨT��k/ 颚h ��4` +Q#�ⵕ�,Z�80�V:� >> )Y)4Lq��[� z#���T> >> -Håvard >> >> On Sun, Sep 27, 2015 at 4:06 PM, Ted Yu wrote: >> > Have you verified that the files to be imported are in HFilev2 format ? >> > >> > http://hbase.apache.org/book.html#_hfile_tool >> > >> > Cheers >> > >> > On Sun, Sep 27, 2015 at 4:47 AM, Håvard Wahl Kongsgård >> > wrote: >> >> >> >> >Is the single node system secure ? >> >> >> >> No have not activated, just defaults >> >> >> >> the mapred conf. 
>> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> mapred.job.tracker >> >> >> >> rack3:8021 >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> mapred.jobtracker.plugins >> >> >> >> org.apache.hadoop.thriftfs.ThriftJobTrackerPlugin >> >> >> >> Comma-separated list of jobtracker plug-ins to be >> >> activated. >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> jobtracker.thrift.address >> >> >> >> 0.0.0.0:9290 >> >> >> >> >> >> >> >> >> >> >> >> >> >> >>Have you checked hdfs healthiness ? >> >> >> >> >> >> sudo -u hdfs hdfs dfsadmin -report >> >> >> >> Configured Capacity: 2876708585472 (2.62 TB) >> >> >> >> Present Capacity: 1991514849280 (1.81 TB) >> >> >> >> DFS Remaining: 1648230617088 (1.50 TB) >> >> >> >> DFS Used: 343284232192 (319.71 GB) >> >> >> >> DFS Used%: 17.24% >> >> >> >> Under replicated blocks: 52 >> >> >> >> Blocks with corrupt replicas: 0 >> >> >> >> Missing blocks: 0 >> >> >> >> >> >> - >> >> >> >> Datanodes available: 1 (1 total, 0 dead) >> >> >> >> >> >> Live datanodes: >> >> >> >> Name: 127.0.0.1:50010 (localhost) >> >> >> >> Hostname: rack3 >> >> >> >> Decommission Status : Normal >> >> >> >> Configured Capacity: 2876708585472 (2.62 TB) >> >> >> >> DFS Used: 343284232192 (319.71 GB) >> >> >> >> Non DFS Used: 885193736192 (824.40 GB) >> >> >> >> DFS Remaining: 1648230617088 (1.50 TB) >> >> >> >> DFS Used%: 11.93% >> >> >> >> DFS Remaining%: 57.30% >> >> >> >> Last contact: Sun Sep 27 13:44:45 CEST 2015 >> >> >> >> >> >> >>To which release of hbase were you importing ? >> >> >> &
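For reference, a hedged sketch of the two checks discussed in this thread (file paths are placeholders; adjust to the cluster). Export writes SequenceFiles — consistent with the "SEQ" magic at the top of the pasted dump — so the HFile tool only applies to files taken from region directories:

```shell
# Print HFile metadata (including the format version) for a region file:
hbase org.apache.hadoop.hbase.io.hfile.HFile -m -f /hbase/crawler/REGION/FAMILY/HFILE

# For an Export dump (a SequenceFile of Result values), try decoding the
# header instead; HBase classes must be on the classpath:
HADOOP_CLASSPATH=$(hbase classpath) hadoop fs -text /crawler_hbase/crawler/part-m-00000 | head -1
```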
Re: Error importing hbase table on new system
Yes, I have tried to read them on another system as well. It worked there. But I don't know if they are HFilev1 or HFilev2 format(any way to check ?? ) This is the first lines from one of the files SEQ1org.apache.hadoop.hbase.io.ImmutableBytesWritable%org.apache.hadoop.hbase.client.Result*org.apache.hadoop.io.compress.DefaultCodec���N�� $��a&t��wb%!10107712083-10152358443612846x���P]�6:�w|pw ��� �$��K 0� Npw�$x�@����s�;�Ɦ���V�^��ݽW���� �қ��<�/��0/�?'/� �/�����7{kG[(������w��OY^I���}9��l��;�TJ�����J�pu���V�ӡm�\E@��V6�oe45U���,�3���Ͻ�w��O���zڼ�/��歇�KȦ/?��Y;�/�������}���룫-�'_�k���q�$��˨�����^ ���i���tH$/��e.J��{S�\��S>Gd���1~p#��o����M �!٠��;c��IkQ �A)|d�i�(Z�f��oPb�j{��x����`�b���cbb`�"�}�HCG��&�JG�%��',*!!�� ������&�_Q��R�2�1��_��~>:�b����w@�B�~Y�H�(�h/FR_+��nX`#� |D����j����f��ƨT��k/颚h��4`+Q#�ⵕ�,Z�80�V:� )Y)4Lq��[�z#���T wrote: > Have you verified that the files to be imported are in HFilev2 format ? > > http://hbase.apache.org/book.html#_hfile_tool > > Cheers > > On Sun, Sep 27, 2015 at 4:47 AM, Håvard Wahl Kongsgård > wrote: >> >> >Is the single node system secure ? >> >> No have not activated, just defaults >> >> the mapred conf. >> >> >> >> >> >> >> >> >> >> >> mapred.job.tracker >> >> rack3:8021 >> >> >> >> >> >> >> >> >> mapred.jobtracker.plugins >> >> org.apache.hadoop.thriftfs.ThriftJobTrackerPlugin >> >> Comma-separated list of jobtracker plug-ins to be >> activated. >> >> >> >> >> >> >> >> jobtracker.thrift.address >> >> 0.0.0.0:9290 >> >> >> >> >> >> >> >>Have you checked hdfs healthiness ? 
>> >> >> sudo -u hdfs hdfs dfsadmin -report >> >> Configured Capacity: 2876708585472 (2.62 TB) >> >> Present Capacity: 1991514849280 (1.81 TB) >> >> DFS Remaining: 1648230617088 (1.50 TB) >> >> DFS Used: 343284232192 (319.71 GB) >> >> DFS Used%: 17.24% >> >> Under replicated blocks: 52 >> >> Blocks with corrupt replicas: 0 >> >> Missing blocks: 0 >> >> >> - >> >> Datanodes available: 1 (1 total, 0 dead) >> >> >> Live datanodes: >> >> Name: 127.0.0.1:50010 (localhost) >> >> Hostname: rack3 >> >> Decommission Status : Normal >> >> Configured Capacity: 2876708585472 (2.62 TB) >> >> DFS Used: 343284232192 (319.71 GB) >> >> Non DFS Used: 885193736192 (824.40 GB) >> >> DFS Remaining: 1648230617088 (1.50 TB) >> >> DFS Used%: 11.93% >> >> DFS Remaining%: 57.30% >> >> Last contact: Sun Sep 27 13:44:45 CEST 2015 >> >> >> >>To which release of hbase were you importing ? >> >> Hbase 0.94 (CHD 4) >> >> the new one is CHD 5.4 >> >> On Sun, Sep 27, 2015 at 1:32 PM, Ted Yu wrote: >> > Is the single node system secure ? >> > Have you checked hdfs healthiness ? >> > To which release of hbase were you importing ? >> > >> > Thanks >> > >> >> On Sep 27, 2015, at 3:06 AM, Håvard Wahl Kongsgård >> >> wrote: >> >> >> >> Hi, Iam trying to import a old backup to a new smaller system (just >> >> single node, to get the data out) >> >> >> >> when I use >> >> >> >> sudo -u hbase hbase -Dhbase.import.version=0.94 >> >> org.apache.hadoop.hbase.mapreduce.Import crawler >> >> /crawler_hbase/crawler >> >> >> >> I get this error in the tasks . Is this a permission problem? 
>> >> >> >> >> >> 2015-09-26 23:56:32,995 ERROR >> >> org.apache.hadoop.security.UserGroupInformation: >> >> PriviledgedActionException as:mapred (auth:SIMPLE) >> >> cause:java.io.IOException: keyvalues=NONE read 4096 bytes, should read >> >> 14279 >> >> 2015-09-26 23:56:32,996 WARN org.apache.hadoop.mapred.Child: Error >> >> running child >> >> java.io.IOException: keyvalues=NONE read 4096 bytes, should read 14279 >> >> at >> >> org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:2221) >> >>
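Alongside `dfsadmin -report`, a hedged suggestion: fsck on the import path shows per-file block health (path taken from the thread; the flags are standard):

```shell
# Reports missing/corrupt blocks per file under the given path:
sudo -u hdfs hdfs fsck /crawler_hbase/crawler -files -blocks -locations
```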
Re: Error importing hbase table on new system
>Is the single node system secure ?

No, have not activated it; just the defaults.

The mapred config:

  <property>
    <name>mapred.job.tracker</name>
    <value>rack3:8021</value>
  </property>
  <property>
    <name>mapred.jobtracker.plugins</name>
    <value>org.apache.hadoop.thriftfs.ThriftJobTrackerPlugin</value>
    <description>Comma-separated list of jobtracker plug-ins to be activated.</description>
  </property>
  <property>
    <name>jobtracker.thrift.address</name>
    <value>0.0.0.0:9290</value>
  </property>

>>Have you checked hdfs healthiness ?

sudo -u hdfs hdfs dfsadmin -report

Configured Capacity: 2876708585472 (2.62 TB)
Present Capacity: 1991514849280 (1.81 TB)
DFS Remaining: 1648230617088 (1.50 TB)
DFS Used: 343284232192 (319.71 GB)
DFS Used%: 17.24%
Under replicated blocks: 52
Blocks with corrupt replicas: 0
Missing blocks: 0

-
Datanodes available: 1 (1 total, 0 dead)

Live datanodes:
Name: 127.0.0.1:50010 (localhost)
Hostname: rack3
Decommission Status : Normal
Configured Capacity: 2876708585472 (2.62 TB)
DFS Used: 343284232192 (319.71 GB)
Non DFS Used: 885193736192 (824.40 GB)
DFS Remaining: 1648230617088 (1.50 TB)
DFS Used%: 11.93%
DFS Remaining%: 57.30%
Last contact: Sun Sep 27 13:44:45 CEST 2015

>>To which release of hbase were you importing ?

HBase 0.94 (CDH 4); the new one is CDH 5.4.

On Sun, Sep 27, 2015 at 1:32 PM, Ted Yu wrote:
> Is the single node system secure ?
> Have you checked hdfs healthiness ?
> To which release of hbase were you importing ?
>
> Thanks
>
>> On Sep 27, 2015, at 3:06 AM, Håvard Wahl Kongsgård
>> wrote:
>>
>> Hi, I am trying to import an old backup to a new, smaller system (just
>> a single node, to get the data out).
>>
>> When I use
>>
>> sudo -u hbase hbase -Dhbase.import.version=0.94 \
>>   org.apache.hadoop.hbase.mapreduce.Import crawler /crawler_hbase/crawler
>>
>> I get this error in the tasks. Is this a permission problem?
>> >> >> 2015-09-26 23:56:32,995 ERROR >> org.apache.hadoop.security.UserGroupInformation: >> PriviledgedActionException as:mapred (auth:SIMPLE) >> cause:java.io.IOException: keyvalues=NONE read 4096 bytes, should read >> 14279 >> 2015-09-26 23:56:32,996 WARN org.apache.hadoop.mapred.Child: Error running >> child >> java.io.IOException: keyvalues=NONE read 4096 bytes, should read 14279 >> at >> org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:2221) >> at >> org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.nextKeyValue(SequenceFileRecordReader.java:74) >> at >> org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:483) >> at >> org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:76) >> at >> org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:85) >> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:139) >> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672) >> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330) >> at org.apache.hadoop.mapred.Child$4.run(Child.java:268) >> at java.security.AccessController.doPrivileged(Native Method) >> at javax.security.auth.Subject.doAs(Subject.java:415) >> at >> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491) >> at org.apache.hadoop.mapred.Child.main(Child.java:262) >> 2015-09-26 23:56:33,002 INFO org.apache.hadoop.mapred.Task: Runnning >> cleanup for the task >> >> >> >> -- >> Håvard Wahl Kongsgård >> Data Scientist -- Håvard Wahl Kongsgård Data Scientist
Error importing hbase table on new system
Hi, I am trying to import an old backup to a new, smaller system (just a
single node, to get the data out).

When I use

sudo -u hbase hbase -Dhbase.import.version=0.94 \
  org.apache.hadoop.hbase.mapreduce.Import crawler /crawler_hbase/crawler

I get this error in the tasks. Is this a permission problem?

2015-09-26 23:56:32,995 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:mapred (auth:SIMPLE) cause:java.io.IOException: keyvalues=NONE read 4096 bytes, should read 14279
2015-09-26 23:56:32,996 WARN org.apache.hadoop.mapred.Child: Error running child
java.io.IOException: keyvalues=NONE read 4096 bytes, should read 14279
    at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:2221)
    at org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.nextKeyValue(SequenceFileRecordReader.java:74)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:483)
    at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:76)
    at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:85)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:139)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
    at org.apache.hadoop.mapred.Child.main(Child.java:262)
2015-09-26 23:56:33,002 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task

--
Håvard Wahl Kongsgård
Data Scientist
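A hedged way to narrow this down (the loop and paths are illustrative): the "read 4096 bytes, should read 14279" IOException comes from the SequenceFile reader, so checking whether the dump files can be read end-to-end outside MapReduce separates a corrupt or truncated copy from a job or permission problem:

```shell
# HBase classes are needed to deserialize the Result values in the dump:
export HADOOP_CLASSPATH=$(hbase classpath)
for f in $(hadoop fs -ls /crawler_hbase/crawler | awk '{print $NF}' | grep part); do
  hadoop fs -text "$f" > /dev/null || echo "unreadable: $f"
done
```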
Re: How can I add a new hard disk in an existing HDFS cluster?
go for ext3 or ext4 On Fri, May 3, 2013 at 8:32 AM, Joarder KAMAL wrote: > Hi, > > I have a running HDFS cluster (Hadoop/HBase) consists of 4 nodes and the > initial hard disk (/dev/vda1) size is 10G only. Now I have a second hard > drive /dev/vdb of 60GB size and want to add it into my existing HDFS > cluster. How can I format the new hard disk (and in which format? XFS?) and > mount it to work with HDFS > > Default HDFS directory is situated in > /usr/local/hadoop-1.0.4/hadoop-datastore > And I followed this link for installation. > > http://ankitasblogger.blogspot.com.au/2011/01/hadoop-cluster-setup.html > > Many thanks in advance :) > > > Regards, > Joarder Kamal > -- Håvard Wahl Kongsgård Data Scientist Faculty of Medicine & Department of Mathematical Sciences NTNU http://havard.dbkeeping.com/
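A hedged sketch of the steps, assuming the new disk really is /dev/vdb and a Hadoop 1.x layout; double-check the device name before running mkfs, since it is destructive:

```shell
sudo mkfs.ext4 /dev/vdb                       # ext3 also works; ext4 preferred
sudo mkdir -p /mnt/hdfs2
sudo mount /dev/vdb /mnt/hdfs2
echo '/dev/vdb /mnt/hdfs2 ext4 defaults,noatime 0 0' | sudo tee -a /etc/fstab
sudo mkdir -p /mnt/hdfs2/dfs/data
sudo chown -R hadoop:hadoop /mnt/hdfs2/dfs    # use whatever user runs the datanode
# Then add the new directory to dfs.data.dir in conf/hdfs-site.xml as a
# comma-separated list, e.g.:
#   <name>dfs.data.dir</name>
#   <value>/usr/local/hadoop-1.0.4/hadoop-datastore/dfs/data,/mnt/hdfs2/dfs/data</value>
# and restart the datanode so it starts writing blocks to both disks.
```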
Re: R environment with Hadoop
Hi, simpler is always better...

...for example, if you use Hadoop with Java: http://www.rforge.net/rJava/

...and if you use Hadoop with Python (pydoop, dumbo):
http://rpy.sourceforge.net/rpy2/doc-2.0/html/index.html

On Wed, Apr 10, 2013 at 8:49 PM, Shah, Rahul1 wrote:
> Hi,
>
> I have to find out whether there is an R environment that can be run on
> Hadoop. I see several packages of R and Hadoop. Any pointer to which is a
> good one to use? How can I learn R and start on with it?
>
> -Rahul

--
Håvard Wahl Kongsgård
Data Scientist
Faculty of Medicine &
Department of Mathematical Sciences
NTNU

http://havard.dbkeeping.com/
Re: Which hadoop installation should I use on ubuntu server?
I recommend Cloudera's CDH4 on Ubuntu 12.04 LTS.

On Thu, Mar 28, 2013 at 7:07 AM, David Parks wrote:
> I'm moving off AWS MapReduce to our own cluster. I'm installing Hadoop on
> Ubuntu Server 12.10.
>
> I see a .deb installer and installed that, but it seems like files are all
> over the place: `/usr/share/Hadoop`, `/etc/hadoop`, `/usr/bin/hadoop`. And
> the documentation is a bit harder to follow:
>
> http://hadoop.apache.org/docs/r1.1.2/cluster_setup.html
>
> So I just wonder if this installer is the best approach, or if it'll be
> easier/better to just install the basic build in /opt/hadoop and perhaps
> the docs become easier to follow. Thoughts?
>
> Thanks,
> Dave

--
Håvard Wahl Kongsgård
Data Scientist
Faculty of Medicine &
Department of Mathematical Sciences
NTNU

http://havard.dbkeeping.com/
Re: Hadoop cluster hangs on big hive job
Dude, I'm not going to read all your log files, but try to run this as a
normal MapReduce job. It could be memory related, something wrong with some
of the zip files, a wrong config, etc.

-Håvard

On Thu, Mar 7, 2013 at 8:53 PM, Daning Wang wrote:
> We have a hive query processing zipped csv files. The query was scanning
> 10 days of data (partitioned by date), around 130G per day. The problem is
> not consistent: if you run it again, it might go through. But the problem
> has never happened on the smaller jobs (like processing only one day's
> data).
>
> We don't have a space issue.
>
> I have attached the log file from when the problem happens. It is stuck
> like the following (just search for "19706 of 49964"):
>
> 2013-03-05 15:13:51,587 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_19_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s)
> 2013-03-05 15:13:51,811 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_39_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s)
> 2013-03-05 15:13:52,551 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_32_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s)
> 2013-03-05 15:13:52,760 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_00_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s)
> 2013-03-05 15:13:52,946 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_24_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s)
> 2013-03-05 15:13:54,742 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000008_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s)
>
> Thanks,
>
> Daning
>
> On Thu, Mar 7, 2013 at 12:21 AM, Håvard Wahl Kongsgård
> wrote:
>>
>> hadoop logs?
>>
>> On 6 March 2013 21:04, "Daning Wang" wrote:
>>>
>>> We have a 5 node cluster (Hadoop 1.0.4). It hung a couple of times while
>>> running big jobs.
>>> Basically all the nodes are dead; from the tasktracker's
>>> log it looks like it went into some kind of loop forever.
>>>
>>> All the log entries look like this when the problem happens.
>>>
>>> Any idea how to debug the issue?
>>>
>>> Thanks in advance.
>>>
>>> 2013-03-05 15:13:19,526 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_12_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s)
>>> 2013-03-05 15:13:19,552 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_28_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s)
>>> 2013-03-05 15:13:20,858 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_36_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s)
>>> 2013-03-05 15:13:21,141 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_16_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s)
>>> 2013-03-05 15:13:21,486 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_19_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s)
>>> 2013-03-05 15:13:21,692 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_39_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s)
>>> 2013-03-05 15:13:22,448 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_32_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s)
>>> 2013-03-05 15:13:22,643 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_00_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s)
>>> 2013-03-05 15:13:22,840 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_24_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s)
>>> 2013-03-05 15:13:24,628 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_08_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s)
>>> 2013-03-05 15:13:24,723 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_39_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s)
>>> 2013-03-05 15:13:25,336 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_04_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s)
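One cheap check from the advice above, as a hedged sketch (table path and partition layout are hypothetical): verify that none of the day's gzipped csv files are corrupt before blaming the cluster.

```shell
for f in $(hadoop fs -ls /warehouse/mytable/dt=2013-03-05 | awk '{print $NF}'); do
  hadoop fs -cat "$f" | gzip -t 2>/dev/null || echo "corrupt: $f"
done
```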
Re: Hadoop cluster hangs on big hive job
hadoop logs?

On 6 March 2013 21:04, "Daning Wang" wrote:
> We have a 5 node cluster (Hadoop 1.0.4). It hung a couple of times while
> running big jobs. Basically all the nodes are dead; from the tasktracker's
> log it looks like it went into some kind of loop forever.
>
> All the log entries look like this when the problem happens.
>
> Any idea how to debug the issue?
>
> Thanks in advance.
>
> 2013-03-05 15:13:19,526 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_12_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s)
> 2013-03-05 15:13:19,552 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_28_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s)
> 2013-03-05 15:13:20,858 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_36_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s)
> 2013-03-05 15:13:21,141 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_16_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s)
> 2013-03-05 15:13:21,486 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_19_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s)
> 2013-03-05 15:13:21,692 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_39_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s)
> 2013-03-05 15:13:22,448 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_32_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s)
> 2013-03-05 15:13:22,643 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_00_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s)
> 2013-03-05 15:13:22,840 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_24_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s)
> 2013-03-05 15:13:24,628 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_08_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s)
> 2013-03-05 15:13:24,723 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_39_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s)
> 2013-03-05 15:13:25,336 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_04_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s)
> 2013-03-05 15:13:25,539 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_43_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s)
> 2013-03-05 15:13:25,545 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_12_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s)
> 2013-03-05 15:13:25,569 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_28_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s)
> 2013-03-05 15:13:25,855 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_24_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s)
> 2013-03-05 15:13:26,876 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_36_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s)
> 2013-03-05 15:13:27,159 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_16_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s)
> 2013-03-05 15:13:27,505 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_19_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s)
> 2013-03-05 15:13:28,464 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_32_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s)
> 2013-03-05 15:13:28,553 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_43_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s)
> 2013-03-05 15:13:28,561 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_12_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s)
> 2013-03-05 15:13:28,659 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_00_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s)
> 2013-03-05 15:13:30,519 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_19_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s)
> 2013-03-05 15:13:30,644 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_08_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s)
> 2013-03-05 15:13:30,741 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_39_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s)
> 2013-03-05 15:13:31,369 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_04_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s)
> 2013-03-05 15:13:31,675 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_00_0 0.131468% reduce > copy
Re: Skipping entire task
Thanks, I was unaware of mapred.max.map.failures.percent.

-Håvard

On Sun, Jan 6, 2013 at 3:46 PM, Harsh J wrote:
> You can use the mapred.max.map.failures.percent and
> mapred.max.reduce.failures.percent features to control the percentage
> of allowed failures of tasks in a single job (despite which the job is
> marked successful).
>
> On Sun, Jan 6, 2013 at 8:04 PM, Håvard Wahl Kongsgård
> wrote:
>>> Are tasks being executed multiple times due to failures? Sorry, it was not
>>> very clear from your question.
>>
>> Yes, and I simply want to skip them if they fail more than x times
>> (after all, this is big data :) ).
>>
>> -Håvard
>>
>> On Sun, Jan 6, 2013 at 3:01 PM, Hemanth Yamijala
>> wrote:
>>> Hi,
>>>
>>> Are tasks being executed multiple times due to failures? Sorry, it was not
>>> very clear from your question.
>>>
>>> Thanks
>>> hemanth
>>>
>>> On Sat, Jan 5, 2013 at 7:44 PM, David Parks wrote:
>>>>
>>>> Thinking here... if you submitted the task programmatically you should be
>>>> able to capture the failure of the task and gracefully move past it to
>>>> your next tasks.
>>>>
>>>> To say it in a long-winded way: let's say you submit a job to Hadoop, a
>>>> java jar, and your main class implements Tool. That code has the
>>>> responsibility to submit a series of jobs to hadoop, something like this:
>>>>
>>>> try{
>>>>     Job myJob = new MyJob(getConf());
>>>>     myJob.submitAndWait();
>>>> }catch(Exception uhhohh){
>>>>     //Deal with the issue and move on
>>>> }
>>>> Job myNextJob = new MyNextJob(getConf());
>>>> myNextJob.submit();
>>>>
>>>> Just pseudo code there to demonstrate my thought.
>>>>
>>>> David
>>>>
>>>> -----Original Message-----
>>>> From: Håvard Wahl Kongsgård [mailto:haavard.kongsga...@gmail.com]
>>>> Sent: Saturday, January 05, 2013 4:54 PM
>>>> To: user
>>>> Subject: Skipping entire task
>>>>
>>>> Hi, hadoop can skip bad records:
>>>> http://devblog.factual.com/practical-hadoop-streaming-dealing-with-brittle-code
>>>> But is it also possible to skip entire tasks?
>>>>
>>>> -Håvard
>>>>
>>>> --
>>>> Håvard Wahl Kongsgård
>>>> Faculty of Medicine &
>>>> Department of Mathematical Sciences
>>>> NTNU
>>>>
>>>> http://havard.security-review.net/
>>
>> --
>> Håvard Wahl Kongsgård
>> Faculty of Medicine &
>> Department of Mathematical Sciences
>> NTNU
>>
>> http://havard.security-review.net/
>
> --
> Harsh J

--
Håvard Wahl Kongsgård
Faculty of Medicine &
Department of Mathematical Sciences
NTNU

http://havard.security-review.net/
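A hedged streaming example of the properties Harsh mentions (Hadoop 1.x names; the jar location and job arguments are hypothetical): here up to 10% of map and reduce tasks may fail while the job is still marked successful.

```shell
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
  -D mapred.max.map.failures.percent=10 \
  -D mapred.max.reduce.failures.percent=10 \
  -input /data/in -output /data/out \
  -mapper mapper.py -reducer reducer.py \
  -file mapper.py -file reducer.py
```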
Re: Skipping entire task
> Are tasks being executed multiple times due to failures? Sorry, it was not > very clear from your question. yes, and I simply want to skip them if they fail more than x times(after all this is big data :) ). -Håvard On Sun, Jan 6, 2013 at 3:01 PM, Hemanth Yamijala wrote: > Hi, > > Are tasks being executed multiple times due to failures? Sorry, it was not > very clear from your question. > > Thanks > hemanth > > > On Sat, Jan 5, 2013 at 7:44 PM, David Parks wrote: >> >> Thinking here... if you submitted the task programmatically you should be >> able to capture the failure of the task and gracefully move past it to >> your >> next tasks. >> >> To say it in a long-winded way: Let's say you submit a job to Hadoop, a >> java jar, and your main class implements Tool. That code has the >> responsibility to submit a series of jobs to hadoop, something like this: >> >> try{ >> Job myJob = new MyJob(getConf()); >> myJob.submitAndWait(); >> }catch(Exception uhhohh){ >> //Deal with the issue and move on >> } >> Job myNextJob = new MyNextJob(getConf()); >> myNextJob.submit(); >> >> Just pseudo code there to demonstrate my thought. >> >> David >> >> >> >> -Original Message- >> From: Håvard Wahl Kongsgård [mailto:haavard.kongsga...@gmail.com] >> Sent: Saturday, January 05, 2013 4:54 PM >> To: user >> Subject: Skipping entire task >> >> Hi, hadoop can skip bad records >> >> http://devblog.factual.com/practical-hadoop-streaming-dealing-with-brittle-c >> ode. >> But it is also possible to skip entire tasks? >> >> -Håvard >> >> -- >> Håvard Wahl Kongsgård >> Faculty of Medicine & >> Department of Mathematical Sciences >> NTNU >> >> http://havard.security-review.net/ >> > -- Håvard Wahl Kongsgård Faculty of Medicine & Department of Mathematical Sciences NTNU http://havard.security-review.net/
Re: Skipping entire task
Yes, but I use Pydoop, not the native Java library. The problem is that the
same task fails every time, so a solution is not that straightforward. And
Pydoop does not seem to have any method to tell the task how many times it
has failed. So if there is no native method in hadoop, I could use a
database or something for that purpose. Any other ideas?

-Håvard

On Sat, Jan 5, 2013 at 3:14 PM, David Parks wrote:
> Thinking here... if you submitted the task programmatically you should be
> able to capture the failure of the task and gracefully move past it to your
> next tasks.
>
> To say it in a long-winded way: let's say you submit a job to Hadoop, a
> java jar, and your main class implements Tool. That code has the
> responsibility to submit a series of jobs to hadoop, something like this:
>
> try{
>     Job myJob = new MyJob(getConf());
>     myJob.submitAndWait();
> }catch(Exception uhhohh){
>     //Deal with the issue and move on
> }
> Job myNextJob = new MyNextJob(getConf());
> myNextJob.submit();
>
> Just pseudo code there to demonstrate my thought.
>
> David
>
> -----Original Message-----
> From: Håvard Wahl Kongsgård [mailto:haavard.kongsga...@gmail.com]
> Sent: Saturday, January 05, 2013 4:54 PM
> To: user
> Subject: Skipping entire task
>
> Hi, hadoop can skip bad records:
> http://devblog.factual.com/practical-hadoop-streaming-dealing-with-brittle-code
> But is it also possible to skip entire tasks?
>
> -Håvard
>
> --
> Håvard Wahl Kongsgård
> Faculty of Medicine &
> Department of Mathematical Sciences
> NTNU
>
> http://havard.security-review.net/

--
Håvard Wahl Kongsgård
Faculty of Medicine &
Department of Mathematical Sciences
NTNU

http://havard.security-review.net/
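One possible workaround, as a hedged sketch: this assumes (not verified for Pydoop) that the task sees its attempt id in the environment as mapred_task_id, the way streaming tasks do, e.g. attempt_201301050000_0001_m_000003_2, where the trailing field is the attempt number. A task could then bail out cleanly on its last allowed attempt instead of failing the whole job:

```shell
# Parse the attempt counter from the (assumed) mapred_task_id env var:
attempt="${mapred_task_id##*_}"        # "2" for the example id above
if [ "${attempt:-0}" -ge 2 ]; then
    # third attempt of this task: emit nothing and exit successfully
    exit 0
fi
# ... normal task work goes here ...
```

The same parsing could be done in the Python task itself via os.environ.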
Skipping entire task
Hi, hadoop can skip bad records:
http://devblog.factual.com/practical-hadoop-streaming-dealing-with-brittle-code
But is it also possible to skip entire tasks?

-Håvard

--
Håvard Wahl Kongsgård
Faculty of Medicine &
Department of Mathematical Sciences
NTNU

http://havard.security-review.net/
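For the record-level skipping linked above, a hedged streaming sketch (Hadoop 1.x property names; jar location and job details are hypothetical). Note that skipping only kicks in after an attempt has already failed the configured number of times:

```shell
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
  -D mapred.skip.attempts.to.start.skipping=2 \
  -D mapred.skip.map.max.skip.records=1 \
  -input /data/in -output /data/out \
  -mapper mapper.py -reducer reducer.py
```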
Re: Hadoop in Pseudo-Distributed mode on Mac OS X 10.8
Pseudo-distributed mode is good for developing and testing hadoop code. But instead of experimenting with hadoop on your mac, I would go for hadoop on EC2. With starcluster http://web.mit.edu/star/cluster/ it takes just a single command to start hadoop. You also get a fixed environment. -Håvard On Mon, Aug 13, 2012 at 6:21 AM, Subho Banerjee wrote: > Hello, > > I am running hadoop v1.0.3 in Mac OS X 10.8 with Java_1.6.0_33-b03-424 > > > When running hadoop on pseudo-distributed mode, the map seems to work, but > it cannot compute the reduce. > > 12/08/13 08:58:12 INFO mapred.JobClient: Running job: job_201208130857_0001 > 12/08/13 08:58:13 INFO mapred.JobClient: map 0% reduce 0% > 12/08/13 08:58:27 INFO mapred.JobClient: map 20% reduce 0% > 12/08/13 08:58:33 INFO mapred.JobClient: map 30% reduce 0% > 12/08/13 08:58:36 INFO mapred.JobClient: map 40% reduce 0% > 12/08/13 08:58:39 INFO mapred.JobClient: map 50% reduce 0% > 12/08/13 08:58:42 INFO mapred.JobClient: map 60% reduce 0% > 12/08/13 08:58:45 INFO mapred.JobClient: map 70% reduce 0% > 12/08/13 08:58:48 INFO mapred.JobClient: map 80% reduce 0% > 12/08/13 08:58:51 INFO mapred.JobClient: map 90% reduce 0% > 12/08/13 08:58:54 INFO mapred.JobClient: map 100% reduce 0% > 12/08/13 08:59:14 INFO mapred.JobClient: Task Id : > attempt_201208130857_0001_m_00_0, Status : FAILED > Too many fetch-failures > 12/08/13 08:59:14 WARN mapred.JobClient: Error reading task outputServer > returned HTTP response code: 403 for URL: > http://10.1.66.17:50060/tasklog?plaintext=true&attemptid=attempt_201208130857_0001_m_00_0&filter=stdout > 12/08/13 08:59:14 WARN mapred.JobClient: Error reading task outputServer > returned HTTP response code: 403 for URL: > http://10.1.66.17:50060/tasklog?plaintext=true&attemptid=attempt_201208130857_0001_m_00_0&filter=stderr > 12/08/13 08:59:18 INFO mapred.JobClient: map 89% reduce 0% > 12/08/13 08:59:21 INFO mapred.JobClient: map 100% reduce 0% > 12/08/13 09:00:14 INFO mapred.JobClient: Task Id : > 
attempt_201208130857_0001_m_01_0, Status : FAILED > Too many fetch-failures > > Here is what I get when I try to see the tasklog using the links given in > the output > > http://10.1.66.17:50060/tasklog?plaintext=true&attemptid=attempt_201208130857_0001_m_00_0&filter=stderr > ---> > 2012-08-13 08:58:39.189 java[74092:1203] Unable to load realm info from > SCDynamicStore > > http://10.1.66.17:50060/tasklog?plaintext=true&attemptid=attempt_201208130857_0001_m_00_0&filter=stdout > ---> > > I have changed my hadoop-env.sh according to Mathew Buckett in > https://issues.apache.org/jira/browse/HADOOP-7489 > > Also this error of Unable to load realm info from SCDynamicStore does not > show up when I do 'hadoop namenode -format' or 'start-all.sh' > > I am also attaching a zipped copy of my logs > > > Cheers, > > Subho. -- Håvard Wahl Kongsgård Faculty of Medicine & Department of Mathematical Sciences NTNU http://havard.security-review.net/
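For reference, the workaround from HADOOP-7489 mentioned above amounts to passing two empty Kerberos properties to the JVM in conf/hadoop-env.sh, which silences the OS X "Unable to load realm info from SCDynamicStore" message:

```shell
# conf/hadoop-env.sh — OS X workaround from HADOOP-7489 for the
# "Unable to load realm info from SCDynamicStore" warning
export HADOOP_OPTS="${HADOOP_OPTS} -Djava.security.krb5.realm= -Djava.security.krb5.kdc="
```

Note this only removes the warning; the 403/fetch-failure problem is usually a separate hostname-resolution issue.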
Re: example usage of s3 file system
see also http://wiki.apache.org/hadoop/AmazonS3 On Tue, Aug 28, 2012 at 9:14 AM, Chris Collins wrote: > Hi I am trying to use the Hadoop filesystem abstraction with S3 but in my > tinkering I am not having a great deal of success. I am particularly > interested in the ability to mimic a directory structure (since s3 native > doesn't do it). > > Can anyone point me to some good example usage of Hadoop FileSystem with s3? > > I created a few directories using transit and AWS S3 console for test. Doing > a liststatus of the bucket returns a FileStatus object of the directory > created but if I try to do a liststatus of that path I am getting a 404: > > org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: > Request Error. HEAD '/' on Host > > Probably not the best list to look for help, any clues appreciated. > > C -- Håvard Wahl Kongsgård Faculty of Medicine & Department of Mathematical Sciences NTNU http://havard.security-review.net/
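As a minimal sketch of the usage the wiki page describes (the bucket name here is a placeholder, and the credential keys must be set in core-site.xml first), the ordinary FileSystem shell works against s3n:// URIs once fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey are configured:

```shell
# Hypothetical bucket; requires fs.s3n.awsAccessKeyId and
# fs.s3n.awsSecretAccessKey in core-site.xml. s3n:// stores plain
# files; "directories" are emulated via key prefixes, which is why
# listing a path created by another tool can behave unexpectedly.
hadoop fs -mkdir s3n://my-bucket/logs/2012
hadoop fs -put local.txt s3n://my-bucket/logs/2012/
hadoop fs -ls s3n://my-bucket/logs/
```

Directories created by other S3 clients (Transmit, the AWS console) use their own placeholder-object conventions, which jets3t may not recognize; that mismatch is one common cause of the 404 above.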
Re: Using Multiple Input format with Hadoop Streaming
Dude, preprocess the data! -Håvard On Sat, Aug 25, 2012 at 7:48 AM, Siddharth Tiwari wrote: > > How shall I give multiple inputs and keep multiple mappers in Streaming. How > shall I map each output to a specific mapper in Streaming. > > ** > Cheers !!! > Siddharth Tiwari > Have a refreshing day !!! > "Every duty is holy, and devotion to duty is the highest form of worship of > God.” > "Maybe other people will try to limit me but I don't limit myself" -- Håvard Wahl Kongsgård Faculty of Medicine & Department of Mathematical Sciences NTNU http://havard.security-review.net/
Re: secondary namenode storage location
https://ccp.cloudera.com/display/CDHDOC/CDH3+Deployment+on+a+Cluster#CDH3DeploymentonaCluster-ConfiguringtheSecondaryNameNode fs.checkpoint.dir = /cache/hadoop/dfs fs.checkpoint.edits.dir = /cache/hadoop/dfs -Håvard On Fri, Aug 24, 2012 at 4:20 PM, Abhay Ratnaparkhi wrote: > Hello Everyone, > > I have specified the secondary namenode in the masters file. > Which property is to be used to show a path used to store HDFS data by > secondary node? > What happens if I don't specify that property (or what is the default > location the secondary namenode uses)? > > Regards, > Abhay > > -- Håvard Wahl Kongsgård Faculty of Medicine & Department of Mathematical Sciences NTNU http://havard.security-review.net/
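Spelled out as it would appear in core-site.xml on the secondary namenode (the /cache/hadoop/dfs path is the example from the reply above; if unset, both properties default to a directory under hadoop.tmp.dir):

```xml
<!-- core-site.xml on the secondary namenode -->
<property>
  <name>fs.checkpoint.dir</name>
  <value>/cache/hadoop/dfs</value>
</property>
<property>
  <name>fs.checkpoint.edits.dir</name>
  <value>/cache/hadoop/dfs</value>
</property>
```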
Re: namenode not starting
You should start with a reboot of the system. A lesson to everyone, this is exactly why you should have a secondary name node (http://wiki.apache.org/hadoop/FAQ#What_is_the_purpose_of_the_secondary_name-node.3F) and run the namenode on a mirrored RAID-5/10 disk. -Håvard On Fri, Aug 24, 2012 at 9:40 AM, Abhay Ratnaparkhi wrote: > Hello, > > I was using the cluster for a long time and had not formatted the namenode. > I ran bin/stop-all.sh and bin/start-all.sh scripts only. > > I am using NFS for dfs.name.dir. > hadoop.tmp.dir is a /tmp directory. I've not restarted the OS. Any way to > recover the data? > > Thanks, > Abhay > > > On Fri, Aug 24, 2012 at 1:01 PM, Bejoy KS wrote: >> >> Hi Abhay >> >> What is the value for hadoop.tmp.dir or dfs.name.dir . If it was set to >> /tmp the contents would be deleted on an OS restart. You need to change this >> location before you start your NN. >> Regards >> Bejoy KS >> >> Sent from handheld, please excuse typos. >> >> From: Abhay Ratnaparkhi >> Date: Fri, 24 Aug 2012 12:58:41 +0530 >> To: >> ReplyTo: user@hadoop.apache.org >> Subject: namenode not starting >> >> Hello, >> >> I had a running hadoop cluster. >> I restarted it and after that namenode is unable to start. I am getting >> error saying that it's not formatted. :( >> Is it possible to recover the data on HDFS? >> >> 2012-08-24 03:17:55,378 ERROR >> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem >> initialization failed. >> java.io.IOException: NameNode is not formatted. 
>> at >> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:434) >> at >> org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:110) >> at >> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:291) >> at >> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.(FSNamesystem.java:270) >> at >> org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:271) >> at >> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:303) >> at >> org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:433) >> at >> org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:421) >> at >> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1359) >> at >> org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1368) >> 2012-08-24 03:17:55,380 ERROR >> org.apache.hadoop.hdfs.server.namenode.NameNode: java.io.IOException: >> NameNode is not formatted. 
>> at >> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:434) >> at >> org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:110) >> at >> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:291) >> at >> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.(FSNamesystem.java:270) >> at >> org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:271) >> at >> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:303) >> at >> org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:433) >> at >> org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:421) >> at >> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1359) >> at >> org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1368) >> >> Regards, >> Abhay >> >> > -- Håvard Wahl Kongsgård Faculty of Medicine & Department of Mathematical Sciences NTNU http://havard.security-review.net/
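The fix Bejoy describes, keeping the namenode image out of /tmp, looks like this in hdfs-site.xml (the paths are examples; multiple comma-separated directories, e.g. a local disk plus the NFS mount mentioned above, give the namenode redundant copies of the image):

```xml
<!-- hdfs-site.xml: keep the namenode metadata off /tmp (example paths) -->
<property>
  <name>dfs.name.dir</name>
  <value>/var/lib/hadoop/name,/mnt/nfs/hadoop/name</value>
</property>
```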
Re: Install Hive and Pig
you also have CDH3, https://ccp.cloudera.com/display/CDHDOC/Pig+Installation https://ccp.cloudera.com/display/CDHDOC/Hive+Installation -Håvard On Thu, Aug 23, 2012 at 5:57 PM, rajesh bathala wrote: > Hi Friends, > > I am new to Hadoop. Can you please let us know how to install Hive and Pig? > > Thank you in advance. > > Thanks > Rajesh > -- Håvard Wahl Kongsgård Faculty of Medicine & Department of Mathematical Sciences NTNU http://havard.security-review.net/
Re: Hadoop on EC2 Managing Internal/External IPs
Hi, a VPN or simply first uploading the files to an EC2 node is the best option, but an alternative is to use the external interface/IP instead of the internal one in the hadoop config. I assume this will be slower and more costly... -Håvard On Fri, Aug 24, 2012 at 4:54 AM, igor Finkelshteyn wrote: > I've seen a bunch of people with this exact same question all over Google > with no answers. I know people have successful non-temporary clusters in EC2. > Is there really no one that's needed to deal with having EC2 expose external > addresses instead of internal addresses before? This seems like it should be > a common thing. > > On Aug 23, 2012, at 12:34 PM, igor Finkelshteyn wrote: > >> Hi, >> I'm currently setting up a Hadoop cluster on EC2, and everything works just >> fine when accessing the cluster from inside EC2, but as soon as I try to do >> something like upload a file from an external client, I get timeout errors >> like: >> >> 12/08/23 12:06:16 ERROR hdfs.DFSClient: Failed to close file >> /user/some_file._COPYING_ >> java.net.SocketTimeoutException: 65000 millis timeout while waiting for >> channel to be ready for connect. ch : >> java.nio.channels.SocketChannel[connection-pending remote=/10.123.x.x:50010] >> >> What's clearly happening is my NameNode is resolving my DataNode's IPs to >> their internal EC2 values instead of their external values, and then sending >> along the internal IP to my external client, which is obviously unable to >> reach those. I'm thinking this must be a common problem. How do other people >> deal with it? Is there a way to just force my name node to send along my >> DataNode's hostname instead of IP, so that the hostname can be resolved >> properly from whatever box will be sending files? >> >> Eli > -- Håvard Wahl Kongsgård Faculty of Medicine & Department of Mathematical Sciences NTNU http://havard.security-review.net/
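For the "send along the hostname instead of the IP" request specifically: newer Hadoop releases (HDFS-3150) added a client-side switch for exactly this; whether it is available depends on your version. With EC2 public DNS names resolving to the public IP from outside and the private IP from inside, it addresses this case:

```xml
<!-- Client-side hdfs-site.xml; requires a release with HDFS-3150.
     Datanodes must be addressable by their EC2 public DNS names. -->
<property>
  <name>dfs.client.use.datanode.hostname</name>
  <value>true</value>
</property>
```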
Re: Reading multiple lines from a microsoft doc in hadoop
Hi, maybe you should check out the old nutch project http://nutch.apache.org/ (hadoop was developed for nutch). It's a web crawler and indexer, but the mailing lists hold much info on doc/pdf parsing which also relates to hadoop. Have never parsed many docx or doc files, but it should be straightforward. But generally for text analysis preprocessing is the KEY! For example, replacing double line breaks (\r\n\r\n or \n\n) with a record separator is a simple trick. -Håvard On Fri, Aug 24, 2012 at 9:30 AM, Siddharth Tiwari wrote: > Hi, > Thank you for the suggestion. Actually I was using poi to extract text, but > since now I have so many documents I thought I will use hadoop directly > to parse as well. Average size of each document is around 120 kb. Also I > want to read multiple lines from the text until I find a blank line. I do > not have any idea about how to design custom input format and record reader. > Please help with some tutorial, code or resource around it. I am > struggling with the issue. I will be highly grateful. Thank you so much once > again > >> Date: Fri, 24 Aug 2012 08:07:39 +0200 >> Subject: Re: Reading multiple lines from a microsoft doc in hadoop >> From: haavard.kongsga...@gmail.com >> To: user@hadoop.apache.org > >> >> It's much easier if you convert the documents to text first >> >> use >> http://tika.apache.org/ >> >> or some other doc parser >> >> >> -Håvard >> >> On Fri, Aug 24, 2012 at 7:52 AM, Siddharth Tiwari >> wrote: >> > hi, >> > I have doc files in msword doc and docx format. These have entries which >> > are >> > separated by an empty line. Is it possible for me to read >> > these lines separated from empty lines at a time. Also which input format >> > shall I use to read doc docx. Please help >> > >> > ** >> > Cheers !!! >> > Siddharth Tiwari >> > Have a refreshing day !!! 
>> > "Every duty is holy, and devotion to duty is the highest form of worship >> > of >> > God.” >> > "Maybe other people will try to limit me but I don't limit myself" >> >> >> >> -- >> Håvard Wahl Kongsgård >> Faculty of Medicine & >> Department of Mathematical Sciences >> NTNU >> >> http://havard.security-review.net/ -- Håvard Wahl Kongsgård Faculty of Medicine & Department of Mathematical Sciences NTNU http://havard.security-review.net/
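The preprocessing step suggested above can be sketched like this: once Tika (or POI) has turned a .doc/.docx into plain text, splitting on runs of blank lines yields one record per entry, with no custom InputFormat or RecordReader needed afterwards:

```python
# Preprocessing sketch: split extracted document text into records,
# where entries are separated by one or more blank lines.
def split_records(text):
    """Return a list of records, each a block of non-blank lines."""
    records = []
    current = []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            # Blank line ends the current record.
            records.append("\n".join(current))
            current = []
    if current:
        records.append("\n".join(current))
    return records
```

Each record can then be emitted as one line (with internal newlines replaced by a separator) so plain TextInputFormat handles it.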
Re: Reading multiple lines from a microsoft doc in hadoop
It's much easier if you convert the documents to text first use http://tika.apache.org/ or some other doc parser -Håvard On Fri, Aug 24, 2012 at 7:52 AM, Siddharth Tiwari wrote: > hi, > I have doc files in msword doc and docx format. These have entries which are > separated by an empty line. Is it possible for me to read > these lines separated from empty lines at a time. Also which input format > shall I use to read doc docx. Please help > > ** > Cheers !!! > Siddharth Tiwari > Have a refreshing day !!! > "Every duty is holy, and devotion to duty is the highest form of worship of > God.” > "Maybe other people will try to limit me but I don't limit myself" -- Håvard Wahl Kongsgård Faculty of Medicine & Department of Mathematical Sciences NTNU http://havard.security-review.net/
Re: pipes(pydoop) and hbase classpath
however, when run hadoop pipes -conf myconf_job.conf -input name_of_table -output /tmp/out I don't get any error, hadoop just stalls with 12/08/15 11:27:54 INFO zookeeper.ZooKeeper: Client environment:zookeeper.version=3.3.5-cdh3u4--1, built on 05/07/2012 21:08 GMT 12/08/15 11:27:54 INFO zookeeper.ZooKeeper: Client environment:host.name=kongs1 12/08/15 11:27:54 INFO zookeeper.ZooKeeper: Client environment:java.version=1.6.0_31 12/08/15 11:27:54 INFO zookeeper.ZooKeeper: Client environment:java.vendor=Sun Microsystems Inc. 12/08/15 11:27:54 INFO zookeeper.ZooKeeper: Client environment:java.home=/usr/lib/jvm/java-6-sun-1.6.0.31/jre 12/08/15 11:27:54 INFO zookeeper.ZooKeeper: Client environment:java.class.path=/usr/lib/hadoop-0.20/conf:/usr/lib/jvm/java-6-sun//lib/tools.jar:/usr/lib/hadoop-0.20:/usr/lib/hadoop-0.20/hadoop-core-0.20.2-cdh3u4.jar:/usr/lib/hadoop-0.20/lib/ant-contrib-1.0b3.jar:/usr/lib/hadoop-0.20/lib/aspectjrt-1.6.5.jar:/usr/lib/hadoop-0.20/lib/aspectjtools-1.6.5.jar:/usr/lib/hadoop-0.20/lib/commons-cli-1.2.jar:/usr/lib/hadoop-0.20/lib/commons-codec-1.4.jar:/usr/lib/hadoop-0.20/lib/commons-daemon-1.0.1.jar:/usr/lib/hadoop-0.20/lib/commons-el-1.0.jar:/usr/lib/hadoop-0.20/lib/commons-httpclient-3.1.jar:/usr/lib/hadoop-0.20/lib/commons-lang-2.4.jar:/usr/lib/hadoop-0.20/lib/commons-logging-1.0.4.jar:/usr/lib/hadoop-0.20/lib/commons-logging-api-1.0.4.jar:/usr/lib/hadoop-0.20/lib/commons-net-3.1.jar:/usr/lib/hadoop-0.20/lib/core-3.1.1.jar:/usr/lib/hadoop-0.20/lib/guava-r09-jarjar.jar:/usr/lib/hadoop-0.20/lib/hadoop-fairscheduler-0.20.2-cdh3u4.jar:/usr/lib/hadoop-0.20/lib/hsqldb-1.8.0.10.jar:/usr/lib/hadoop-0.20/lib/jackson-core-asl-1.5.2.jar:/usr/lib/hadoop-0.20/lib/jackson-mapper-asl-1.5.2.jar:/usr/lib/hadoop-0.20/lib/jasper-compiler-5.5.12.jar:/usr/lib/hadoop-0.20/lib/jasper-runtime-5.5.12.jar:/usr/lib/hadoop-0.20/lib/jets3t-0.6.1.jar:/usr/lib/hadoop-0.20/lib/jetty-6.1.26.cloudera.1.jar:/usr/lib/hadoop-0.20/lib/jetty-servlet-tester-6.1.26.cloudera.1.jar:/usr
/lib/hadoop-0.20/lib/jetty-util-6.1.26.cloudera.1.jar:/usr/lib/hadoop-0.20/lib/jsch-0.1.42.jar:/usr/lib/hadoop-0.20/lib/junit-4.5.jar:/usr/lib/hadoop-0.20/lib/kfs-0.2.2.jar:/usr/lib/hadoop-0.20/lib/log4j-1.2.15.jar:/usr/lib/hadoop-0.20/lib/mockito-all-1.8.2.jar:/usr/lib/hadoop-0.20/lib/oro-2.0.8.jar:/usr/lib/hadoop-0.20/lib/servlet-api-2.5-20081211.jar:/usr/lib/hadoop-0.20/lib/servlet-api-2.5-6.1.14.jar:/usr/lib/hadoop-0.20/lib/slf4j-api-1.4.3.jar:/usr/lib/hadoop-0.20/lib/slf4j-log4j12-1.4.3.jar:/usr/lib/hadoop-0.20/lib/xmlenc-0.52.jar:/usr/lib/hadoop-0.20/lib/jsp-2.1/jsp-2.1.jar:/usr/lib/hadoop-0.20/lib/jsp-2.1/jsp-api-2.1.jar:/usr/lib/hbase/hbase-0.90.6-cdh3u4.jar:/usr/lib/zookeeper/zookeeper-3.3.5-cdh3u4.jar 12/08/15 11:27:54 INFO zookeeper.ZooKeeper: Client environment:java.library.path=/usr/lib/hadoop-0.20/lib/native/Linux-amd64-64 12/08/15 11:27:54 INFO zookeeper.ZooKeeper: Client environment:java.io.tmpdir=/tmp 12/08/15 11:27:54 INFO zookeeper.ZooKeeper: Client environment:java.compiler= 12/08/15 11:27:54 INFO zookeeper.ZooKeeper: Client environment:os.name=Linux 12/08/15 11:27:54 INFO zookeeper.ZooKeeper: Client environment:os.arch=amd64 12/08/15 11:27:54 INFO zookeeper.ZooKeeper: Client environment:os.version=2.6.32-41-server 12/08/15 11:27:54 INFO zookeeper.ZooKeeper: Client environment:user.name=hdfs 12/08/15 11:27:54 INFO zookeeper.ZooKeeper: Client environment:user.home=/usr/lib/hadoop-0.20 12/08/15 11:27:54 INFO zookeeper.ZooKeeper: Client environment:user.dir=/home/havard/d/graph 12/08/15 11:27:54 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=localhost:2181 sessionTimeout=18 watcher=hconnection 12/08/15 11:27:54 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181 12/08/15 11:27:54 INFO zookeeper.ClientCnxn: Socket connection established to localhost/127.0.0.1:2181, initiating session 12/08/15 11:27:54 INFO zookeeper.ClientCnxn: Session establishment complete on server 
localhost/127.0.0.1:2181, sessionid = 0x139266be8b90004, negotiated timeout = 4 -Håvard On Wed, Aug 15, 2012 at 10:01 AM, Håvard Wahl Kongsgård wrote: > Hi, needed to add this as well > > > > hbase.mapred.tablecolumns > col_fam:name > > > -Håvard > > > On Wed, Aug 15, 2012 at 9:42 AM, Håvard Wahl Kongsgård > wrote: >> Hi, my job config is >> >> >> mapred.input.format.class >> org.apache.hadoop.hbase.mapred.TableInputFormat >> >> >> >> hadoop.pipes.java.recordreader >> true >> >> >> >> Exception in thread "main" java.lang.RuntimeException: Error in >> configuring object >> at >> org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93) >> at >> org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64) >> at
Re: pipes(pydoop) and hbase classpath
Hi, needed to add this as well hbase.mapred.tablecolumns col_fam:name -Håvard On Wed, Aug 15, 2012 at 9:42 AM, Håvard Wahl Kongsgård wrote: > Hi, my job config is > > > mapred.input.format.class > org.apache.hadoop.hbase.mapred.TableInputFormat > > > > hadoop.pipes.java.recordreader > true > > > > Exception in thread "main" java.lang.RuntimeException: Error in > configuring object > at > org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93) > at > org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64) > at > org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117) > at org.apache.hadoop.mapred.JobConf.getInputFormat(JobConf.java:596) > at > org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:977) > at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:969) > at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170) > at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:880) > at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:833) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:396) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177) > at > org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:833) > at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:807) > at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1248) > at org.apache.hadoop.mapred.pipes.Submitter.runJob(Submitter.java:248) > at org.apache.hadoop.mapred.pipes.Submitter.run(Submitter.java:479) > at org.apache.hadoop.mapred.pipes.Submitter.main(Submitter.java:494) > Caused by: java.lang.reflect.InvocationTargetException > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > 
at java.lang.reflect.Method.invoke(Method.java:597) > at > org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88) > ... 17 more > Caused by: java.lang.NullPointerException > at > org.apache.hadoop.hbase.mapred.TableInputFormat.configure(TableInputFormat.java:51) > > > should I included the col names? according to the api it's deprecated? > http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapred/TableInputFormat.html > > > -Håvard > > > On Tue, Aug 14, 2012 at 11:17 PM, Harsh J wrote: >> Hi, >> >> Per: >> >>> org.apache.hadoop.hbase.mapreduce.TableInputFormat not >> org.apache.hadoop.mapred.InputFormat >> >> Pydoop seems to be expecting you to pass it an old API class for >> InputFormat/etc. but you've passed in the newer class. I am unsure >> what part of your code exactly may be at fault since I do not have >> access to it, but you probably want to use the deprecated >> org.apache.hadoop.hbase.mapred.* package classes such as >> org.apache.hadoop.hbase.mapred.TableInputFormat, and not the >> org.apache.hadoop.hbase.mapreduce.* classes, as you are using at the >> moment. >> >> HTH! >> >> On Wed, Aug 15, 2012 at 2:39 AM, Håvard Wahl Kongsgård >> wrote: >>> Hi, I'am trying to read hbase key-values with pipes(pydoop). As hadoop >>> is unable to find the hbase jar files. I get >>> >>> Exception in thread "main" java.lang.RuntimeException: >>> java.lang.RuntimeException: class >>> org.apache.hadoop.hbase.mapreduce.TableInputFormat not >>> org.apache.hadoop.mapred.InputFormat >>> >>> have added export >>> HADOOP_CLASSPATH=/usr/lib/hbase/hbase-0.90.6-cdh3u4.jar to my >>> hadoop-env.sh >>> >>> According to the doc from cloudera, >>> https://ccp.cloudera.com/display/CDHDOC/HBase+Installation#HBaseInstallation-UsingMapReducewithHBase >>> TableMapReduceUtil.addDependencyJars(job); can be used as an >>> alternative. But is that possible with pipes? 
>>> >>> -Håvard >> >> >> >> -- >> Harsh J > > > > -- > Håvard Wahl Kongsgård > Faculty of Medicine & > Department of Mathematical Sciences > NTNU > > http://havard.security-review.net/
Re: pipes(pydoop) and hbase classpath
Hi, my job config is mapred.input.format.class org.apache.hadoop.hbase.mapred.TableInputFormat hadoop.pipes.java.recordreader true Exception in thread "main" java.lang.RuntimeException: Error in configuring object at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117) at org.apache.hadoop.mapred.JobConf.getInputFormat(JobConf.java:596) at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:977) at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:969) at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170) at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:880) at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:833) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177) at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:833) at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:807) at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1248) at org.apache.hadoop.mapred.pipes.Submitter.runJob(Submitter.java:248) at org.apache.hadoop.mapred.pipes.Submitter.run(Submitter.java:479) at org.apache.hadoop.mapred.pipes.Submitter.main(Submitter.java:494) Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88) ... 
17 more Caused by: java.lang.NullPointerException at org.apache.hadoop.hbase.mapred.TableInputFormat.configure(TableInputFormat.java:51) should I included the col names? according to the api it's deprecated? http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapred/TableInputFormat.html -Håvard On Tue, Aug 14, 2012 at 11:17 PM, Harsh J wrote: > Hi, > > Per: > >> org.apache.hadoop.hbase.mapreduce.TableInputFormat not > org.apache.hadoop.mapred.InputFormat > > Pydoop seems to be expecting you to pass it an old API class for > InputFormat/etc. but you've passed in the newer class. I am unsure > what part of your code exactly may be at fault since I do not have > access to it, but you probably want to use the deprecated > org.apache.hadoop.hbase.mapred.* package classes such as > org.apache.hadoop.hbase.mapred.TableInputFormat, and not the > org.apache.hadoop.hbase.mapreduce.* classes, as you are using at the > moment. > > HTH! > > On Wed, Aug 15, 2012 at 2:39 AM, Håvard Wahl Kongsgård > wrote: >> Hi, I'am trying to read hbase key-values with pipes(pydoop). As hadoop >> is unable to find the hbase jar files. I get >> >> Exception in thread "main" java.lang.RuntimeException: >> java.lang.RuntimeException: class >> org.apache.hadoop.hbase.mapreduce.TableInputFormat not >> org.apache.hadoop.mapred.InputFormat >> >> have added export >> HADOOP_CLASSPATH=/usr/lib/hbase/hbase-0.90.6-cdh3u4.jar to my >> hadoop-env.sh >> >> According to the doc from cloudera, >> https://ccp.cloudera.com/display/CDHDOC/HBase+Installation#HBaseInstallation-UsingMapReducewithHBase >> TableMapReduceUtil.addDependencyJars(job); can be used as an >> alternative. But is that possible with pipes? >> >> -Håvard > > > > -- > Harsh J -- Håvard Wahl Kongsgård Faculty of Medicine & Department of Mathematical Sciences NTNU http://havard.security-review.net/
pipes(pydoop) and hbase classpath
Hi, I'm trying to read hbase key-values with pipes (pydoop). Hadoop is unable to find the hbase jar files; I get Exception in thread "main" java.lang.RuntimeException: java.lang.RuntimeException: class org.apache.hadoop.hbase.mapreduce.TableInputFormat not org.apache.hadoop.mapred.InputFormat I have added export HADOOP_CLASSPATH=/usr/lib/hbase/hbase-0.90.6-cdh3u4.jar to my hadoop-env.sh According to the doc from cloudera, https://ccp.cloudera.com/display/CDHDOC/HBase+Installation#HBaseInstallation-UsingMapReducewithHBase TableMapReduceUtil.addDependencyJars(job); can be used as an alternative. But is that possible with pipes? -Håvard
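One detail worth noting for the classpath part: rather than hard-coding a single jar, the hbase launcher script can emit its complete classpath (including zookeeper and its other dependencies), which is usually what the classpath errors in this thread are missing:

```shell
# hadoop-env.sh — pull in HBase's full classpath instead of one jar.
# `hbase classpath` is provided by the hbase launcher script on
# CDH3-era installs; adjust if your distribution lacks it.
export HADOOP_CLASSPATH="$(hbase classpath):${HADOOP_CLASSPATH}"
```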