Re: current line number as key?

2011-05-18 Thread Jonathan Coveney
To the best of my knowledge, the only way to do this is if you have
fixed-width records.

Think about it this way: as Alexandra mentioned, you only get a byte
offset. If you split one file among 50 mappers, each mapper has its
offset, but it has no idea what that offset means relative to the rest of
the file, because it does not know how many lines came before its split.
Finding lines inherently involves a full scan, unless a) the record width
is fixed, or b) you run a job beforehand that explicitly writes the line
number into the document.

I would think about what you want to do and whether it is possible to
avoid making it line dependent, or whether you can make each row a fixed
number of bytes...
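Below is a minimal sketch of the fixed-width approach, assuming every record
is exactly RECORD_WIDTH bytes including the newline; the class name and the
constant are illustrative, not part of any Hadoop API. Since TextInputFormat
already hands the mapper the byte offset of each line start as its key, the
global line number is just offset / width, and you don't even need a custom
InputFormat:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class FixedWidthLineMapper
        extends Mapper<LongWritable, Text, LongWritable, Text> {

      // Every record must be exactly this many bytes, newline included.
      private static final long RECORD_WIDTH = 128L;

      @Override
      protected void map(LongWritable byteOffset, Text line, Context context)
          throws IOException, InterruptedException {
        // Valid only for fixed-width records: a line starting at byte N is
        // always line N / RECORD_WIDTH, no matter which split it arrived in.
        long lineNumber = byteOffset.get() / RECORD_WIDTH;
        context.write(new LongWritable(lineNumber), line);
      }
    }

If the records are not fixed width, option b) applies: run one sequential
pass beforehand that writes the line number into each line, and key on that
field afterwards.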

2011/5/18 Alexandra Anghelescu 

> Hi,
>
> It is hard to pick up certain lines of a text file - globally, I mean.
> Remember that the file is split according to its size (byte boundaries),
> not lines, so it is possible to keep track of the lines inside a split,
> but not globally for the whole file once it is split among map tasks... I
> don't think it is possible. I am new to Hadoop, but that is my take on it.
>
> Alexandra
>
> On Wed, May 18, 2011 at 2:41 PM, bnonymous  wrote:
>
> >
> > Hello,
> >
> > I'm trying to pick up certain lines of a text file (say, the 1st and the
> > 110th line of a file with 10^10 lines). I need an InputFormat that gives
> > the Mapper the line number as the key.
> >
> > I tried to implement a RecordReader, but I can't get the line
> > information from the InputSplit.
> >
> > Any solution to this???
> >
> > Thanks in advance!!!
> > --
> > View this message in context:
> >
> http://old.nabble.com/current-line-number-as-key--tp31649694p31649694.html
> > Sent from the Hadoop core-user mailing list archive at Nabble.com.
> >
> >
>


Re: Installing Hadoop

2011-05-23 Thread Jonathan Coveney
try: http://hadoop.apache.org/common/docs/r0.20.203.0/cluster_setup.html ?

(In 0.20 and later, the old hadoop-site.xml was split into core-site.xml,
hdfs-site.xml, and mapred-site.xml, which is why you no longer see it in conf/.)

2011/5/23 jgroups 

>
> I am trying to install Hadoop in a cluster environment with multiple
> nodes, following the instructions from
>
> http://hadoop.apache.org/common/docs/r0.17.0/cluster_setup.html
>
> That page refers to hadoop-site.xml, but I don't see that file in
> /hadoop-0.20.203.0/conf. Are there more up-to-date installation
> instructions somewhere else?
> --
> View this message in context:
> http://old.nabble.com/Installing-Hadoop-tp31683812p31683812.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>


Re: Help with pigsetup

2011-05-26 Thread Jonathan Coveney
I'll repost it here then :)

"Here is what I had to do to get pig running with a different version of
Hadoop (in my case it was the Cloudera build, but I'd try the same here):

build pig-withouthadoop.jar by running "ant jar-withouthadoop". Then, when
you run Pig, put pig-withouthadoop.jar on your classpath, as well as your
Hadoop jar. In my case, I found that scripts only worked if I additionally
registered the antlr jar manually:

register /path/to/pig/build/ivy/lib/Pig/antlr-runtime-3.2.jar;"
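For concreteness, a minimal command sketch of those steps; the paths and jar
names are illustrative and will vary with your checkout and install
(PIG_CLASSPATH is the extra-classpath hook honored by the bin/pig script):

    # build a Pig jar that does not bundle its own Hadoop
    cd /path/to/pig-0.8.1
    ant jar-withouthadoop

    # run Pig against the cluster's own Hadoop jar and configuration
    export PIG_CLASSPATH=/path/to/pig-0.8.1/pig-withouthadoop.jar:$HADOOP_HOME/hadoop-core-0.20.203.0.jar:$HADOOP_HOME/conf
    bin/pig -x mapreduce myscript.pig

If script parsing then fails, add the register line above to the top of the
script.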

2011/5/26 Mohit Anchlia 

> For some reason I don't see that reply from Jonathan in my inbox; I'll
> try to Google it.
>
> What should my next step be in that case? Does that mean I can't use Pig?
>
> On Thu, May 26, 2011 at 10:00 AM, Harsh J  wrote:
> > I think Jonathan Coveney's reply on user@pig answered your question.
> > It's basically an issue of a Hadoop version difference between the one
> > the Pig 0.8.1 release was bundled with and the newer Hadoop 0.20.203
> > release.
> >
> > On Thu, May 26, 2011 at 10:26 PM, Mohit Anchlia  wrote:
> >> I sent this to the Apache Pig user mailing list but have gotten no
> >> response; I'm not sure if that list is still active.
> >>
> >> I thought I would post here in case someone is able to help me.
> >>
> >> I am in the process of installing and learning Pig. I have a Hadoop
> >> cluster, and when I try to run Pig in mapreduce mode it errors out:
> >>
> >> The Hadoop version is hadoop-0.20.203.0 and the Pig version is pig-0.8.1.
> >>
> >> Error before Pig is launched
> >> ----------------------------
> >> ERROR 2999: Unexpected internal error. Failed to create DataStorage
> >>
> >> java.lang.RuntimeException: Failed to create DataStorage
> >>   at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:75)
> >>   at org.apache.pig.backend.hadoop.datastorage.HDataStorage.<init>(HDataStorage.java:58)
> >>   at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:214)
> >>   at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:134)
> >>   at org.apache.pig.impl.PigContext.connect(PigContext.java:183)
> >>   at org.apache.pig.PigServer.<init>(PigServer.java:226)
> >>   at org.apache.pig.PigServer.<init>(PigServer.java:215)
> >>   at org.apache.pig.tools.grunt.Grunt.<init>(Grunt.java:55)
> >>   at org.apache.pig.Main.run(Main.java:452)
> >>   at org.apache.pig.Main.main(Main.java:107)
> >> Caused by: java.io.IOException: Call to dsdb1/172.18.60.96:54310 failed on local exception: java.io.EOFException
> >>   at org.apache.hadoop.ipc.Client.wrapException(Client.java:775)
> >>   at org.apache.hadoop.ipc.Client.call(Client.java:743)
> >>   at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
> >>   at $Proxy0.getProtocolVersion(Unknown Source)
> >>   at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
> >>   at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:106)
> >>   at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:207)
> >>   at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:170)
> >>   at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82)
> >>   at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378)
> >>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
> >>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
> >>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
> >>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95)
> >>   at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:72)
> >>   ... 9 more
> >> Caused by: java.io.EOFException
> >>   at java.io.DataInputStream.readInt(DataInputStream.java:375)
> >>   at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501)
> >>   at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)
> >>
> >
> >
> >
> > --
> > Harsh J
> >
>


Re: Which release to use?

2011-07-15 Thread Jonathan Coveney
Isaac: there is no longer a Yahoo branch; they are committing all of their
code to Apache.

2011/7/15 Isaac Dooley 

> Will 0.23 include Kerberos authentication? Will this finally unite the
> Yahoo and Apache branches?
>
> -Original Message-
> From: Arun C Murthy [mailto:a...@hortonworks.com]
> Sent: Thursday, July 14, 2011 7:43 PM
> To: common-user@hadoop.apache.org
> Subject: Re: Which release to use?
>
> Hi,
>
>  0.20.203 is the latest stable release; it includes a ton of features
> (notably security, i.e. Kerberos-based authentication) and fixes. It's
> currently deployed on over 50k machines at Yahoo, too.
>  So, yes, I'd encourage you to use 0.20.203. We, the community, are
> currently working on hadoop-0.23 and hope to get it out soon.
>
> thanks,
> Arun
>
> On Jul 14, 2011, at 4:33 PM, Teruhiko Kurosaka wrote:
>
> > I'm a newbie and I am confused by the Hadoop releases.
> > I thought 0.21.0 was the latest & greatest release that I
> > should be using, but I noticed that 0.20.203 was released
> > recently, and 0.21.X is marked "unstable, unsupported".
> >
> > Should I be using 0.20.203?
> > 
> > T. "Kuro" Kurosaka
> >
> >
>
>