I'm afraid I don't know the answer; I'd need to experiment a bit more. I have
not used CompositeInputFormat, so I can't comment.

Probably someone else on the ML (mailing list) will be able to guide you here.


On Wed, Jan 16, 2013 at 6:01 PM, Stuti Awasthi <stutiawas...@hcl.com> wrote:

> Thanks Ashish,
>
> So, according to the link, if one is using CompositeInputFormat then it
> will take the entire file as input to a mapper, without considering
> InputSplits/block size.
> If I am understanding it correctly, then it is asking to break [Original
> Input File] -> [file1, file2, ...].
>
> So if my file is [/test/MatrixA] --> [/test/smallfiles/file1,
> /test/smallfiles/file2, /test/smallfiles/file3, ...]
>
> Now, will the input path in MatrixMultiplicationJob be the directory path
> /test/smallfiles ?
>
> Will breaking the file in such a manner cause problems in the algorithmic
> execution of the MR job? I'm not sure the output will be correct.
>
> -----Original Message-----
> From: Ashish [mailto:paliwalash...@gmail.com]
> Sent: Wednesday, January 16, 2013 5:44 PM
> To: user@mahout.apache.org
> Subject: Re: MatrixMultiplicationJob runs with 1 mapper only ?
>
> MatrixMultiplicationJob internally sets InputFormat as CompositeInputFormat
>
> JobConf conf = new JobConf(initialConf, MatrixMultiplicationJob.class);
> conf.setInputFormat(CompositeInputFormat.class);
>
> and AFAIK, CompositeInputFormat ignores the splits. See this
> http://stackoverflow.com/questions/8654200/hadoop-file-splits-compositeinputformat-inner-join
>
> Unfortunately, I don't know any other alternative as of now.
>
>
> On Wed, Jan 16, 2013 at 5:05 PM, Stuti Awasthi <stutiawas...@hcl.com>
> wrote:
>
> > The issue is that currently my matrix is of dimension (100x100k);
> > later it could be (1Mx10M) or bigger.
> >
> > Even now, my job runs with a single mapper for (100x100k) and is not
> > able to complete. As I mentioned, the map task just proceeds to 0.99%
> > and starts spilling the map output. Hence I wanted to tune my job so
> > that Mahout is able to complete it and I can utilize my cluster
> > resources.
> >
> > As MatrixMultiplicationJob is an MR job, it should be able to handle
> > parallel map tasks. I am not sure if there are algorithmic constraints
> > due to which it runs with only a single mapper.
> > I took the reference of this thread so that I could set the
> > Configuration myself rather than getting it with getConf(), but did not
> > have any success:
> >
> > http://lucene.472066.n3.nabble.com/Setting-Number-of-Mappers-and-Reducers-in-DistributedRowMatrix-Jobs-td888980.html
> >
> > Stuti
> >
> > -----Original Message-----
> > From: Sean Owen [mailto:sro...@gmail.com]
> > Sent: Wednesday, January 16, 2013 4:46 PM
> > To: Mahout User List
> > Subject: RE: MatrixMultiplicationJob runs with 1 mapper only ?
> >
> > Why do you need multiple mappers? Is one too slow? Many are not
> > necessarily faster for small input.
> > On Jan 16, 2013 10:46 AM, "Stuti Awasthi" <stutiawas...@hcl.com> wrote:
> >
> > > Hi,
> > > I tried calling it programmatically as well, but I am facing the same
> > > issue: only a single map task is running, and it is spilling the map
> > > output continuously. Hence I'm not able to generate the output for
> > > large matrix multiplication.
> > >
> > > Code Snippet :
> > >
> > > DistributedRowMatrix a = new DistributedRowMatrix(
> > >     new Path("/test/points/matrixA"), new Path("/test/temp"),
> > >     100, 100000);
> > > DistributedRowMatrix b = new DistributedRowMatrix(
> > >     new Path("/test/points/matrixA"), new Path("tempDir"),
> > >     100, 100000);
> > > Configuration conf = new Configuration();
> > > conf.set("fs.default.name", "hdfs://DS-1078D24B4736:10818");
> > > conf.set("mapred.child.java.opts", "-Xmx2048m");
> > > conf.set("mapred.max.split.size", "10485760");
> > > a.setConf(conf);
> > > b.setConf(conf);
> > > a.times(b);
> > >
> > > Where am I going wrong? Any ideas?
> > >
> > > Thanks
> > > Stuti
> > > -----Original Message-----
> > > From: Stuti Awasthi
> > > Sent: Wednesday, January 16, 2013 2:55 PM
> > > To: Mahout User List
> > > Subject: RE: MatrixMultiplicationJob runs with 1 mapper only ?
> > >
> > > Hey Sean,
> > > Thanks for the response. The MatrixMultiplicationJob help shows the
> > > usage like:
> > > usage: <command> [Generic Options] [Job-Specific Options]
> > >
> > > Here, a Generic Option can be provided by "-D <property=value>".
> > > Hence I tried the -D options on the command line, but it seems they
> > > are not having any effect. This is also suggested in:
> > >
> > > https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/common/AbstractJob.html
> > >
> > > After your suggestion, I noticed one thing: currently I am passing
> > > arguments like -D<property=value> rather than -D <property=value>. I
> > > also tried with a space between -D and property=value, but then it
> > > gives an error like:
> > >
> > > 13/01/16 14:21:47 ERROR common.AbstractJob: Unexpected
> > > /test/points/matrixA while processing Job-Specific Options:
> > >
> > > No such error comes if I pass the arguments without a space after -D.
> > >
> > > From Hadoop: The Definitive Guide: "Do not confuse setting Hadoop
> > > properties using the -D property=value option to GenericOptionsParser
> > > (and ToolRunner) with setting JVM system properties using the
> > > -Dproperty=value option to the java command. The syntax for JVM
> > > system properties does not allow any whitespace between the D and
> > > the property name, whereas GenericOptionsParser requires them to be
> > > separated by whitespace."
> > >
> > > Hence I suppose that Generic Options should be parsed as
> > > -D property=value rather than -Dproperty=value.
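The distinction the book draws can be sketched in plain Java. This is a hedged illustration only: `DOptionDemo` and `parseGeneric` are hypothetical stand-ins, not Hadoop's real GenericOptionsParser, and the real parser's behavior may differ across versions.

```java
import java.util.HashMap;
import java.util.Map;

public class DOptionDemo {
    // Simplified stand-in for a GenericOptionsParser-style "-D" handler:
    // "-D" must be its own token, followed by a "property=value" token.
    static Map<String, String> parseGeneric(String[] args) {
        Map<String, String> props = new HashMap<>();
        for (int i = 0; i < args.length - 1; i++) {
            if ("-D".equals(args[i]) && args[i + 1].contains("=")) {
                String[] kv = args[i + 1].split("=", 2);
                props.put(kv[0], kv[1]);
            }
        }
        return props;
    }

    public static void main(String[] args) {
        // Tool-style: "-D" and "property=value" are separate tokens.
        Map<String, String> ok = parseGeneric(
            new String[] {"-D", "mapred.max.split.size=10485760"});
        System.out.println(ok.get("mapred.max.split.size")); // 10485760

        // JVM-style "-Dproperty=value" as one token is not picked up
        // by this kind of parser; it is meant for the java command itself.
        Map<String, String> missed = parseGeneric(
            new String[] {"-Dmapred.max.split.size=10485760"});
        System.out.println(missed.containsKey("mapred.max.split.size")); // false
    }
}
```

The point is only that the two syntaxes are consumed by different layers (the tool's option parser vs. the JVM), which matches the quoted passage.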
> > >
> > > Additionally, I tried -Dmapred.max.split.size=10485760 through the
> > > command line, but again only a single map task started.
> > >
> > > Please suggest.
> > >
> > >
> > > -----Original Message-----
> > > From: Sean Owen [mailto:sro...@gmail.com]
> > > Sent: Wednesday, January 16, 2013 1:23 PM
> > > To: Mahout User List
> > > Subject: Re: MatrixMultiplicationJob runs with 1 mapper only ?
> > >
> > > It's up to Hadoop in the end.
> > >
> > > Try calling FileInputFormat.setMaxInputSplitSize() with a smallish
> > > value, like your 10MB (10000000).
> > >
> > > I don't know if Hadoop params can be set as sys properties like that
> > > anyway?
> > >
> > > On Wed, Jan 16, 2013 at 7:48 AM, Stuti Awasthi
> > > <stutiawas...@hcl.com>
> > > wrote:
> > > > Hi,
> > > >
> > > > I am trying to multiply a dense matrix of size [100 x 100k]. The
> > > > size of the file is 104MB, and with the default block size of 64MB
> > > > only 2 blocks are created. So I reduced the block size to 10MB, and
> > > > now my file is divided into 11 blocks across the cluster. The
> > > > cluster has 10 nodes with 1 NN/JT and 9 DN/TT.
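The block counts above follow from ceil(fileSize / blockSize). A small self-contained sketch, assuming the 104MB figure means 104 * 1024 * 1024 bytes (`SplitCountDemo` is an illustrative name; real HDFS split counts also depend on configured split-size settings, not block size alone):

```java
public class SplitCountDemo {
    // Number of blocks = ceil(fileSize / blockSize), via integer arithmetic.
    static long blocks(long fileSizeBytes, long blockSizeBytes) {
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        long fileSize = 104L * 1024 * 1024;                      // 104 MB
        System.out.println(blocks(fileSize, 64L * 1024 * 1024)); // 2
        System.out.println(blocks(fileSize, 10L * 1024 * 1024)); // 11
    }
}
```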
> > > >
> > > > Every time I run the Mahout MatrixMultiplicationJob through the
> > > > command line, I can see on the JobTracker web UI that only 1 map
> > > > task is launched. According to my understanding of InputSplits,
> > > > there should be 11 map tasks launched. Apart from this, the map
> > > > task stays at 0.99% completion, and in the task logs I can see that
> > > > it is spilling the map output.
> > > >
> > > > Mahout command:
> > > >
> > > > mahout matrixmult -Dmapred.child.java.opts=-Xmx1024M \
> > > >   -Dfs.inmemory.size.mb=200 -Dio.sort.factor=100 -Dio.sort.mb=200 \
> > > >   -Dio.file.buffer.size=131072 \
> > > >   --inputPathA /test/matrixA --numRowsA 100 --numColsA 100000 \
> > > >   --inputPathB /test/matrixA --numRowsB 100 --numColsB 100000 \
> > > >   --tempDir /test/temp
> > > >
> > > > Now I want to know why only 1 map task is launched every time, and
> > > > how I can tune the cluster so that I can perform a dense matrix
> > > > multiplication of the order [90K x 1 Million].
> > > >
> > > > Thanks
> > > > Stuti
> > > >
> > > >
> > >
> >
>
>
>
>



-- 
thanks
ashish

Blog: http://www.ashishpaliwal.com/blog
My Photo Galleries: http://www.pbase.com/ashishpaliwal
