Re: Parallel ALS-WR on very large matrix -- crashing (I think)
Sorry to re-surface. When I try to evaluate the factorization, I am now running into this error:

java.lang.ArrayIndexOutOfBoundsException: 2
    at org.apache.mahout.cf.taste.hadoop.als.FactorizationEvaluator$PredictRatingsMapper.map(FactorizationEvaluator.java:137)
    at org.apache.mahout.cf.taste.hadoop.als.FactorizationEvaluator$PredictRatingsMapper.map(FactorizationEvaluator.java:117)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:629)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:310)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)

However, this index (assuming it is a userID?) does exist in both the training set and the test set. (Not sure if that matters?)
Re: Parallel ALS-WR on very large matrix -- crashing (I think)
I have seen this happen in normal operation when the sorting on the mapper side takes a very long time because the output is large. You can tell Hadoop to increase the timeout. If this is what is happening, you won't have a chance to update a counter as a keep-alive ping, but yes, that is generally right otherwise. If this is the case, a mapper is outputting a whole lot of data, perhaps 'too much'. I don't know for sure; just another guess for the pile.

On Thu, Feb 2, 2012 at 1:44 AM, Ted Dunning ted.dunn...@gmail.com wrote:

Status reporting happens automatically when output is generated. In a long computation, it is good form to occasionally update a counter or otherwise indicate that the computation is still progressing.
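To make Ted's suggestion concrete, here is a minimal sketch of a mapper that pings the framework during a long per-record computation. The expensiveStep() method and the counter group/name are made up for illustration; context.progress() and counters are the standard keep-alive mechanisms.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Sketch: keep the TaskTracker from timing out a task that does a long
    // computation per record. The work and counter names are hypothetical.
    public class LongComputationMapper
        extends Mapper<IntWritable, Text, IntWritable, Text> {

      @Override
      protected void map(IntWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        for (int iteration = 0; iteration < 1000; iteration++) {
          expensiveStep(value);                   // stand-in for the real work
          if (iteration % 100 == 0) {
            context.progress();                   // keep-alive ping
            context.getCounter("ALS", "iterations").increment(100);
          }
        }
        context.write(key, value);
      }

      private void expensiveStep(Text value) {
        // placeholder for a long-running per-record computation
      }
    }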
Re: Parallel ALS-WR on very large matrix -- crashing (I think)
Nicholas, can you give us the detailed arguments you start the job with? I'd especially be interested in the number of features (--numFeatures) you use. Do you run the job on implicit feedback data (--implicitFeedback=true)?

The memory requirements of the job are the following: in each iteration, either the item-features matrix (items x features) or the user-features matrix (users x features) is loaded into the memory of each mapper. Then the original user-item matrix (or its transpose) is read row-wise by the mappers, and they recompute the features via AlternatingLeastSquaresSolver/ImplicitFeedbackAlternatingLeastSquaresSolver.

--sebastian
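To make the sizing concrete, a rough lower bound on the per-mapper heap needed for the cached feature matrix can be computed as below. The 8M-row / 25-feature numbers are the ones discussed elsewhere in this thread, and the estimate ignores JVM object overhead, so the real footprint will be higher.

    // Back-of-envelope heap needed to hold one dense feature matrix in a
    // mapper. Vector wrappers and references add overhead on top of this.
    public class FeatureMatrixSizing {
      public static void main(String[] args) {
        long rows = 8000000L;   // the larger dimension (users)
        int features = 25;      // --numFeatures
        long bytes = rows * features * 8L; // 8 bytes per double
        System.out.printf("%.2f GB minimum for the cached matrix%n",
            bytes / (1024.0 * 1024.0 * 1024.0));
      }
    }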
Re: Parallel ALS-WR on very large matrix -- crashing (I think)
I will up the ante with the timeout and report back -- thanks, all, for the suggestions.

Hey, Sebastian -- here are the arguments I am using:

--input matrix --output ALS --numFeatures 25 --numIterations 10 --lambda 0.065

When the mapper loads the matrix into memory, it only loads the actual non-zero data, correct?

Hey, Ted -- I messed up on the sparsity. Turns out there are only 70M non-zero elements. Oh, and I only have binary data -- I wasn't sure of the implications of running ALS-WR on binary data, and I couldn't find anything to suggest otherwise. I am using data of the format user,item,1. I have read about probabilistic factorization -- which works with binary data -- and perhaps naively thought ALS-WR was similar, so what-the-heck :-) I'd love nothing more than to share the data; however, I'd probably get in some trouble :-) Perhaps I could generate a matrix with a similar distribution? I'll have to check on that and see if it is OK #bureaucracy. Stay tuned...
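On the idea of generating a matrix with a similar distribution: something like the following could emit synthetic user,item,1 triples with a heavy-tailed items-per-user count, so the shape of the data can be shared without the content. All constants here are placeholders, and the degree distribution is a guess, not a fit to the real data.

    import java.io.PrintWriter;
    import java.util.Random;

    // Sketch: synthetic user,item,1 data with a skewed row-degree
    // distribution. Sizes and the exp() degree model are hypothetical.
    public class SyntheticMatrix {
      public static void main(String[] args) throws Exception {
        Random random = new Random(42);
        PrintWriter out = new PrintWriter("synthetic.csv");
        int users = 100000;   // scaled down from 8M for a quick test
        int items = 12500;    // keeps the same 8:1 aspect ratio
        for (int user = 0; user < users; user++) {
          // most users get a few items, a few users get many
          int degree = (int) Math.min(items,
              1 + Math.floor(Math.exp(4 * random.nextDouble())));
          for (int i = 0; i < degree; i++) {
            out.println(user + "," + random.nextInt(items) + ",1");
          }
        }
        out.close();
      }
    }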
Re: Parallel ALS-WR on very large matrix -- crashing (I think)
Sounds good, thanks Sebastian. The interesting thing is: I tried sampling the matrix down to about 10% of the non-zeros at one point, and that worked with no problem.

On Thu, Feb 2, 2012 at 8:31 AM, Sebastian Schelter s...@apache.org wrote:

Your parameters look good, except that if you have binary data, you should set --implicitFeedback=true. You could also set numFeatures to a very small value (like 5) just to see if that helps. The mappers load one of the feature matrices into memory, and these are dense (#items x #features entries or #users x #features entries). Are you sure that the mappers have enough memory for that? It's really strange that you have problems with such small data; I tested this with Netflix (~100M non-zeros) on a few machines and it worked quite well.

--sebastian
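For concreteness, Sebastian's suggested changes could be applied like this when launching the factorization from Java. This is a sketch: the input/output paths and option spellings are taken from this thread, and the exact option parsing may differ by Mahout version, so verify against your build.

    import org.apache.hadoop.util.ToolRunner;
    import org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob;

    // Sketch: launch the job with a small numFeatures and implicit
    // feedback enabled, per Sebastian's advice. Paths are from upthread.
    public class LaunchFactorization {
      public static void main(String[] args) throws Exception {
        String[] jobArgs = {
            "--input", "matrix",
            "--output", "ALS",
            "--numFeatures", "5",         // small value, just to test
            "--numIterations", "10",
            "--lambda", "0.065",
            "--implicitFeedback", "true"  // binary data => implicit feedback
        };
        int exitCode = ToolRunner.run(new ParallelALSFactorizationJob(), jobArgs);
        System.exit(exitCode);
      }
    }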
Re: Parallel ALS-WR on very large matrix -- crashing (I think)
Hi Nicholas,

On Feb 2, 2012, at 10:56am, Nicholas Kolegraff wrote:

OK, I took a bit deeper look into this, having changed some parameters and kicked off the new job. It seems plausible that I didn't have enough memory for some of the mappers -- unless I'm missing something here. An upper bound on the memory would be (assuming my original parameter of 25 features): 8M rows * 25 features = 200M entries; multiply by 8 bytes (double precision floating point) and we get 1.6 billion bytes; 1.6B / (1024^3) = ~1.5GB of memory needed. The TaskTracker and DataNode heap sizes were only set to 1GB.

The memory you need for this task is based on the mapred.child.java.opts setting (the -Xmx setting), not what's allocated for the NameNode, JobTracker, DataNode or TaskTracker. In fact, increasing the DataNode and TaskTracker heap sizes removes memory that could/should be used by the child JVMs that the TaskTracker creates to run your map-reduce tasks. Currently it looks like you have 4GB allocated for m2.2xlarge tasks, which should be sufficient given your analysis above.

-- Ken

So I have changed the bootstrap action on EC2 as follows (this is a diff between the original and my changes):

# Parameters of the array:
# [mapred.child.java.opts, mapred.tasktracker.map.tasks.maximum, mapred.tasktracker.reduce.tasks.maximum]
29c29
< m2.2xlarge = [-Xmx4096m, 6, 2],
---
> m2.2xlarge = [-Xmx8192m, 3, 2],

# Parameters of the array (vars modified in hadoop-env.sh)
# [HADOOP_JOBTRACKER_HEAPSIZE, HADOOP_NAMENODE_HEAPSIZE, HADOOP_TASKTRACKER_HEAPSIZE, HADOOP_DATANODE_HEAPSIZE]
47c47
< m2.2xlarge = [2048, 8192, 1024, 1024],
---
> m2.2xlarge = [4096, 16384, 2048, 2048]

On Thu, Feb 2, 2012 at 8:40 AM, Sebastian Schelter s...@apache.org wrote:

Hmm, are you sure that the mappers have enough memory? You can set that via -Dmapred.child.java.opts=-Xmx[some number]m

--sebastian
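In code, Sebastian's -Dmapred.child.java.opts suggestion and Ken's point about child versus daemon heaps look roughly like this when set per-job rather than via the EMR bootstrap action. A minimal sketch; the heap value is just the one discussed in this thread.

    import org.apache.hadoop.conf.Configuration;

    // Sketch: the child-task JVM heap, not the daemon heaps, bounds the
    // memory available to the ALS mappers.
    public class ChildHeapConfig {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Heap for each spawned map/reduce task JVM.
        conf.set("mapred.child.java.opts", "-Xmx4096m");
        // Daemon heaps (HADOOP_TASKTRACKER_HEAPSIZE etc.) live in
        // hadoop-env.sh and do NOT add memory for tasks -- they compete
        // with the child JVMs for the same RAM.
        System.out.println(conf.get("mapred.child.java.opts"));
      }
    }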
Re: Parallel ALS-WR on very large matrix -- crashing (I think)
Success! Thanks, all!!

I changed the --numFeatures option to 5 and it went through with no problems. However, the final 'SolveExplicitFeedback' step took a very long time relative to the others, so I suspect the other suggestion -- changing mapred.task.timeout to something much larger than 600 seconds -- would also have fixed the issue (given that you have enough memory).
Re: Parallel ALS-WR on very large matrix -- crashing (I think)
Hi,

This *may* just be a Hadoop issue -- it sounds like the JobTracker is upset that it hasn't heard from one of the workers in too long (over 600 seconds). Can you check your Hadoop administration pages for the cluster? Does the cluster still seem to be functioning? I haven't used Hadoop with EC2, so I'm not sure how difficult it will be to check the cluster :-/ If everything seems to be OK, there's a Hadoop setting that controls how long it's willing to wait before assuming a machine has failed and killing a task.

-Kate

On Wed, Feb 1, 2012 at 5:48 PM, Nicholas Kolegraff nickkolegr...@gmail.com wrote:

Hello,

I am attempting to run parallelALS on a very large matrix on EC2. The matrix is ~8 million x 1 million and very sparse -- 0.007% of the entries have data. I am attempting to run on 8 nodes with 34.2 GB of memory each (m2.2xlarge). (I kept getting OutOfMemory exceptions, so I kept upping the ante until I arrived at the above configuration.)

It makes it through the following jobs with no problem:

ip-10-166-55-151.us-west-1.compute.internal_1328033659670_job_201201311814_0001_hadoop_ParallelALSFactorizationJob-ItemRatingVectorsMappe
ip-10-166-55-151.us-west-1.compute.internal_1328033659670_job_201201311814_0002_hadoop_ParallelALSFactorizationJob-TransposeMapper-Reduce
ip-10-166-55-151.us-west-1.compute.internal_1328033659670_job_201201311814_0003_hadoop_ParallelALSFactorizationJob-AverageRatingMapper-Re
ip-10-166-55-151.us-west-1.compute.internal_1328033659670_job_201201311814_0004_hadoop_ParallelALSFactorizationJob-SolveExplicitFeedbackM
ip-10-166-55-151.us-west-1.compute.internal_1328033659670_job_201201311814_0023_hadoop_ParallelALSFactorizationJob-SolveExplicitFeedbackM

Then it crashes with only the following error message:

Task attempt_201201311814_0023_m_00_0 failed to report status for 600 seconds. Killing!

Each map attempt in the 23rd job ('SolveExplicitFeedback') fails to report its status. I'm not sure what is causing this -- I am still trying to wrap my head around the Mahout API. Could this still be a memory issue? Hopefully I'm not missing something trivial?!
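The setting Kate refers to is mapred.task.timeout, in milliseconds; its default of 600000 matches the "failed to report status for 600 seconds" message. A minimal sketch of raising it per-job follows; the 30-minute value is just an example. The same thing can be passed on the command line as -Dmapred.task.timeout=1800000.

    import org.apache.hadoop.conf.Configuration;

    // Sketch: raise the task keep-alive timeout past the 600 s default.
    public class TaskTimeoutConfig {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.setLong("mapred.task.timeout", 30L * 60 * 1000); // 30 minutes
        System.out.println(conf.getLong("mapred.task.timeout", 600000L));
      }
    }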
Re: Parallel ALS-WR on very large matrix -- crashing (I think)
So the total size of the data is modest, at about 560M non-zero elements. The total data should be small compared to your node sizes. But the distribution of your data can be important as well. Can you say whether any rows or columns are extremely dense?

On Wed, Feb 1, 2012 at 4:58 PM, Kate Ericson eric...@cs.colostate.edu wrote:

The matrix is ~8 million x 1 million, very sparse -- 0.007% has data.
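To answer Ted's question empirically, something like the following can scan a user,item,1 file for unusually dense rows. A rough sketch: the file name and the CSV layout are assumptions based on the format described in this thread, and a real 560M-entry file would want a streaming or MapReduce version instead of an in-memory map.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.HashMap;
    import java.util.Map;

    // Sketch: count entries per row of a user,item,1 CSV and report the
    // densest row. File name "matrix.csv" is hypothetical.
    public class RowDensity {
      public static void main(String[] args) throws Exception {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        BufferedReader in = new BufferedReader(new FileReader("matrix.csv"));
        String line;
        while ((line = in.readLine()) != null) {
          String user = line.split(",")[0];
          Integer c = counts.get(user);
          counts.put(user, c == null ? 1 : c + 1);
        }
        in.close();
        int max = 0;
        for (int c : counts.values()) {
          max = Math.max(max, c);
        }
        System.out.println("densest row has " + max + " entries");
      }
    }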
Re: Parallel ALS-WR on very large matrix -- crashing (I think)
Thanks for the prompt reply, Kate!

The cluster has since been torn down on EC2, but I did monitor it during the job execution and all seemed to be OK -- the JobTracker and NameNode would continue to report status. I was aware of the configuration setting and was hoping to refrain from playing with it :-) I get scared to make it too large, since that time could get unnecessarily charged to my EC2 account. :S

Do you know if it should still report status in the midst of a complex task? It seems questionable that it wouldn't just send a friendly hello?
Re: Parallel ALS-WR on very large matrix -- crashing (I think)
If it's thrashing on something, there's a good chance it might miss a checkpoint. Like Ted brought up, there may be some very dense areas of your input causing this problem. How much memory are you giving to your Hadoop workers? The default value is rather small.

-Kate
Re: Parallel ALS-WR on very large matrix -- crashing (I think)
The most dense row contains about 55K elements of the 1M. There are about 5 other rows with around 10K, and it drops considerably after that for the others (~2K). I am using the memory-intensive bootstrap action on EC2, which bumps the heap space for the child JVMs to around 4GB, I believe.