Re: Parallel ALS-WR on very large matrix -- crashing (I think)
Sorry to re-surface. When I try to evaluate the factorization, I am now running into this error:

java.lang.ArrayIndexOutOfBoundsException: 2
    at org.apache.mahout.cf.taste.hadoop.als.FactorizationEvaluator$PredictRatingsMapper.map(FactorizationEvaluator.java:137)
    at org.apache.mahout.cf.taste.hadoop.als.FactorizationEvaluator$PredictRatingsMapper.map(FactorizationEvaluator.java:117)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:629)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:310)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)

However, this index (assuming it is a userID?) does exist in both the training set and the test set. (Not sure if that matters?)
Re: Parallel ALS-WR on very large matrix -- crashing (I think)
I have seen this happen in normal operation when the sorting on the mapper side takes a very long time because the output is large. You can tell Hadoop to increase the timeout. If this is what is happening, you won't have a chance to update a counter as a keep-alive ping, but yes, that is generally right otherwise. If this is the case, a mapper is outputting a whole lot of data, perhaps 'too much'. I don't know for sure; just another guess for the pile.

On Thu, Feb 2, 2012 at 1:44 AM, Ted Dunning ted.dunn...@gmail.com wrote:

Status reporting happens automatically when output is generated. In a long computation, it is good form to occasionally update a counter or otherwise indicate that the computation is still progressing.
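To make Ted's suggestion concrete, here is a minimal sketch of a mapper that pings the framework during a long per-record computation. The expensiveStep() method and the counter group/name are made up for illustration; context.progress() and counters are the standard keep-alive mechanisms.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Sketch: keep the TaskTracker from timing out a task that does a long
    // computation per record. The work and counter names are hypothetical.
    public class LongComputationMapper
        extends Mapper<IntWritable, Text, IntWritable, Text> {

      @Override
      protected void map(IntWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        for (int iteration = 0; iteration < 1000; iteration++) {
          expensiveStep(value);                   // stand-in for the real work
          if (iteration % 100 == 0) {
            context.progress();                   // keep-alive ping
            context.getCounter("ALS", "iterations").increment(100);
          }
        }
        context.write(key, value);
      }

      private void expensiveStep(Text value) {
        // placeholder for a long-running per-record computation
      }
    }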
Re: Parallel ALS-WR on very large matrix -- crashing (I think)
Nicholas, can you give us the detailed arguments you start the job with? I'd especially be interested in the number of features (--numFeatures) you use. Do you run the job on implicit feedback data (--implicitFeedback=true)?

The memory requirements of the job are the following: in each iteration, either the item-features matrix (items x features) or the user-features matrix (users x features) is loaded into the memory of each mapper. Then the original user-item matrix (or its transpose) is read row-wise by the mappers, and they recompute the features via AlternatingLeastSquaresSolver/ImplicitFeedbackAlternatingLeastSquaresSolver.

--sebastian
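To make the sizing concrete, a rough lower bound on the per-mapper heap needed for the cached feature matrix can be computed as below. The 8M-row / 25-feature numbers are the ones discussed elsewhere in this thread, and the estimate ignores JVM object overhead, so the real footprint will be higher.

    // Back-of-envelope heap needed to hold one dense feature matrix in a
    // mapper. Vector wrappers and references add overhead on top of this.
    public class FeatureMatrixSizing {
      public static void main(String[] args) {
        long rows = 8000000L;   // the larger dimension (users)
        int features = 25;      // --numFeatures
        long bytes = rows * features * 8L; // 8 bytes per double
        System.out.printf("%.2f GB minimum for the cached matrix%n",
            bytes / (1024.0 * 1024.0 * 1024.0));
      }
    }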
Re: Parallel ALS-WR on very large matrix -- crashing (I think)
I will up the ante with the timeout and report back -- thanks, all, for the suggestions.

Hey, Sebastian -- here are the arguments I am using:

--input matrix --output ALS --numFeatures 25 --numIterations 10 --lambda 0.065

When the mapper loads the matrix into memory, it only loads the actual non-zero data, correct?

Hey, Ted -- I messed up on the sparsity. Turns out there are only 70M non-zero elements. Oh, and I only have binary data -- I wasn't sure of the implications of running ALS-WR on binary data, and I couldn't find anything to suggest otherwise. I am using data of the format user,item,1. I have read about probabilistic factorization -- which works with binary data -- and perhaps naively thought ALS-WR was similar, so what-the-heck :-) I'd love nothing more than to share the data; however, I'd probably get in some trouble :-) Perhaps I could generate a matrix with a similar distribution? I'll have to check on that and see if it is OK #bureaucracy. Stay tuned...
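On the idea of generating a matrix with a similar distribution: something like the following could emit synthetic user,item,1 triples with a heavy-tailed items-per-user count, so the shape of the data can be shared without the content. All constants here are placeholders, and the degree distribution is a guess, not a fit to the real data.

    import java.io.PrintWriter;
    import java.util.Random;

    // Sketch: synthetic user,item,1 data with a skewed row-degree
    // distribution. Sizes and the exp() degree model are hypothetical.
    public class SyntheticMatrix {
      public static void main(String[] args) throws Exception {
        Random random = new Random(42);
        PrintWriter out = new PrintWriter("synthetic.csv");
        int users = 100000;   // scaled down from 8M for a quick test
        int items = 12500;    // keeps the same 8:1 aspect ratio
        for (int user = 0; user < users; user++) {
          // most users get a few items, a few users get many
          int degree = (int) Math.min(items,
              1 + Math.floor(Math.exp(4 * random.nextDouble())));
          for (int i = 0; i < degree; i++) {
            out.println(user + "," + random.nextInt(items) + ",1");
          }
        }
        out.close();
      }
    }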
Re: Parallel ALS-WR on very large matrix -- crashing (I think)
Sounds good, thanks Sebastian. The interesting thing is: I tried sampling the matrix down to about 10% of the non-zeros at one point, and that worked with no problem.

On Thu, Feb 2, 2012 at 8:31 AM, Sebastian Schelter s...@apache.org wrote:

Your parameters look good, except that if you have binary data, you should set --implicitFeedback=true. You could also set numFeatures to a very small value (like 5) just to see if that helps. The mappers load one of the feature matrices into memory, and these are dense (#items x #features entries or #users x #features entries). Are you sure that the mappers have enough memory for that? It's really strange that you have problems with such small data; I tested this with Netflix (~100M non-zeros) on a few machines and it worked quite well.

--sebastian
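For concreteness, Sebastian's suggested changes could be applied like this when launching the factorization from Java. This is a sketch: the input/output paths and option spellings are taken from this thread, and the exact option parsing may differ by Mahout version, so verify against your build.

    import org.apache.hadoop.util.ToolRunner;
    import org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob;

    // Sketch: launch the job with a small numFeatures and implicit
    // feedback enabled, per Sebastian's advice. Paths are from upthread.
    public class LaunchFactorization {
      public static void main(String[] args) throws Exception {
        String[] jobArgs = {
            "--input", "matrix",
            "--output", "ALS",
            "--numFeatures", "5",         // small value, just to test
            "--numIterations", "10",
            "--lambda", "0.065",
            "--implicitFeedback", "true"  // binary data => implicit feedback
        };
        int exitCode = ToolRunner.run(new ParallelALSFactorizationJob(), jobArgs);
        System.exit(exitCode);
      }
    }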
Re: Parallel ALS-WR on very large matrix -- crashing (I think)
Hi Nicholas,

On Feb 2, 2012, at 10:56am, Nicholas Kolegraff wrote:

OK, I took a bit deeper look into this, having changed some parameters and kicked off the new job. It seems plausible that I didn't have enough memory for some of the mappers -- unless I'm missing something here. An upper bound on the memory would be (assuming my original parameter of 25 features): 8M rows * 25 features = 200M entries; multiply by 8 bytes (double precision floating point) and we get 1.6 billion bytes; 1.6B / (1024^3) = ~1.5GB of memory needed. The TaskTracker and DataNode heap sizes were only set to 1GB.

The memory you need for this task is based on the mapred.child.java.opts setting (the -Xmx setting), not what's allocated for the NameNode, JobTracker, DataNode or TaskTracker. In fact, increasing the DataNode and TaskTracker heap sizes removes memory that could/should be used by the child JVMs that the TaskTracker creates to run your map-reduce tasks. Currently it looks like you have 4GB allocated for m2.2xlarge tasks, which should be sufficient given your analysis above.

-- Ken

So I have changed the bootstrap action on EC2 as follows (this is a diff between the original and my changes):

# Parameters of the array:
# [mapred.child.java.opts, mapred.tasktracker.map.tasks.maximum, mapred.tasktracker.reduce.tasks.maximum]
29c29
< m2.2xlarge = [-Xmx4096m, 6, 2],
---
> m2.2xlarge = [-Xmx8192m, 3, 2],

# Parameters of the array (vars modified in hadoop-env.sh)
# [HADOOP_JOBTRACKER_HEAPSIZE, HADOOP_NAMENODE_HEAPSIZE, HADOOP_TASKTRACKER_HEAPSIZE, HADOOP_DATANODE_HEAPSIZE]
47c47
< m2.2xlarge = [2048, 8192, 1024, 1024],
---
> m2.2xlarge = [4096, 16384, 2048, 2048]

On Thu, Feb 2, 2012 at 8:40 AM, Sebastian Schelter s...@apache.org wrote:

Hmm, are you sure that the mappers have enough memory? You can set that via -Dmapred.child.java.opts=-Xmx[some number]m

--sebastian
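In code, Sebastian's -Dmapred.child.java.opts suggestion and Ken's point about child versus daemon heaps look roughly like this when set per-job rather than via the EMR bootstrap action. A minimal sketch; the heap value is just the one discussed in this thread.

    import org.apache.hadoop.conf.Configuration;

    // Sketch: the child-task JVM heap, not the daemon heaps, bounds the
    // memory available to the ALS mappers.
    public class ChildHeapConfig {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Heap for each spawned map/reduce task JVM.
        conf.set("mapred.child.java.opts", "-Xmx4096m");
        // Daemon heaps (HADOOP_TASKTRACKER_HEAPSIZE etc.) live in
        // hadoop-env.sh and do NOT add memory for tasks -- they compete
        // with the child JVMs for the same RAM.
        System.out.println(conf.get("mapred.child.java.opts"));
      }
    }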
Re: Parallel ALS-WR on very large matrix -- crashing (I think)
Success! Thanks, all!!

I changed the --numFeatures option to 5 and it went through with no problems. However, the final 'SolveExplicitFeedback' step took a very long time relative to the others, so I suspect the other suggestion -- changing mapred.task.timeout to something much larger than 600 seconds -- would also have fixed the issue (given that you have enough memory).
Re: Parallel ALS-WR on very large matrix -- crashing (I think)
Hi,

This *may* just be a Hadoop issue -- it sounds like the JobTracker is upset that it hasn't heard from one of the workers in too long (over 600 seconds). Can you check your Hadoop administration pages for the cluster? Does the cluster still seem to be functioning? I haven't used Hadoop with EC2, so I'm not sure how difficult it will be to check the cluster :-/ If everything seems to be OK, there's a Hadoop setting that controls how long it's willing to wait before assuming a machine has failed and killing a task.

-Kate

On Wed, Feb 1, 2012 at 5:48 PM, Nicholas Kolegraff nickkolegr...@gmail.com wrote:

Hello,

I am attempting to run parallelALS on a very large matrix on EC2. The matrix is ~8 million x 1 million and very sparse -- 0.007% of the entries have data. I am attempting to run on 8 nodes with 34.2 GB of memory each (m2.2xlarge). (I kept getting OutOfMemory exceptions, so I kept upping the ante until I arrived at the above configuration.)

It makes it through the following jobs with no problem:

ip-10-166-55-151.us-west-1.compute.internal_1328033659670_job_201201311814_0001_hadoop_ParallelALSFactorizationJob-ItemRatingVectorsMappe
ip-10-166-55-151.us-west-1.compute.internal_1328033659670_job_201201311814_0002_hadoop_ParallelALSFactorizationJob-TransposeMapper-Reduce
ip-10-166-55-151.us-west-1.compute.internal_1328033659670_job_201201311814_0003_hadoop_ParallelALSFactorizationJob-AverageRatingMapper-Re
ip-10-166-55-151.us-west-1.compute.internal_1328033659670_job_201201311814_0004_hadoop_ParallelALSFactorizationJob-SolveExplicitFeedbackM
ip-10-166-55-151.us-west-1.compute.internal_1328033659670_job_201201311814_0023_hadoop_ParallelALSFactorizationJob-SolveExplicitFeedbackM

Then it crashes with only the following error message:

Task attempt_201201311814_0023_m_00_0 failed to report status for 600 seconds. Killing!

Each map attempt in the 23rd job ('SolveExplicitFeedback') fails to report its status. I'm not sure what is causing this -- I am still trying to wrap my head around the Mahout API. Could this still be a memory issue? Hopefully I'm not missing something trivial?!
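The setting Kate refers to is mapred.task.timeout, in milliseconds; its default of 600000 matches the "failed to report status for 600 seconds" message. A minimal sketch of raising it per-job follows; the 30-minute value is just an example. The same thing can be passed on the command line as -Dmapred.task.timeout=1800000.

    import org.apache.hadoop.conf.Configuration;

    // Sketch: raise the task keep-alive timeout past the 600 s default.
    public class TaskTimeoutConfig {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.setLong("mapred.task.timeout", 30L * 60 * 1000); // 30 minutes
        System.out.println(conf.getLong("mapred.task.timeout", 600000L));
      }
    }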
Re: Parallel ALS-WR on very large matrix -- crashing (I think)
So the total size of the data is modest, at about 560M non-zero elements. The total data should be small compared to your node sizes. But the distribution of your data can be important as well. Can you say whether any rows or columns are extremely dense?

On Wed, Feb 1, 2012 at 4:58 PM, Kate Ericson eric...@cs.colostate.edu wrote:

The matrix is ~8 million x 1 million, very sparse -- 0.007% has data.
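To answer Ted's question empirically, something like the following can scan a user,item,1 file for unusually dense rows. A rough sketch: the file name and the CSV layout are assumptions based on the format described in this thread, and a real 560M-entry file would want a streaming or MapReduce version instead of an in-memory map.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.HashMap;
    import java.util.Map;

    // Sketch: count entries per row of a user,item,1 CSV and report the
    // densest row. File name "matrix.csv" is hypothetical.
    public class RowDensity {
      public static void main(String[] args) throws Exception {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        BufferedReader in = new BufferedReader(new FileReader("matrix.csv"));
        String line;
        while ((line = in.readLine()) != null) {
          String user = line.split(",")[0];
          Integer c = counts.get(user);
          counts.put(user, c == null ? 1 : c + 1);
        }
        in.close();
        int max = 0;
        for (int c : counts.values()) {
          max = Math.max(max, c);
        }
        System.out.println("densest row has " + max + " entries");
      }
    }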
Re: Parallel ALS-WR on very large matrix -- crashing (I think)
Thanks for the prompt reply, Kate!

The cluster has since been torn down on EC2, but I did monitor it during the job execution and all seemed to be OK -- the JobTracker and NameNode would continue to report status. I was aware of the configuration setting and was hoping to refrain from playing with it :-) I get scared to make it too large, since that time could get unnecessarily charged to my EC2 account. :S

Do you know if it should still report status in the midst of a complex task? It seems questionable that it wouldn't just send a friendly hello?
Re: Parallel ALS-WR on very large matrix -- crashing (I think)
If it's thrashing on something, there's a good chance it might miss a checkpoint. Like Ted brought up, there may be some very dense areas of your input causing this problem. How much memory are you giving to your Hadoop workers? The default value is rather small.

-Kate
Re: Parallel ALS-WR on very large matrix -- crashing (I think)
The most dense row contains about 55K elements of the 1M. There are about 5 other rows with around 10K, and it drops considerably after that for the others (~2K). I am using the memory-intensive bootstrap action on EC2, which bumps the heap space for the child JVMs to around 4GB, I believe.