Re: ALS failure with size Integer.MAX_VALUE
Unfortunately, it depends on the Sorter API in 1.2. -Xiangrui
Re: ALS failure with size Integer.MAX_VALUE
Ok. We'll try using it in a test cluster running 1.2.
Re: ALS failure with size Integer.MAX_VALUE
Hi Xiangrui, The block size limit was encountered even with a reduced number of item blocks, as you had expected. I'm wondering if I could try the new implementation as a standalone library against a 1.1 deployment. Does it have dependencies on any core APIs in the current master? Thanks, Bharath
Re: ALS failure with size Integer.MAX_VALUE
Thanks Xiangrui. I'll try out setting a smaller number of item blocks. And yes, I've been following the JIRA for the new ALS implementation. I'll try it out when it's ready for testing.
Re: ALS failure with size Integer.MAX_VALUE
Hi Bharath, You can try setting a smaller number of item blocks in this case. 1200 is definitely too large for ALS. Please try 30 or even smaller. I'm not sure whether this will solve the problem, because you have 100 items connected with 10^8 users. There is a JIRA for this issue: https://issues.apache.org/jira/browse/SPARK-3735, which I will try to implement in 1.3. I'll ping you when it is ready. Best, Xiangrui
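For scale, even-split arithmetic on the raw ratings alone suggests the 2 GB cap should be comfortable at any of the block counts tried, which points at skew and factor replication rather than raw rating volume as the cause. The sketch below is purely illustrative; the ~16 bytes per serialized rating is an assumed figure, not a measured one:

```python
INT_MAX = 2**31 - 1            # the 2 GB per-block cap in Spark 1.x
NUM_RATINGS = 1_200_000_000    # 1.2 billion training records from the thread
BYTES_PER_RATING = 16          # assumed serialized size per (user, item, rating)

def even_split_block_bytes(num_blocks):
    """Bytes per block if the ratings split perfectly evenly across blocks."""
    return NUM_RATINGS * BYTES_PER_RATING // num_blocks

for blocks in (30, 50, 1200):
    size = even_split_block_bytes(blocks)
    print(blocks, size, size < INT_MAX)
# Every case lands far below the cap, yet tasks still shuffle-read ~9.7 GB,
# so the oversized blocks must come from skew, not from rating volume alone.
```

This lower-bound arithmetic is exactly what the item-degree skew breaks: a handful of very hot items can push one block over the limit however many blocks exist.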
Re: ALS failure with size Integer.MAX_VALUE
Yes, the issue appears to be due to the 2GB block size limitation. I am hence looking for (user, product) block sizing suggestions to work around it.

On Sun, Nov 30, 2014 at 3:01 PM, Sean Owen so...@cloudera.com wrote: (It won't be that, since you see the error occur when reading a block from disk. I think this is an instance of the 2GB block size limitation.)
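The exception itself comes from java.nio: FileChannel.map rejects any mapping larger than Integer.MAX_VALUE bytes because the resulting MappedByteBuffer is indexed by a Java int, and Spark 1.x's DiskStore memory-maps each block in one piece. A quick sketch of the numbers (the 9.7 GB figure is the per-task shuffle read reported in the original post):

```python
INT_MAX = 2**31 - 1  # java.lang.Integer.MAX_VALUE, the largest single mapping
                     # FileChannel.map (and hence Spark 1.x's DiskStore) allows

# The cap works out to just under 2 GiB per block:
print(INT_MAX / 2**30)   # fractionally below 2.0 GiB

# The failing tasks shuffle-read ~9.7 GB, several times over that cap:
print(9.7e9 / INT_MAX)   # roughly 4.5x the limit
```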
Re: ALS failure with size Integer.MAX_VALUE
Hi Bharath – I’m unsure if this is your problem, but the MatrixFactorizationModel in MLlib, which is the underlying component for ALS, expects your user/product fields to be integers. Specifically, the input to ALS is an RDD[Rating], and Rating is an (Int, Int, Double). I am wondering if perhaps one of your identifiers exceeds MAX_INT; could you write a quick check for that? I have been running a very similar use case to yours (with more constrained hardware resources). I haven’t seen this exact problem, but I’m sure we’ve seen similar issues. Please let me know if you have other questions.
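For the quick check Ilya suggests, the raw records can be scanned for ids outside the signed 32-bit range before they are handed to ALS. The sketch below is a hypothetical standalone check over parsed (user, item, rating) tuples, not the actual Spark job; `overflowing_ids` is an illustrative helper name:

```python
INT_MIN, INT_MAX = -2**31, 2**31 - 1  # range of a Scala/Java Int (Rating's id type)

def overflowing_ids(records):
    """Return the (user, item) pairs whose ids would not fit a 32-bit Int.

    `records` is any iterable of (user_id, item_id, rating) tuples, e.g.
    parsed from the raw training files.
    """
    return [(u, i) for u, i, _ in records
            if not (INT_MIN <= u <= INT_MAX and INT_MIN <= i <= INT_MAX)]

sample = [(42, 7, 3.0), (2_500_000_000, 7, 1.0)]  # second user id overflows
print(overflowing_ids(sample))  # -> [(2500000000, 7)]
```

An empty result rules this hypothesis out, which matches the later finding in the thread that the 2GB block limit, not id overflow, was the culprit.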
Re: ALS failure with size Integer.MAX_VALUE
Any suggestions to address the described problem? In particular, considering the skewed degree of some of the item nodes in the graph, I believe it should be possible to define better block sizes to reflect that fact, but I am unsure how to arrive at those sizes. Thanks, Bharath
ALS failure with size Integer.MAX_VALUE
We're training a recommender with ALS in MLlib 1.1 against a dataset of 150M users and 4.5K items, with the total number of training records being 1.2 billion (~30GB of data). The input data is spread across 1200 partitions on HDFS. For the training, rank=10, and we've configured {number of user data blocks = number of item data blocks}. The number of user/item blocks was varied between 50 and 1200. Irrespective of the block count (e.g. at 1200 blocks each), there are at least a couple of tasks that end up shuffle-reading 9.7G each in the aggregate stage (ALS.scala:337) and failing with the following exception:

java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:745)
at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:108)
at org.apache.spark.storage.DiskStore.getValues(DiskStore.scala:124)
at org.apache.spark.storage.BlockManager.getLocalFromDisk(BlockManager.scala:332)
at org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator$$anonfun$getLocalBlocks$1.apply(BlockFetcherIterator.scala:204)

As for the data: on the user side, the degree of a node in the connectivity graph is relatively small. However, on the item side, 3.8K out of the 4.5K items are connected to 10^5 users each on average, with 100 items being connected to nearly 10^8 users. The rest of the items are connected to fewer than 10^5 users. With such a skew in the connectivity graph, I'm unsure if additional memory or variation in the block sizes would help (considering my limited understanding of the implementation in MLlib). Any suggestion to address the problem?

The test is being run on a standalone cluster of 3 hosts, each with 100G RAM and 24 cores dedicated to the application. The additional configs I set, specific to shuffle and task-failure reduction, are as follows:
spark.core.connection.ack.wait.timeout=600
spark.shuffle.consolidateFiles=true
spark.shuffle.manager=SORT

The job execution summary is as follows:

Active stages:
Stage 2: aggregate at ALS.scala:337, duration 55 min, tasks 1197/1200 (3 failed), shuffle read 141.6 GB

Completed stages (5):
Stage 6: org.apache.spark.rdd.RDD.flatMap(RDD.scala:277), 12 min, tasks 1200/1200, input 29.9 GB, shuffle read 1668.4 MB, shuffle write 186.8 GB
Stage 5: mapPartitionsWithIndex at ALS.scala:250
Stage 3: map at ALS.scala:231
Stage 0: aggregate at ALS.scala:337
Stage 1: map at ALS.scala:228

Thanks,
Bharath
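One way to see why varying the block count between 50 and 1200 never helped: under the skew described above, a single hot item already accounts for a large fraction of the 2 GB block cap, so any item block that lands two or more of the ~100 hot items overflows no matter how many blocks exist. A back-of-the-envelope sketch, where the 16 bytes per serialized rating is an assumption rather than a measured figure:

```python
INT_MAX = 2**31 - 1       # the per-block cap behind "Size exceeds Integer.MAX_VALUE"
BYTES_PER_RATING = 16     # assumed serialized size per rating; an estimate

def block_bytes(hot_items, users_per_item):
    """Approximate ratings bytes in an item block holding `hot_items` hot items,
    each connected to `users_per_item` users."""
    return hot_items * users_per_item * BYTES_PER_RATING

# One hot item (~10^8 users) already uses ~75% of the cap on its own:
print(block_bytes(1, 10**8) / INT_MAX)   # ~0.745

# Any block that receives two or more of the ~100 hot items overflows,
# however many item blocks the data is split into:
print(block_bytes(2, 10**8) > INT_MAX)   # True
```

The exact per-entry constant depends on the serializer, but the conclusion is insensitive to it: either the hot items must be isolated into their own blocks, or the 2 GB per-block limit itself must be lifted, which is what SPARK-3735 and the new ALS implementation discussed in this thread aim at.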