[ https://issues.apache.org/jira/browse/SPARK-5739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14316393#comment-14316393 ]
DjvuLee commented on SPARK-5739:
--------------------------------

Yes, 1M may be enough for the k-means algorithm. But if we consider other machine learning algorithms, such as logistic regression, then 10^7 dimensions is not that big. LR for ad-click models at this scale is probably common in practice (I have heard as much from friends), so how can Spark deal well with this? Even though LR has only a single weight vector as its parameter, when the dimension goes up to billions that vector alone can reach several GB.

> Size exceeds Integer.MAX_VALUE in File Map
> ------------------------------------------
>
>                 Key: SPARK-5739
>                 URL: https://issues.apache.org/jira/browse/SPARK-5739
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.1.1
>         Environment: Spark 1.1.1 on a cluster with 12 nodes. Every node has
> 128 GB RAM and 24 cores. The data is just 40 GB, and there are 48 parallel
> tasks on a node.
>            Reporter: DjvuLee
>
> I just ran the k-means algorithm on randomly generated data, but hit this
> problem after some iterations. I tried several times, and the problem is
> reproducible.
> Because the data is randomly generated, I wonder: is this a bug? Or, if
> random data can lead to a scenario where the size is bigger than
> Integer.MAX_VALUE, can we check the size before using the file map?
> 2015-02-11 00:39:36,057 [sparkDriver-akka.actor.default-dispatcher-15] WARN
> org.apache.spark.util.SizeEstimator - Failed to check whether
> UseCompressedOops is set; assuming yes
> [error] (run-main-0) java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
> java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
>     at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:850)
>     at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:105)
>     at org.apache.spark.storage.DiskStore.putIterator(DiskStore.scala:86)
>     at org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:140)
>     at org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:105)
>     at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:747)
>     at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:598)
>     at org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:869)
>     at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:79)
>     at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:68)
>     at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:36)
>     at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:29)
>     at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
>     at org.apache.spark.SparkContext.broadcast(SparkContext.scala:809)
>     at org.apache.spark.mllib.clustering.KMeans.initKMeansParallel(KMeans.scala:270)
>     at org.apache.spark.mllib.clustering.KMeans.runBreeze(KMeans.scala:143)
>     at org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:126)
>     at org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:338)
>     at org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:348)
>     at KMeansDataGenerator$.main(kmeans.scala:105)
>     at KMeansDataGenerator.main(kmeans.scala)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:94)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:55)
>     at java.lang.reflect.Method.invoke(Method.java:619)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
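[Editor's note] A back-of-the-envelope check of the sizing claim in the comment: a dense double-precision weight vector with 10^9 dimensions needs roughly 8 GB, well past the 2 GB (Integer.MAX_VALUE bytes) ceiling that `FileChannelImpl.map` enforces in the trace above. The class and method names below are illustrative only, not Spark code:

```java
public class WeightVectorSize {
    // Rough size of a dense double[] of the given dimension:
    // 8 bytes per double, ignoring JVM object/array headers.
    static long denseVectorBytes(long dims) {
        return dims * 8L;
    }

    public static void main(String[] args) {
        long dims = 1_000_000_000L; // 10^9 features, e.g. a large ad-click LR model
        long bytes = denseVectorBytes(dims);
        System.out.println(bytes);                     // 8000000000 (~8 GB)
        System.out.println(bytes > Integer.MAX_VALUE); // true: cannot be one mapped region
    }
}
```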
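[Editor's note] The `sun.nio.ch.FileChannelImpl.map` frame fails because a single mapped region cannot exceed Integer.MAX_VALUE bytes. A generic workaround (a sketch of my own, not Spark's implementation) is to map a large file as several smaller regions instead of one giant buffer:

```java
import java.io.File;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not Spark code): FileChannel.map rejects regions
// larger than Integer.MAX_VALUE bytes, so map the file as a list of
// regions of at most chunkSize bytes each.
public class ChunkedMmap {
    static List<MappedByteBuffer> mapInChunks(File f, long chunkSize) throws Exception {
        List<MappedByteBuffer> chunks = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile(f, "r");
             FileChannel ch = raf.getChannel()) {
            long pos = 0;
            long len = ch.size();
            while (pos < len) {
                long n = Math.min(chunkSize, len - pos); // each region stays <= chunkSize
                chunks.add(ch.map(FileChannel.MapMode.READ_ONLY, pos, n));
                pos += n;
            }
        }
        return chunks;
    }
}
```

Readers of the chunks then have to handle values that straddle a chunk boundary, which is the main cost of this approach.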