[ 
https://issues.apache.org/jira/browse/SPARK-5739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14316393#comment-14316393
 ] 

DjvuLee commented on SPARK-5739:
--------------------------------

Yes, 1M dimensions may be enough for the k-means algorithm.

But if we consider other machine learning algorithms, such as logistic 
regression (LR), then 10^7 dimensions is not that big. LR models of this 
scale are common in real-world ad click prediction (I have heard this from 
friends in the field), so how can Spark deal well with this?

The weight vector in LR may be just a single parameter array, but when the 
dimension reaches a billion, its size can reach several GB.
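To make the size concern concrete, here is a back-of-the-envelope sketch (my 
own illustration, not code from Spark): a dense double-precision weight vector 
of d dimensions takes roughly 8*d bytes, so a billion-dimension LR model 
already exceeds the 2 GB (Integer.MAX_VALUE) limit that FileChannelImpl.map 
enforces in the stack trace below.

```java
// Rough sizes for a dense double[] weight vector.
// Assumes 8 bytes per double; ignores JVM array header overhead.
public class WeightVectorSize {
    static long bytesForDims(long dims) {
        return dims * 8L; // 8 bytes per IEEE-754 double
    }

    public static void main(String[] args) {
        long[] dims = {1_000_000L, 10_000_000L, 1_000_000_000L};
        for (long d : dims) {
            long bytes = bytesForDims(d);
            System.out.printf("%,d dims -> %,d bytes (over 2GB map limit: %b)%n",
                    d, bytes, bytes > Integer.MAX_VALUE);
        }
    }
}
```

So 10^7 dimensions is only about 80 MB, but 10^9 dimensions is about 8 GB, 
which no single mapped block can hold.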

> Size exceeds Integer.MAX_VALUE in File Map
> ------------------------------------------
>
>                 Key: SPARK-5739
>                 URL: https://issues.apache.org/jira/browse/SPARK-5739
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.1.1
>         Environment: Spark 1.1.1 on a cluster with 12 nodes. Every node has 
> 128GB RAM and 24 cores. The data is just 40GB, and there are 48 parallel 
> tasks per node.
>            Reporter: DjvuLee
>
> I just ran the k-means algorithm using randomly generated data, but this 
> problem occurred after some iterations. I tried several times, and the 
> problem is reproducible. 
> Because the data is randomly generated, I wonder: is there a bug? Or, if 
> random data can lead to a scenario where the size is bigger than 
> Integer.MAX_VALUE, can we check the size before using the file map?
> 2015-02-11 00:39:36,057 [sparkDriver-akka.actor.default-dispatcher-15] WARN  
> org.apache.spark.util.SizeEstimator - Failed to check whether 
> UseCompressedOops is set; assuming yes
> [error] (run-main-0) java.lang.IllegalArgumentException: Size exceeds 
> Integer.MAX_VALUE
> java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
>       at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:850)
>       at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:105)
>       at org.apache.spark.storage.DiskStore.putIterator(DiskStore.scala:86)
>       at 
> org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:140)
>       at 
> org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:105)
>       at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:747)
>       at 
> org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:598)
>       at 
> org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:869)
>       at 
> org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:79)
>       at 
> org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:68)
>       at 
> org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:36)
>       at 
> org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:29)
>       at 
> org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
>       at org.apache.spark.SparkContext.broadcast(SparkContext.scala:809)
>       at 
> org.apache.spark.mllib.clustering.KMeans.initKMeansParallel(KMeans.scala:270)
>       at org.apache.spark.mllib.clustering.KMeans.runBreeze(KMeans.scala:143)
>       at org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:126)
>       at org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:338)
>       at org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:348)
>       at KMeansDataGenerator$.main(kmeans.scala:105)
>       at KMeansDataGenerator.main(kmeans.scala)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:94)
>       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:55)
>       at java.lang.reflect.Method.invoke(Method.java:619)
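On the reporter's suggestion of checking the size before using the file map: a 
minimal guard might look like the sketch below. This is a hypothetical 
illustration of the pre-check, not Spark's actual code. FileChannel.map itself 
rejects lengths above Integer.MAX_VALUE with exactly the 
IllegalArgumentException seen in the trace, so without splitting the block 
into smaller chunks, the best a caller can do is fail fast with a message that 
points at the real cause.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Hypothetical pre-check before memory-mapping a block region.
// FileChannel.map throws IllegalArgumentException for sizes above
// Integer.MAX_VALUE, so we check first and raise a clearer error.
public class SafeMap {
    public static MappedByteBuffer mapChecked(FileChannel channel,
                                              long offset, long length)
            throws IOException {
        if (length > Integer.MAX_VALUE) {
            throw new IllegalArgumentException(
                "Block of " + length + " bytes exceeds the 2GB mmap limit; "
                + "consider more partitions or smaller blocks");
        }
        return channel.map(FileChannel.MapMode.READ_ONLY, offset, length);
    }
}
```

Such a check only improves the error message; avoiding the failure altogether 
would require keeping each stored block under 2GB, e.g. by increasing the 
number of partitions.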



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
