[jira] [Created] (SPARK-43297) Make improvement to LocalKMeans

2023-04-26 Thread wenweijian (Jira)
wenweijian created SPARK-43297:
--

 Summary: Make improvement to LocalKMeans
 Key: SPARK-43297
 URL: https://issues.apache.org/jira/browse/SPARK-43297
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 3.3.0
Reporter: wenweijian


There are two initialization modes in KMeans: random mode and parallel mode.

The parallel mode uses kMeansPlusPlus to generate the center points, but 
kMeansPlusPlus is a local method that runs in the driver.

If the number of points is large, kMeansPlusPlus runs for a long time, 
because it is a single-threaded method running in the driver.

We can make this method run in parallel to make it faster, for example by 
using Scala parallel collections, as sketched below.
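
A minimal sketch of the idea, assuming the sampled points have already been 
collected to the driver; the names `points`, `centers`, and `distance` are 
hypothetical, not the actual LocalKMeans internals:

{code:scala}
// Scala 2.13 needs the org.scala-lang.modules:scala-parallel-collections
// module for `.par`; in Scala 2.12 it ships with the standard library.
import scala.collection.parallel.CollectionConverters._

object ParallelCosts {
  // The hot loop of k-means++ center selection: for every point, find the
  // squared distance to the nearest already-chosen center. Mapping over a
  // parallel collection spreads this work across all driver cores instead
  // of a single thread.
  def minCosts(points: Array[Array[Double]],
               centers: Array[Array[Double]],
               distance: (Array[Double], Array[Double]) => Double): Array[Double] =
    points.par
      .map(p => centers.iterator.map(c => distance(p, c)).min)
      .toArray
}
{code}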


[jira] [Commented] (SPARK-34788) Spark throws FileNotFoundException instead of IOException when disk is full

2022-07-28 Thread wenweijian (Jira)


[ https://issues.apache.org/jira/browse/SPARK-34788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17572536#comment-17572536 ]

wenweijian commented on SPARK-34788:


I can reproduce it. 

If the performance impact of the proposal (using SyncFailedException) is 
small, I think it is worth fixing.
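
The proposal in this issue is about surfacing SyncFailedException at write 
time; purely as a hypothetical illustration of the user-facing goal (a 
clearer error than the bare FileNotFoundException), a read-side sketch could 
look like the following. All names and the disk-full check are assumptions, 
not the actual Spark change:

{code:scala}
import java.io.{File, FileInputStream, FileNotFoundException, IOException}

object ShuffleFileOpener {
  // If opening the shuffle file fails and its volume reports no free space,
  // rethrow as an IOException that names the likely root cause instead of
  // surfacing the bare FileNotFoundException.
  def open(file: File): FileInputStream =
    try new FileInputStream(file)
    catch {
      case e: FileNotFoundException if diskFull(file) =>
        throw new IOException(
          s"Failed to read ${file.getPath}: the underlying disk may be full", e)
    }

  private def diskFull(file: File): Boolean = {
    val dir = file.getParentFile
    dir != null && dir.exists() && dir.getUsableSpace == 0L
  }
}
{code}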

> Spark throws FileNotFoundException instead of IOException when disk is full
> ---
>
> Key: SPARK-34788
> URL: https://issues.apache.org/jira/browse/SPARK-34788
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 3.2.0
>Reporter: wuyi
>Priority: Major
>
> When the disk is full, Spark throws FileNotFoundException instead of an 
> IOException with a hint about the root cause. It's quite a confusing error 
> for users:
> {code:java}
> 9/03/26 09:03:45 ERROR ShuffleBlockFetcherIterator: Failed to create input 
> stream from local block
> java.io.IOException: Error in reading 
> FileSegmentManagedBuffer{file=/local_disk0/spark-c2f26f02-2572-4764-815a-cbba65ddb315/executor-b4b76a4c-788c-4cb6-b904-664a883be1aa/blockmgr-36804371-24fe-4131-a3dc-00b7f98f3a3e/11/shuffle_113_1029_0.data,
>  offset=110254956, length=1875458}
>   at 
> org.apache.spark.network.buffer.FileSegmentManagedBuffer.createInputStream(FileSegmentManagedBuffer.java:111)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:442)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:64)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.sort_addToSorter_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:622)
>   at 
> org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3.apply(SortAggregateExec.scala:98)
>   at 
> org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3.apply(SortAggregateExec.scala:95)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$12.apply(RDD.scala:839)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$12.apply(RDD.scala:839)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:340)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:304)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:340)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:304)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:340)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:304)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
>   at org.apache.spark.scheduler.Task.doRunTask(Task.scala:139)
>   at org.apache.spark.scheduler.Task.run(Task.scala:112)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:497)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1432)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:503)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.FileNotFoundException: 
> /local_disk0/spark-c2f26f02-2572-4764-815a-cbba65ddb315/executor-b4b76a4c-788c-4cb6-b904-664a883be1aa/blockmgr-36804371-24fe-4131-a3dc-00b7f98f3a3e/11/shuffle_113_1029_0.data
>  (No such file or directory)
>   at java.io.FileInputStream.open0(Native Method)
>   at 

[jira] [Commented] (SPARK-34788) Spark throws FileNotFoundException instead of IOException when disk is full

2022-07-27 Thread wenweijian (Jira)


[ https://issues.apache.org/jira/browse/SPARK-34788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17572212#comment-17572212 ]

wenweijian commented on SPARK-34788:


I got this FileNotFoundException when the disk holding the HDFS dir 
(/tmp/logs/root/bucket-logs-tfile) was full.

When I deleted some files in that dir, the exception disappeared.
logs:
{code:java}
org.apache.spark.shuffle.FetchFailedException: Error in reading 
FileSegmentManagedBuffer[file=/home/install/hadoop/data2/hadoop-3.3.0/nm-local-dir/usercache/root/appcache/application_1658221318180_0002/blockmgr-001c6d1a-bda3-4f8e-8998-42e3a8fbaaa9/0d/shuffle_0_7622_0.data,offset=57062119,length=26794]
    at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:770)
    at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:649)
    at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:70)
    at 
org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29)
    at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
    at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
    at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at 
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:200)
    at 
org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:128)
    at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:106)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:131)
    at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.IOException: Error in reading 
FileSegmentManagedBuffer[file=/home/install/hadoop/data2/hadoop-3.3.0/nm-local-dir/usercache/root/appcache/application_1658221318180_0002/blockmgr-001c6d1a-bda3-4f8e-8998-42e3a8fbaaa9/0d/shuffle_0_7622_0.data,offset=57062119,length=26794]
    at 
org.apache.spark.network.buffer.FileSegmentManagedBuffer.createInputStream(FileSegmentManagedBuffer.java:112)
    at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:637)
    ... 23 more
Caused by: java.io.FileNotFoundException: 
/home/install/hadoop/data2/hadoop-3.3.0/nm-local-dir/usercache/root/appcache/application_1658221318180_0002/blockmgr-001c6d1a-bda3-4f8e-8998-42e3a8fbaaa9/0d/shuffle_0_7622_0.data
 (No such file or directory)
    at java.io.FileInputStream.open0(Native Method)
    at java.io.FileInputStream.open(FileInputStream.java:195)
    at java.io.FileInputStream.<init>(FileInputStream.java:138)
    at 
org.apache.spark.network.buffer.FileSegmentManagedBuffer.createInputStream(FileSegmentManagedBuffer.java:101)
    ... 24 more {code}
