[GitHub] spark issue #19410: [SPARK-22184][CORE][GRAPHX] GraphX fails in case of insu...

2018-10-29 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/19410
  
Hi @szhem.

I'm sorry I haven't been more responsive here. I can relate to your 
frustration, and I do want to help you make progress on this PR and merge it 
in. I have indeed been busy with other responsibilities, but I can rededicate 
time to reviewing this PR. 

Of all the approaches you've proposed so far, I like the 
`ContextCleaner`-based one the best. Personally, I'm okay with setting 
`spark.cleaner.referenceTracking.cleanCheckpoints` to `true` by default for the 
next major Spark release and documenting this change of behavior in the release 
notes. However, that may not be okay with the senior maintainers. As an 
alternative I wonder if we could instead create a new config just for graph RDD 
checkpoint cleaning such as 
`spark.cleaner.referenceTracking.cleanGraphCheckpoints` and set that to `true` 
by default. Then use that config value in `PeriodicGraphCheckpointer` instead 
of `spark.cleaner.referenceTracking.cleanCheckpoints`.

Would you be willing to open another PR with your `ContextCleaner`-based 
approach? I'm not suggesting you close this PR. We can call each PR alternative 
solutions for the same JIRA ticket and cross-reference each PR. If you do that 
then I will try to debug the problem with the retained checkpoints.

Thank you.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19410: [SPARK-22184][CORE][GRAPHX] GraphX fails in case of insu...

2018-10-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19410
  
Can one of the admins verify this patch?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19410: [SPARK-22184][CORE][GRAPHX] GraphX fails in case of insu...

2018-09-30 Thread szhem
Github user szhem commented on the issue:

https://github.com/apache/spark/pull/19410
  
Hello @mallman, @sujithjay, @felixcheung, @jkbradley, @mengxr, it's already 
about a year passed since this pull request has been opened.
I'm just wondering whether there is any chance to get any feedback for this 
PR (understanding that all of you have a little or probably no time having your 
own more important activities) and get it either rejected or merged?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19410: [SPARK-22184][CORE][GRAPHX] GraphX fails in case of insu...

2018-09-03 Thread szhem
Github user szhem commented on the issue:

https://github.com/apache/spark/pull/19410
  
I've tested the mentioned checkpointers with 
`spark.cleaner.referenceTracking.cleanCheckpoints` set to `true` and without 
explicit checkpoint files removal.

It seems that there are somewhere hard references remained - old checkpoint 
files are not deleted at all and it seems that ContextCleaner.doCleanCheckpoint 
is never called.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19410: [SPARK-22184][CORE][GRAPHX] GraphX fails in case of insu...

2018-09-01 Thread szhem
Github user szhem commented on the issue:

https://github.com/apache/spark/pull/19410
  
Hi @asolimando,

I believe that the solution with weak references will work and probably 
with `ContextCleaner` too, but there are some points I'd like to discuss if you 
don't mind

- Let's take `ContextCleaner` first. In that case we should have a pretty 
simple solution, but the backward compatibility of `PeriodicRDDCheckpointer` 
and `PeriodicGraphCheckpointer` will be lost, because 
  - it will be necessary to update these classes to prevent deleting 
checkpoint files 
  - user will always have to provide 
`spark.cleaner.referenceTracking.cleanCheckpoints` property in order to clean 
unnecessary checkpoints. 
  - the users who haven't specified 
`spark.cleaner.referenceTracking.cleanCheckpoints` previously (and I believe 
there will be most of them) will be affected by this new unexpected behaviour 

- In case of custom solution based on weak references 
  - it will be necessary to poll a reference queue at some place and moment 
to remove unnecessary checkpoint files. 
  - polling the reference queue in the separate thread will complicate the 
code 
  - polling the reference queue synchronously does not guarantee deleting 
all the unnecessary checkpoint files.

In case of reference queue, could you please recommend the convenient place 
in the source code to do it? 

As for me 
- setting `spark.cleaner.referenceTracking.cleanCheckpoints` to `true` by 
default should work 
- setting `spark.cleaner.referenceTracking.cleanCheckpoints` to `true` by 
default will allow us to simplify `PeriodicRDDCheckpointer` and 
`PeriodicGraphCheckpointer` too by deleting unnecessary code that cleans up 
checkpoint files
- setting `spark.cleaner.referenceTracking.cleanCheckpoints` to `true` by 
default sounds reasonable and should not break the code of those users who 
didn't do it previously
- but setting `spark.cleaner.referenceTracking.cleanCheckpoints` to `true` 
will probably break the backward compatibility (although this PR tries to 
preserve it)

What do you think? Will the community accept setting 
`spark.cleaner.referenceTracking.cleanCheckpoints` to `true`by default?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19410: [SPARK-22184][CORE][GRAPHX] GraphX fails in case of insu...

2018-07-11 Thread EthanRock
Github user EthanRock commented on the issue:

https://github.com/apache/spark/pull/19410
  
I have tried to set graph's storage level to StorageLevel.MEMORY_AND_DISK 
in my case and the error still happens.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19410: [SPARK-22184][CORE][GRAPHX] GraphX fails in case of insu...

2018-07-09 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/19410
  
Hi @szhem.

I understand you've put a lot of work into this implementation, however I 
think you should try a simpler approach before we consider something more 
complicated. I believe an approach based on weak references and a reference 
queue would be a much simpler alternative. Can you give that a try?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19410: [SPARK-22184][CORE][GRAPHX] GraphX fails in case of insu...

2018-07-09 Thread szhem
Github user szhem commented on the issue:

https://github.com/apache/spark/pull/19410
  
Hi @mallman,

I believe, that `ContextCleaner` currently does not delete checkpoint data 
it case of unexpected failures.
Also as it works at the end of the job then there is still a chance that a 
job processing quite a big graph with a lot of iterations can influence other 
running jobs by consuming a lot of disk during its run.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19410: [SPARK-22184][CORE][GRAPHX] GraphX fails in case of insu...

2018-07-09 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/19410
  
Hi @szhem.

Thanks for the information regarding disk use for your scenario. What do 
you think about my second point, using the `ContextCleaner`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19410: [SPARK-22184][CORE][GRAPHX] GraphX fails in case of insu...

2018-07-09 Thread EthanRock
Github user EthanRock commented on the issue:

https://github.com/apache/spark/pull/19410
  
Hi, I met the same problem today.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19410: [SPARK-22184][CORE][GRAPHX] GraphX fails in case of insu...

2018-07-03 Thread szhem
Github user szhem commented on the issue:

https://github.com/apache/spark/pull/19410
  
@mallman 

Just my two cents regarding built-in solutions:

Periodic checkpointer deletes checkpoint files not to pollute the hard 
drive. Although disk storage is cheap it's not free. 

For example, in my case (graph with >1B vertices and about the same amount 
of edges) checkpoint directory with a single checkpoint took about 150-200GB. 
Checkpoint interval was set to 5, and then job was able to complete in 
about 100 iterations.
So in case of not cleaning up unnecessary checkpoints, the checkpoint 
directory could grow up to 6TB (which is quite a lot) in my case.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19410: [SPARK-22184][CORE][GRAPHX] GraphX fails in case of insu...

2018-07-02 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/19410
  
Hi @szhem. I dug deeper and think I understand the problem better.

To state the obvious, the periodic checkpointer deletes checkpoint files of 
RDDs that are potentially still accessible. In fact, that's the problem here. 
It deletes the checkpoint files of an RDD that's later used.

The algorithm being used to find checkpoint files that can be "safely" 
deleted is flawed, and this PR aims to fix that.

I have a few thoughts from this.

1. Why does the periodic checkpointer delete checkpoint files? I absolutely 
understand the preciousness of cache memory and wanting to keep the cache as 
clean as possible, but this has nothing to do with that. We're talking about 
deleting files from disk storage. I'm making some assumptions, like we're using 
a filesystem that's not backed by RAM, but disk storage is dirt cheap these 
days. Why can't we just let the user delete the checkpoint files themselves?

1.a. Can we and should we support a mode where the automatic deletion of 
checkpoint files is an option (with a warning of potential failures)? To 
maintain backwards compatibility, we set this option to true by default, but 
"power" users can set this value to false to do the cleanup themselves and 
ensure the checkpointer doesn't delete files it shouldn't.

2. I think the JVM gives us a built-in solution for the automatic and safe 
deletion of checkpoint files, and the `ContextCleaner` does just that (and 
more). Can we leverage that functionality?

What do you think? @felixcheung or @viirya, can you weigh in on this, 
please?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19410: [SPARK-22184][CORE][GRAPHX] GraphX fails in case of insu...

2018-06-27 Thread szhem
Github user szhem commented on the issue:

https://github.com/apache/spark/pull/19410
  
Hi @mallman!

In case of 
```
StorageLevel.MEMORY_AND_DISK
StorageLevel.MEMORY_AND_DISK_SER_2
```
... tests pass.

They still fail in case of 
```
StorageLevel.MEMORY_ONLY
StorageLevel.MEMORY_ONLY_SER
```
Although it works, I'm not sure that changing the caching level of the 
graph is really a good option to go with as Spark starts complaining 
[here](https://github.com/apache/spark/blob/f830bb9170f6b853565d9dd30ca7418b93a54fe3/graphx/src/main/scala/org/apache/spark/graphx/impl/VertexPartitionBaseOps.scala#L111)
 and 
[here](https://github.com/apache/spark/blob/f830bb9170f6b853565d9dd30ca7418b93a54fe3/graphx/src/main/scala/org/apache/spark/graphx/impl/VertexPartitionBaseOps.scala#L131)
 
```
18/06/27 16:08:46.802 Executor task launch worker for task 3 WARN 
ShippableVertexPartitionOps: Diffing two VertexPartitions with different 
indexes is slow.
18/06/27 16:08:47.000 Executor task launch worker for task 4 WARN 
ShippableVertexPartitionOps: Diffing two VertexPartitions with different 
indexes is slow.
18/06/27 16:08:47.164 Executor task launch worker for task 5 WARN 
ShippableVertexPartitionOps: Diffing two VertexPartitions with different 
indexes is slow.
18/06/27 16:08:48.724 Executor task launch worker for task 18 WARN 
ShippableVertexPartitionOps: Joining two VertexPartitions with different 
indexes is slow.
18/06/27 16:08:48.749 Executor task launch worker for task 18 WARN 
ShippableVertexPartitionOps: Diffing two VertexPartitions with different 
indexes is slow.
18/06/27 16:08:48.868 Executor task launch worker for task 19 WARN 
ShippableVertexPartitionOps: Joining two VertexPartitions with different 
indexes is slow.
18/06/27 16:08:48.899 Executor task launch worker for task 19 WARN 
ShippableVertexPartitionOps: Diffing two VertexPartitions with different 
indexes is slow.
18/06/27 16:08:49.008 Executor task launch worker for task 20 WARN 
ShippableVertexPartitionOps: Joining two VertexPartitions with different 
indexes is slow.
18/06/27 16:08:49.028 Executor task launch worker for task 20 WARN 
ShippableVertexPartitionOps: Diffing two VertexPartitions with different 
indexes is slow.
```


P.S. To emulate the lack of memory I just set the following options like 
[here](https://github.com/apache/spark/pull/19410/files?utf8=%E2%9C%93=unified#diff-c2823ca69af75fc6cdfd1ebbf25c2aefR85)
 to emulate lack of memory resources.
```
spark.testing.reservedMemory
spark.testing.memory
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19410: [SPARK-22184][CORE][GRAPHX] GraphX fails in case of insu...

2018-06-26 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/19410
  
Hi @szhem. Thanks for the kind reminder and thanks for your contribution. 
I'm sorry I did not respond sooner.

I no longer work where I regularly used the checkpointing code with large 
graphs. And I don't have access to any similar graph to test with now. I'm 
somewhat hamstrung by that limitation. That being said, I'll do my best to help.

With respect to the failure you're seeing, can you tell me what happens if 
you set your graph's storage level to `StorageLevel.MEMORY_AND_DISK` or 
`StorageLevel.MEMORY_AND_DISK_SER_2` without applying this patch?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19410: [SPARK-22184][CORE][GRAPHX] GraphX fails in case of insu...

2018-06-25 Thread szhem
Github user szhem commented on the issue:

https://github.com/apache/spark/pull/19410
  
Just a kind remainder...


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19410: [SPARK-22184][CORE][GRAPHX] GraphX fails in case of insu...

2018-06-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19410
  
Can one of the admins verify this patch?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19410: [SPARK-22184][CORE][GRAPHX] GraphX fails in case of insu...

2018-03-29 Thread szhem
Github user szhem commented on the issue:

https://github.com/apache/spark/pull/19410
  
Hello @viirya, @mallman, @felixcheung,

You were reviewing graph checkpointing, introduced here #15125, and this PR 
changes the behaviour a little bit.

Could you please review this PR too if possible?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19410: [SPARK-22184][CORE][GRAPHX] GraphX fails in case of insu...

2018-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19410
  
Can one of the admins verify this patch?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19410: [SPARK-22184][CORE][GRAPHX] GraphX fails in case of insu...

2018-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19410
  
Can one of the admins verify this patch?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19410: [SPARK-22184][CORE][GRAPHX] GraphX fails in case of insu...

2017-12-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19410
  
Can one of the admins verify this patch?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19410: [SPARK-22184][CORE][GRAPHX] GraphX fails in case of insu...

2017-10-16 Thread szhem
Github user szhem commented on the issue:

https://github.com/apache/spark/pull/19410
  
I would happy if anyone can take a look at this PR.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19410: [SPARK-22184][CORE][GRAPHX] GraphX fails in case of insu...

2017-10-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19410
  
Can one of the admins verify this patch?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org