[jira] [Commented] (SPARK-16440) Undeleted broadcast variables in Word2Vec causing OoM for long runs

2017-02-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15887730#comment-15887730
 ] 

Apache Spark commented on SPARK-16440:
--

User 'AnthonyTruchet' has created a pull request for this issue:
https://github.com/apache/spark/pull/14299

> Undeleted broadcast variables in Word2Vec causing OoM for long runs 
> 
>
> Key: SPARK-16440
> URL: https://issues.apache.org/jira/browse/SPARK-16440
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.0, 1.6.1, 1.6.2, 2.0.0
>Reporter: Anthony Truchet
>Assignee: Anthony Truchet
> Fix For: 1.6.3, 2.0.1
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Three broadcast variables created at the beginning of {{Word2Vec.fit()}} are 
> neither deleted nor unpersisted. This seems to cause excessive memory 
> consumption on the driver for a job running hundreds of successive trainings.
> They are:
> {code}
> val expTable = sc.broadcast(createExpTable())
> val bcVocab = sc.broadcast(vocab)
> val bcVocabHash = sc.broadcast(vocabHash)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16440) Undeleted broadcast variables in Word2Vec causing OoM for long runs

2016-07-21 Thread Anthony Truchet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15387342#comment-15387342
 ] 

Anthony Truchet commented on SPARK-16440:
-

Regarding the try/finally: we run numerous training jobs within the same 
Spark context, and some have vocabularies so large that they fail (yes, we do 
try to filter out the ones that are too big, but "too big" is difficult to define).

So we are in a situation where we do care about resource cleanup in case of 
error, in order to enable thousands of successive training runs, some of which 
are expected to fail.

As for code readability, we can try to refactor the function to reduce the 
nesting or find a "nice" Scala solution. I'll propose a patch, and I'll welcome 
any feedback on it.
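
One possible shape for such a "nice" Scala solution is a loan-style helper that keeps the cleanup in a single place, so the training code itself needs no nested try/finally. This is only a sketch under stated assumptions: `Releasable`, `FakeBroadcast`, `Cleanup.withCleanup`, and `Demo.train` are hypothetical names introduced for illustration and are not part of Spark or of the patch discussed here. In Spark the released object would be a `Broadcast`, and `release()` would delegate to its `destroy()` or `unpersist()`.

```scala
// Hypothetical stand-in for anything that must be released explicitly.
// In Spark this role would be played by a Broadcast (destroy() / unpersist()).
trait Releasable {
  def release(): Unit
}

// Toy resource that records whether it has been released.
final class FakeBroadcast[T](val value: T) extends Releasable {
  private var released = false
  def isReleased: Boolean = released
  override def release(): Unit = { released = true }
}

object Cleanup {
  // Evaluates `body`, then releases every resource, whether `body`
  // returned normally or threw. An exception from `body` still propagates.
  def withCleanup[R](resources: Releasable*)(body: => R): R =
    try body
    finally resources.foreach(_.release())
}

object Demo {
  // Simulates one training run that may fail, e.g. on an oversized vocabulary.
  def train(vocab: FakeBroadcast[Seq[String]], fail: Boolean): Int =
    Cleanup.withCleanup(vocab) {
      if (fail) throw new IllegalStateException("vocabulary too large")
      vocab.value.size
    }
}
```

With this shape, a failed run still releases its resources before the exception reaches the caller, which is the property needed for thousands of successive runs in one context. Note that if one `release()` call itself throws, the remaining resources in the list are not released; a production version would need to handle that case.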



[jira] [Commented] (SPARK-16440) Undeleted broadcast variables in Word2Vec causing OoM for long runs

2016-07-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15384608#comment-15384608
 ] 

Apache Spark commented on SPARK-16440:
--

User 'AnthonyTruchet' has created a pull request for this issue:
https://github.com/apache/spark/pull/14268

> Undeleted broadcast variables in Word2Vec causing OoM for long runs 
> 
>
> Key: SPARK-16440
> URL: https://issues.apache.org/jira/browse/SPARK-16440
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.0, 1.6.1, 1.6.2, 2.0.0
>Reporter: Anthony Truchet
>Assignee: Sean Owen
> Fix For: 1.6.3, 2.0.0
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h



[jira] [Commented] (SPARK-16440) Undeleted broadcast variables in Word2Vec causing OoM for long runs

2016-07-19 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15384131#comment-15384131
 ] 

Sean Owen commented on SPARK-16440:
---

Yeah, it seems good to destroy even in case of errors, but in practice, an error 
here means lots of things are wrong. Actually using try-finally to destroy 
every RDD/variable would make the code a mess. I think in many cases where it 
plausibly won't matter, we don't. If there's a decent argument that errors here 
are common for some reason, OK, but I'm not sure that's true.




[jira] [Commented] (SPARK-16440) Undeleted broadcast variables in Word2Vec causing OoM for long runs

2016-07-19 Thread Anthony Truchet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15384118#comment-15384118
 ] 

Anthony Truchet commented on SPARK-16440:
-

I will, as well as putting this in a try/finally to ensure proper deletion even 
in case of errors.




[jira] [Commented] (SPARK-16440) Undeleted broadcast variables in Word2Vec causing OoM for long runs

2016-07-19 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15383971#comment-15383971
 ] 

Sean Owen commented on SPARK-16440:
---

Oh, that may be better still, indeed. Feel free to submit a follow-up associated 
with this same JIRA.




[jira] [Commented] (SPARK-16440) Undeleted broadcast variables in Word2Vec causing OoM for long runs

2016-07-19 Thread Anthony Truchet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15383967#comment-15383967
 ] 

Anthony Truchet commented on SPARK-16440:
-

Thanks for such a quick fix, [~srowen]. I was offline for the past week, which 
is why I couldn't submit the patch quickly enough.

I would have {{destroy}}ed the variables instead of {{unpersist}}ing them, 
though, as the issue was memory consumption on the driver side. What am I 
missing that made you choose the latter over the former?
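
For context, the distinction raised here can be illustrated with a toy model. It assumes the semantics described in Spark's `Broadcast` documentation: `unpersist()` removes the cached copies on executors only, while `destroy()` also releases the driver-side data and makes the variable unusable afterwards. `ToyBroadcast` and all of its members are hypothetical and run in a single process; this is not Spark code and it does not reproduce Spark's actual storage mechanism.

```scala
// Toy, single-process model of a broadcast variable. It is NOT Spark code;
// it only mimics where copies of the value live.
final class ToyBroadcast[T](initial: T) {
  private var driverCopy: Option[T] = Some(initial)
  private var executorCopies: Map[Int, T] = Map.empty

  // An executor reads the value, fetching it from the driver copy
  // and caching it under its own id.
  def valueOn(executorId: Int): T = {
    val v = driverCopy.getOrElse(
      throw new IllegalStateException("broadcast used after destroy"))
    executorCopies += (executorId -> v)
    v
  }

  // Drops executor-side copies only; the driver copy stays,
  // so later reads still work.
  def unpersist(): Unit = { executorCopies = Map.empty }

  // Drops every copy, including the driver's;
  // the variable is unusable afterwards.
  def destroy(): Unit = {
    executorCopies = Map.empty
    driverCopy = None
  }

  def cachedExecutors: Int = executorCopies.size
  def heldOnDriver: Boolean = driverCopy.isDefined
}
```

In this model, only `destroy()` lets the driver-side value become unreachable, which is why it is the relevant call when the symptom is memory growth on the driver.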




[jira] [Commented] (SPARK-16440) Undeleted broadcast variables in Word2Vec causing OoM for long runs

2016-07-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15372781#comment-15372781
 ] 

Apache Spark commented on SPARK-16440:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/14153

> Undeleted broadcast variables in Word2Vec causing OoM for long runs 
> 
>
> Key: SPARK-16440
> URL: https://issues.apache.org/jira/browse/SPARK-16440
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.0, 1.6.1, 1.6.2, 2.0.0
>Reporter: Anthony Truchet
>   Original Estimate: 4h
>  Remaining Estimate: 4h



[jira] [Commented] (SPARK-16440) Undeleted broadcast variables in Word2Vec causing OoM for long runs

2016-07-08 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15367616#comment-15367616
 ] 

Sean Owen commented on SPARK-16440:
---

Yeah, it would be fine to unpersist these at the end of the method. 
I suppose I'm surprised that the Broadcast vars don't unpersist themselves when 
they go out of scope and are garbage collected via finalize?




[jira] [Commented] (SPARK-16440) Undeleted broadcast variables in Word2Vec causing OoM for long runs

2016-07-08 Thread Anthony Truchet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15367615#comment-15367615
 ] 

Anthony Truchet commented on SPARK-16440:
-

Hello Spark developers,

I'm preparing a patch for this issue. This will be my first contribution to 
Spark. I'll strive to follow the contribution guidelines, but please do not 
hesitate to tell me how to do it better if required :-)


