[jira] [Closed] (SPARK-10091) PySpark Streaming doesn't support Context recovery from checkpoint in HDFS

2015-09-03 Thread Stanislav Los (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislav Los closed SPARK-10091.
-
Resolution: Duplicate

> PySpark Streaming doesn't support Context recovery from checkpoint in HDFS
> --
>
> Key: SPARK-10091
> URL: https://issues.apache.org/jira/browse/SPARK-10091
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Streaming
>Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.4.1
>Reporter: Stanislav Los
>
> Apparently, a PySpark Streaming application can't recover from a checkpoint 
> that was stored in HDFS. The method StreamingContext.getOrCreate looks for an 
> existing checkpoint on the local file system only. 
> {code:title=pyspark.streaming.StreamingContext.getOrCreate()}
> if not os.path.exists(checkpointPath) or not os.listdir(checkpointPath):
>     ssc = setupFunc()
>     ssc.checkpoint(checkpointPath)
>     return ssc
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10091) PySpark Streaming doesn't support Context recovery from checkpoint in HDFS

2015-08-18 Thread Stanislav Los (JIRA)
Stanislav Los created SPARK-10091:
-

 Summary: PySpark Streaming doesn't support Context recovery from 
checkpoint in HDFS
 Key: SPARK-10091
 URL: https://issues.apache.org/jira/browse/SPARK-10091
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Streaming
Affects Versions: 1.4.1, 1.4.0, 1.3.1, 1.3.0
Reporter: Stanislav Los


Apparently, a PySpark Streaming application can't recover from a checkpoint that 
was stored in HDFS. The method StreamingContext.getOrCreate looks for an existing 
checkpoint on the local file system only. 

{code:title=pyspark.streaming.StreamingContext.getOrCreate()}
if not os.path.exists(checkpointPath) or not os.listdir(checkpointPath):
    ssc = setupFunc()
    ssc.checkpoint(checkpointPath)
    return ssc
{code}
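The core of the problem is that os.path.exists consults only the driver's local filesystem, so an HDFS URI can never be found there and getOrCreate silently builds a fresh context instead of recovering. A minimal illustration (the namenode host and checkpoint path are made up):

```python
import os

# os.path.exists checks the local filesystem only, so any HDFS URI
# appears to be missing even when the checkpoint directory exists in HDFS.
checkpoint_path = "hdfs://namenode:8020/user/app/checkpoint"
print(os.path.exists(checkpoint_path))  # False, regardless of HDFS contents
```

Because the check is always False for an HDFS path, the setupFunc branch runs every time and the checkpoint is never loaded.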

Even if this is taken care of, the next piece of code fails anyway (the data is 
present in HDFS):

{code:title=pyspark.streaming.StreamingContext.getOrCreate()}
try:
    jssc = gw.jvm.JavaStreamingContext(checkpointPath)
except Exception:
    print >>sys.stderr, "failed to load StreamingContext from checkpoint"
    raise
{code}
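The bare `raise` after the error message matters: it re-raises the active exception with its original traceback, so the caller still sees the real py4j failure rather than only the printed hint. A small self-contained illustration of the pattern (load_context and the path are hypothetical stand-ins):

```python
import sys

def load_context(path):
    try:
        # Stand-in for the py4j call that fails during recovery.
        raise IOError("cannot read checkpoint: %s" % path)
    except Exception:
        # Log a hint, then re-raise: a bare `raise` preserves the
        # original exception and traceback for the caller.
        sys.stderr.write("failed to load StreamingContext from checkpoint\n")
        raise

try:
    load_context("hdfs://namenode:8020/user/app/checkpoint")
except IOError as exc:
    caught = str(exc)
print(caught)
```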






[jira] [Updated] (SPARK-10091) PySpark Streaming doesn't support Context recovery from checkpoint in HDFS

2015-08-18 Thread Stanislav Los (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislav Los updated SPARK-10091:
--
Description: 
Apparently, a PySpark Streaming application can't recover from a checkpoint that 
was stored in HDFS. The method StreamingContext.getOrCreate looks for an existing 
checkpoint on the local file system only. 

{code:title=pyspark.streaming.StreamingContext.getOrCreate()}
if not os.path.exists(checkpointPath) or not os.listdir(checkpointPath):
    ssc = setupFunc()
    ssc.checkpoint(checkpointPath)
    return ssc
{code}

  was:
Apparently, a PySpark Streaming application can't recover from a checkpoint that 
was stored in HDFS. The method StreamingContext.getOrCreate looks for an existing 
checkpoint on the local file system only. 

{code:title=pyspark.streaming.StreamingContext.getOrCreate()}
if not os.path.exists(checkpointPath) or not os.listdir(checkpointPath):
    ssc = setupFunc()
    ssc.checkpoint(checkpointPath)
    return ssc
{code}

Even if this is taken care of, the next piece of code fails anyway (the data is 
present in HDFS):

{code:title=pyspark.streaming.StreamingContext.getOrCreate()}
try:
    jssc = gw.jvm.JavaStreamingContext(checkpointPath)
except Exception:
    print >>sys.stderr, "failed to load StreamingContext from checkpoint"
    raise
{code}


> PySpark Streaming doesn't support Context recovery from checkpoint in HDFS
> --
>
> Key: SPARK-10091
> URL: https://issues.apache.org/jira/browse/SPARK-10091
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Streaming
>Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.4.1
>Reporter: Stanislav Los
>
> Apparently, a PySpark Streaming application can't recover from a checkpoint 
> that was stored in HDFS. The method StreamingContext.getOrCreate looks for an 
> existing checkpoint on the local file system only. 
> {code:title=pyspark.streaming.StreamingContext.getOrCreate()}
> if not os.path.exists(checkpointPath) or not os.listdir(checkpointPath):
>     ssc = setupFunc()
>     ssc.checkpoint(checkpointPath)
>     return ssc
> {code}






[jira] [Created] (SPARK-5657) Add PySpark Avro Output Format example

2015-02-06 Thread Stanislav Los (JIRA)
Stanislav Los created SPARK-5657:


 Summary: Add PySpark Avro Output Format example
 Key: SPARK-5657
 URL: https://issues.apache.org/jira/browse/SPARK-5657
 Project: Spark
  Issue Type: Improvement
Reporter: Stanislav Los


There is an Avro Input Format example that shows how to read Avro data in 
PySpark, but nothing shows how to write from PySpark to Avro. The main 
challenge is that a Converter needs an Avro schema to build a record, but the 
current Spark API doesn't provide a way to supply extra parameters to custom 
converters. A workaround is possible.
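One shape such a workaround could take, sketched here in Python for illustration only (Spark's real output converters are JVM-side classes, and the class name, method, and AVRO_OUTPUT_SCHEMA variable are all invented): since a custom converter is constructed without arguments, the schema can be smuggled in through the environment and read at construction time.

```python
import json
import os

# Hypothetical sketch, not Spark's converter API. The point is the
# pattern: a no-argument converter pulls its Avro schema from an
# environment variable because the API offers no way to pass it in.
class AvroRecordConverter(object):
    def __init__(self):
        self.schema = json.loads(os.environ["AVRO_OUTPUT_SCHEMA"])
        self.field_names = [f["name"] for f in self.schema["fields"]]

    def convert(self, obj):
        # Keep only the fields declared in the schema, in schema order.
        return {name: obj.get(name) for name in self.field_names}

os.environ["AVRO_OUTPUT_SCHEMA"] = json.dumps({
    "type": "record", "name": "User",
    "fields": [{"name": "id", "type": "int"},
               {"name": "name", "type": "string"}],
})
converter = AvroRecordConverter()
record = converter.convert({"id": 1, "name": "Ada", "extra": "dropped"})
print(record)
```

The same environment-variable trick would apply to a JVM converter reading System.getenv, since both driver and executors can be launched with the variable set.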






[jira] [Updated] (SPARK-5657) Add PySpark Avro Output Format example

2015-02-06 Thread Stanislav Los (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislav Los updated SPARK-5657:
-
  Component/s: PySpark
   Examples
Affects Version/s: 1.2.0

> Add PySpark Avro Output Format example
> --
>
> Key: SPARK-5657
> URL: https://issues.apache.org/jira/browse/SPARK-5657
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples, PySpark
>Affects Versions: 1.2.0
>Reporter: Stanislav Los
>
> There is an Avro Input Format example that shows how to read Avro data in 
> PySpark, but nothing shows how to write from PySpark to Avro. The main 
> challenge is that a Converter needs an Avro schema to build a record, but the 
> current Spark API doesn't provide a way to supply extra parameters to custom 
> converters. A workaround is possible.






[jira] [Commented] (SPARK-23904) Big execution plan cause OOM

2018-10-11 Thread Stanislav Los (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16646723#comment-16646723
 ] 

Stanislav Los commented on SPARK-23904:
---

[~igreenfi] [~RBerenguel] we had the same issue, and I found an easier solution 
(without the need to alter Spark code). See below. I also updated 
stackoverflow.

We faced the same issue, and the solution is to set the parameter 
"spark.sql.ui.retainedExecutions" to a lower value, for example --conf 
"spark.sql.ui.retainedExecutions=10". 
By default it's 1000.


This keeps the instance count of 
org.apache.spark.sql.execution.ui.SQLExecutionUIData low enough.
SQLExecutionUIData holds a reference to physicalPlanDescription, which can get 
very big.
In our case we had to read huge Avro messages from Kafka with lots of fields, 
and each plan description was in the area of 8 MB.
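The effect of the setting can be modeled with a bounded buffer. This is a simplification of the UI listener's trimming, not Spark's actual code, and the string sizes are scaled down from the multi-megabyte plans described above:

```python
from collections import deque

# Simplified model of spark.sql.ui.retainedExecutions: the UI keeps only
# the most recent N SQLExecutionUIData entries, each holding a potentially
# huge physicalPlanDescription string, so lowering N caps that memory.
RETAINED_EXECUTIONS = 10  # e.g. --conf "spark.sql.ui.retainedExecutions=10"

completed_executions = deque(maxlen=RETAINED_EXECUTIONS)
for execution_id in range(1000):  # the default of 1000 would retain all
    plan_description = "x" * 8192  # stand-in for an ~8 MB plan string
    completed_executions.append((execution_id, plan_description))

print(len(completed_executions))  # 10: older entries were dropped
```

With the bound at 10, the retained plan text stays around 10 entries' worth instead of 1000, which is why lowering the value relieves driver memory pressure.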

> Big execution plan cause OOM
> 
>
> Key: SPARK-23904
> URL: https://issues.apache.org/jira/browse/SPARK-23904
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Izek Greenfield
>Priority: Major
>  Labels: SQL, query
>
> I created a question on 
> [StackOverflow|https://stackoverflow.com/questions/49508683/spark-physicalplandescription-string-is-to-big]
>  
> Spark creates the text representation of the query in any case, even if I 
> don't need it.
> That causes many garbage objects and unneeded GC... 
>  [Gist with code to 
> reproduce|https://gist.github.com/igreenfield/584c3336f03ba7d63e9026774eaf5e23]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23904) Big execution plan cause OOM

2018-10-11 Thread Stanislav Los (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16646769#comment-16646769
 ] 

Stanislav Los commented on SPARK-23904:
---

For [~igreenfi]'s case, I'd think setting the parameter to zero would work; the 
plan description would then never be created in the first place.

> Big execution plan cause OOM
> 
>
> Key: SPARK-23904
> URL: https://issues.apache.org/jira/browse/SPARK-23904
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Izek Greenfield
>Priority: Major
>  Labels: SQL, query
>
> I created a question on 
> [StackOverflow|https://stackoverflow.com/questions/49508683/spark-physicalplandescription-string-is-to-big]
>  
> Spark creates the text representation of the query in any case, even if I 
> don't need it.
> That causes many garbage objects and unneeded GC... 
>  [Gist with code to 
> reproduce|https://gist.github.com/igreenfield/584c3336f03ba7d63e9026774eaf5e23]
>  


