[jira] [Created] (SPARK-22518) Make default cache storage level configurable

2017-11-14 Thread Rares Mirica (JIRA)
Rares Mirica created SPARK-22518:


 Summary: Make default cache storage level configurable
 Key: SPARK-22518
 URL: https://issues.apache.org/jira/browse/SPARK-22518
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Rares Mirica
Priority: Minor


Caching defaults to the hard-coded storage level MEMORY_ONLY, and since most 
users call the convenient .cache() method, this value cannot be configured 
globally. Please make it configurable through a Spark config option.
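For illustration, a minimal sketch of the current behaviour and the kind of 
option this asks for (the config key at the end is hypothetical, not an 
existing Spark setting):

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("cache-level-sketch").getOrCreate()
val df = spark.range(1000000).toDF("id")

// .cache() is hard-wired to MEMORY_ONLY, i.e. it behaves exactly like
// df.persist(StorageLevel.MEMORY_ONLY); the only per-call alternative today
// is to spell the level out explicitly:
df.persist(StorageLevel.MEMORY_AND_DISK_SER)

// What this ticket asks for, roughly (hypothetical key, not an existing option):
// spark.conf.set("spark.storage.defaultCacheLevel", "MEMORY_AND_DISK_SER")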






[jira] [Commented] (SPARK-20920) ForkJoinPool pools are leaked when writing hive tables with many partitions

2017-05-30 Thread Rares Mirica (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16029393#comment-16029393
 ] 

Rares Mirica commented on SPARK-20920:
--

Yes, as per my comment, it's a related but distinct problem. I am hoping it's 
easily solvable by moving the ForkJoinPool into the companion object of the 
case class, so that only a single pool is ever maintained.
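A rough sketch of that idea, with simplified names rather than the actual 
Spark source, and an arbitrary parallelism of 8:

import java.util.concurrent.ForkJoinPool

// Keep a single pool in the companion object so every instance of the case
// class reuses it instead of allocating (and leaking) its own.
object AlterTableRecoverPartitionsCommand {
  lazy val evalPool = new ForkJoinPool(8)   // created once per JVM
}

case class AlterTableRecoverPartitionsCommand(tableName: String) {
  // previously: a fresh pool was created per command instance and never shut down
  private def evalPool: ForkJoinPool = AlterTableRecoverPartitionsCommand.evalPool
}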

> ForkJoinPool pools are leaked when writing hive tables with many partitions
> ---
>
> Key: SPARK-20920
> URL: https://issues.apache.org/jira/browse/SPARK-20920
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Rares Mirica
>
> This bug is loosely related to SPARK-17396.
> In this case it happens when writing to a Hive table with many, many 
> partitions (my table is partitioned by hour and stores data it gets from 
> Kafka in a Spark Streaming application):
> df.repartition()
>   .write
>   .format("orc")
>   .option("path", s"$tablesStoragePath/$tableName")
>   .mode(SaveMode.Append)
>   .partitionBy("dt", "hh")
>   .saveAsTable(tableName)
> As this table grows beyond a certain size, ForkJoinPool pools start leaking. 
> Upon examination (with a debugger) I found that the caller is 
> AlterTableRecoverPartitionsCommand and the problem happens when 
> `evalTaskSupport` is used (line 555). Setting a very large threshold via 
> `spark.rdd.parallelListingThreshold` made the problem go away.
> My assumption is that the problem happens in this case and not in 
> SPARK-17396 because AlterTableRecoverPartitionsCommand is a case class, 
> while UnionRDD is an object, so multiple instances are not possible there 
> and therefore no leak occurs.
> Regards,
> Rares






[jira] [Created] (SPARK-20920) ForkJoinPool pools are leaked when writing hive tables with many partitions

2017-05-30 Thread Rares Mirica (JIRA)
Rares Mirica created SPARK-20920:


 Summary: ForkJoinPool pools are leaked when writing hive tables 
with many partitions
 Key: SPARK-20920
 URL: https://issues.apache.org/jira/browse/SPARK-20920
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.1
Reporter: Rares Mirica


This bug is loosely related to SPARK-17396.

In this case it happens when writing to a Hive table with many, many partitions 
(my table is partitioned by hour and stores data it gets from Kafka in a Spark 
Streaming application):

df.repartition()
  .write
  .format("orc")
  .option("path", s"$tablesStoragePath/$tableName")
  .mode(SaveMode.Append)
  .partitionBy("dt", "hh")
  .saveAsTable(tableName)

As this table grows beyond a certain size, ForkJoinPool pools start leaking. 
Upon examination (with a debugger) I found that the caller is 
AlterTableRecoverPartitionsCommand and the problem happens when 
`evalTaskSupport` is used (line 555). Setting a very large threshold via 
`spark.rdd.parallelListingThreshold` made the problem go away.
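For reference, the workaround amounts to something like this (the threshold 
value is arbitrary; anything larger than the partition count works):

import org.apache.spark.sql.SparkSession

// Raise the parallel-listing threshold so the ForkJoinPool-backed listing
// path in AlterTableRecoverPartitionsCommand is never taken.
val spark = SparkSession.builder()
  .appName("recover-partitions-workaround")
  .config("spark.rdd.parallelListingThreshold", Int.MaxValue.toString)
  .getOrCreate()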

My assumption is that the problem happens in this case and not in SPARK-17396 
because AlterTableRecoverPartitionsCommand is a case class, while UnionRDD is 
an object, so multiple instances are not possible there and therefore no leak 
occurs.

Regards,
Rares






[jira] [Commented] (SPARK-12072) python dataframe ._jdf.schema().json() breaks on large metadata dataframes

2016-03-21 Thread Rares Mirica (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15203912#comment-15203912
 ] 

Rares Mirica commented on SPARK-12072:
--

Thank you for looking into this, testing asap

> python dataframe ._jdf.schema().json() breaks on large metadata dataframes
> --
>
> Key: SPARK-12072
> URL: https://issues.apache.org/jira/browse/SPARK-12072
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.2
>Reporter: Rares Mirica
>
> When a DataFrame contains a column with a large number of values in ml_attr, 
> schema evaluation will routinely fail on getting the schema as JSON. This 
> will, in turn, cause a bunch of problems with, e.g., calling UDFs, because 
> accessing the columns relies on 
> _parse_datatype_json_string(self._jdf.schema().json()).






[jira] [Comment Edited] (SPARK-10413) Model should support prediction on single instance

2016-02-09 Thread Rares Mirica (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15138756#comment-15138756
 ] 

Rares Mirica edited comment on SPARK-10413 at 2/9/16 10:35 AM:
---

I don't know if I am reading this right, but processing pipelines often contain 
a relatively large number of stages. Supporting single-instance prediction on 
strong types means that the pipeline will need to be split into the 
column-manipulation stages (run over a DataFrame; e.g. PolynomialExpansion 
creates a column that is then used as the feature for prediction) and the 
single-instance run (in this case the prediction on a model).

Supporting a single Row instance would open the way for local execution of an 
entire pipeline (presumably loaded from storage), which opens up applications 
in the low-latency space (online prediction behind a REST front-end, for 
example).


was (Author: mrares):
I don't know if I am reading this right, but processing pipelines often contain 
a relatively large number of stages. Supporting single-instance prediction on 
string types means that the pipeline will need to be split into the 
column-manipulation stages (run over a DataFrame; e.g. PolynomialExpansion 
creates a column that is then used as the feature for prediction) and the 
single-instance run (in this case the prediction on a model).

Supporting a single Row instance would open the way for local execution of an 
entire pipeline (presumably loaded from storage), which opens up applications 
in the low-latency space (online prediction behind a REST front-end, for 
example).

> Model should support prediction on single instance
> --
>
> Key: SPARK-10413
> URL: https://issues.apache.org/jira/browse/SPARK-10413
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Reporter: Xiangrui Meng
>Priority: Critical
>
> Currently models in the pipeline API only implement transform(DataFrame). It 
> would be quite useful to support prediction on single instance.






[jira] [Commented] (SPARK-10413) Model should support prediction on single instance

2016-02-09 Thread Rares Mirica (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15138756#comment-15138756
 ] 

Rares Mirica commented on SPARK-10413:
--

I don't know if I am reading this right, but processing pipelines often contain 
a relatively large number of stages. Supporting single-instance prediction on 
string types means that the pipeline will need to be split into the 
column-manipulation stages (run over a DataFrame; e.g. PolynomialExpansion 
creates a column that is then used as the feature for prediction) and the 
single-instance run (in this case the prediction on a model).

Supporting a single Row instance would open the way for local execution of an 
entire pipeline (presumably loaded from storage), which opens up applications 
in the low-latency space (online prediction behind a REST front-end, for 
example).
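As an illustration of the current overhead, the only way to score one instance 
today is to wrap it in a one-row DataFrame and run transform() over it (a 
sketch using Spark 2.x package names; the model path and feature column are 
hypothetical):

import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("single-instance-sketch").getOrCreate()

// Load a previously fitted pipeline and score ONE instance by wrapping it in a
// one-row DataFrame -- exactly the overhead a single-instance API would remove.
val model = PipelineModel.load("/models/my-pipeline")   // hypothetical path
val one = spark.createDataFrame(Seq(Tuple1(Vectors.dense(0.5, 1.2, 3.4)))).toDF("features")
val prediction = model.transform(one).select("prediction").head().getDouble(0)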

> Model should support prediction on single instance
> --
>
> Key: SPARK-10413
> URL: https://issues.apache.org/jira/browse/SPARK-10413
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Reporter: Xiangrui Meng
>Priority: Critical
>
> Currently models in the pipeline API only implement transform(DataFrame). It 
> would be quite useful to support prediction on single instance.






[jira] [Commented] (SPARK-11888) Model export/import for spark.ml: DecisionTreeClassifier,Regressor

2016-01-06 Thread Rares Mirica (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15085670#comment-15085670
 ] 

Rares Mirica commented on SPARK-11888:
--

Is there any chance this will be released in another minor version of the 1.6 
branch?

> Model export/import for spark.ml: DecisionTreeClassifier,Regressor
> --
>
> Key: SPARK-11888
> URL: https://issues.apache.org/jira/browse/SPARK-11888
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>
> Partly done, but going to skip 1.6






[jira] [Commented] (SPARK-12408) Spark 1.6 with tachyon 0.8.2 uses deprecated client

2015-12-29 Thread Rares Mirica (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15073905#comment-15073905
 ] 

Rares Mirica commented on SPARK-12408:
--

Thank you for the reply, this makes sense. 

> Spark 1.6 with tachyon 0.8.2 uses deprecated client
> ---
>
> Key: SPARK-12408
> URL: https://issues.apache.org/jira/browse/SPARK-12408
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Rares Mirica
>Priority: Minor
>
> The upcoming Spark 1.6 release uses the deprecated TachyonFS API; IMHO this 
> should be avoided. I don't know whether this should fall on the Spark backlog 
> or on Tachyon's, but that is up to you.
> Related: https://tachyon.atlassian.net/browse/TACHYON-1429






[jira] [Created] (SPARK-12408) Spark 1.6 with tachyon 0.8.2 uses deprecated client

2015-12-17 Thread Rares Mirica (JIRA)
Rares Mirica created SPARK-12408:


 Summary: Spark 1.6 with tachyon 0.8.2 uses deprecated client
 Key: SPARK-12408
 URL: https://issues.apache.org/jira/browse/SPARK-12408
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.6.0
Reporter: Rares Mirica


The upcoming Spark 1.6 release uses the deprecated TachyonFS API; IMHO this 
should be avoided. I don't know whether this should fall on the Spark backlog 
or on Tachyon's, but that is up to you.

Related: https://tachyon.atlassian.net/browse/TACHYON-1429






[jira] [Commented] (SPARK-12147) Off heap storage and dynamicAllocation operation

2015-12-04 Thread Rares Mirica (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15041789#comment-15041789
 ] 

Rares Mirica commented on SPARK-12147:
--

Yes, I am talking about the executor stopping as part of scaling down under
dynamic allocation. I am observing this in an actual test; I was reading the
docs just to test my assumption.



> Off heap storage and dynamicAllocation operation
> 
>
> Key: SPARK-12147
> URL: https://issues.apache.org/jira/browse/SPARK-12147
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.5.2
> Environment: Cloudera Hadoop 2.6.0-cdh5.4.8
> Tachyon 0.7.1
> Yarn
>Reporter: Rares Mirica
>Priority: Minor
> Attachments: spark-defaults.conf
>
>
> For the purpose of increasing computation density and efficiency I set out 
> to test off-heap storage (using Tachyon) with dynamicAllocation enabled.
> Following the available documentation (the programming guide for Spark 1.5.2) 
> I was expecting data to be cached in Tachyon for the lifetime of the 
> application (driver instance) or until unpersist() is called. This belief was 
> supported by the doc: "Cached data is not lost if individual executors 
> crash.", where under "crash" I also include Graceful Decommission. 
> Furthermore, the GD description in the job-scheduling document also hints at 
> cached-data preservation through off-heap storage.
> Seeing how Tachyon is now in a state where these promises of a better future 
> are well within reach, I consider it a bug that upon graceful decommission of 
> an executor the off-heap data is deleted (presumably as part of the cleanup 
> phase).
> Needless to say, preserving the off-heap persisted data after graceful 
> decommission under dynamic allocation would yield significant improvements in 
> resource allocation, especially on YARN, where executors use up compute 
> "slots" even when idle. After a long, expensive computation where we take 
> advantage of the dynamically scaled executors, the rest of the Spark jobs can 
> use the cached data while releasing the compute resources for other cluster 
> tasks.






[jira] [Commented] (SPARK-12147) Off heap storage and dynamicAllocation operation

2015-12-04 Thread Rares Mirica (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15041683#comment-15041683
 ] 

Rares Mirica commented on SPARK-12147:
--

I would also like to object to setting this as Minor; this would be a massive 
improvement in the usability of Spark in multi-tenant or interactive-use 
environments, where a large number of executors is needed to prepare an RDD for 
later use (e.g. exploratory research) and caching is needed to avoid wasting 
resources.

The only alternative is to permanently persist the RDD, the API for which is 
quite a bit more complicated and also puts the responsibility for cleaning and 
maintaining the data on the shoulders of the user (instead of treating the data 
as ephemeral and available only for the lifetime of the current application).
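For comparison, the heavier alternative mentioned above looks roughly like this 
(paths and RDD contents are hypothetical):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("persist-alternative"))

// Persist the prepared RDD permanently and reload it in later jobs, leaving
// cleanup and lifecycle management entirely to the user.
val prepared = sc.parallelize(1 to 1000000).map(i => (i, i.toLong * i))
prepared.saveAsObjectFile("hdfs:///tmp/exploration/prepared")
val reloaded = sc.objectFile[(Int, Long)]("hdfs:///tmp/exploration/prepared")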

> Off heap storage and dynamicAllocation operation
> 
>
> Key: SPARK-12147
> URL: https://issues.apache.org/jira/browse/SPARK-12147
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.5.2
> Environment: Cloudera Hadoop 2.6.0-cdh5.4.8
> Tachyon 0.7.1
> Yarn
>Reporter: Rares Mirica
>Priority: Minor
> Attachments: spark-defaults.conf
>
>
> For the purpose of increasing computation density and efficiency I set out 
> to test off-heap storage (using Tachyon) with dynamicAllocation enabled.
> Following the available documentation (the programming guide for Spark 1.5.2) 
> I was expecting data to be cached in Tachyon for the lifetime of the 
> application (driver instance) or until unpersist() is called. This belief was 
> supported by the doc: "Cached data is not lost if individual executors 
> crash.", where under "crash" I also include Graceful Decommission. 
> Furthermore, the GD description in the job-scheduling document also hints at 
> cached-data preservation through off-heap storage.
> Seeing how Tachyon is now in a state where these promises of a better future 
> are well within reach, I consider it a bug that upon graceful decommission of 
> an executor the off-heap data is deleted (presumably as part of the cleanup 
> phase).
> Needless to say, preserving the off-heap persisted data after graceful 
> decommission under dynamic allocation would yield significant improvements in 
> resource allocation, especially on YARN, where executors use up compute 
> "slots" even when idle. After a long, expensive computation where we take 
> advantage of the dynamically scaled executors, the rest of the Spark jobs can 
> use the cached data while releasing the compute resources for other cluster 
> tasks.






[jira] [Commented] (SPARK-12147) Off heap storage and dynamicAllocation operation

2015-12-04 Thread Rares Mirica (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15041679#comment-15041679
 ] 

Rares Mirica commented on SPARK-12147:
--

Sorry, I wasn't specific enough about the use case and how to trigger/take 
advantage of this.

There is no need to cache data in the traditional sense (by calling .cache() on 
the RDD), so no on-heap space is required. One only needs to append 
.persist(OFF_HEAP) after the computation to take advantage of this, so all of 
the data resides in off-heap storage (for the time being this is Tachyon). 
There is no alternative off-heap implementation, so Tachyon is required to take 
advantage of this; the only alternative would be to serialise the result of the 
expensive computation to disk (through a .saveX call) and then re-load the RDD 
through sparkContext.textFile (or equivalent, using Parquet or Java-serialised 
objects).

The data should only live in one place, Tachyon, and should be considered 
persisted (as it would be by serialising and saving to HDFS) for the lifetime 
of the application. If that were the case, the death or decommission of an 
executor would be completely decoupled from the data originating in that 
executor and "cached" in Tachyon.
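In code, the pattern described above is roughly the following (a minimal sketch 
for Spark 1.5.x with Tachyon configured as the external block store; the input 
path and transformation are placeholders):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setAppName("offheap-sketch"))

val expensive = sc
  .textFile("hdfs:///data/events")        // hypothetical source
  .map(_.split('\t').length)              // stand-in for a costly transformation
  .persist(StorageLevel.OFF_HEAP)         // keep the result only in Tachyon, off the JVM heap

expensive.count()   // materialise once; later jobs reuse the off-heap copy --
                    // the copy a decommissioned executor should not invalidate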

> Off heap storage and dynamicAllocation operation
> 
>
> Key: SPARK-12147
> URL: https://issues.apache.org/jira/browse/SPARK-12147
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.5.2
> Environment: Cloudera Hadoop 2.6.0-cdh5.4.8
> Tachyon 0.7.1
> Yarn
>Reporter: Rares Mirica
>Priority: Minor
> Attachments: spark-defaults.conf
>
>
> For the purpose of increasing computation density and efficiency I set out 
> to test off-heap storage (using Tachyon) with dynamicAllocation enabled.
> Following the available documentation (the programming guide for Spark 1.5.2) 
> I was expecting data to be cached in Tachyon for the lifetime of the 
> application (driver instance) or until unpersist() is called. This belief was 
> supported by the doc: "Cached data is not lost if individual executors 
> crash.", where under "crash" I also include Graceful Decommission. 
> Furthermore, the GD description in the job-scheduling document also hints at 
> cached-data preservation through off-heap storage.
> Seeing how Tachyon is now in a state where these promises of a better future 
> are well within reach, I consider it a bug that upon graceful decommission of 
> an executor the off-heap data is deleted (presumably as part of the cleanup 
> phase).
> Needless to say, preserving the off-heap persisted data after graceful 
> decommission under dynamic allocation would yield significant improvements in 
> resource allocation, especially on YARN, where executors use up compute 
> "slots" even when idle. After a long, expensive computation where we take 
> advantage of the dynamically scaled executors, the rest of the Spark jobs can 
> use the cached data while releasing the compute resources for other cluster 
> tasks.






[jira] [Updated] (SPARK-12147) Off heap storage and dynamicAllocation operation

2015-12-04 Thread Rares Mirica (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rares Mirica updated SPARK-12147:
-
Attachment: spark-defaults.conf

> Off heap storage and dynamicAllocation operation
> 
>
> Key: SPARK-12147
> URL: https://issues.apache.org/jira/browse/SPARK-12147
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2
> Environment: Cloudera Hadoop 2.6.0-cdh5.4.8
> Tachyon 0.7.1
> Yarn
>Reporter: Rares Mirica
> Attachments: spark-defaults.conf
>
>
> For the purpose of increasing computation density and efficiency I set out 
> to test off-heap storage (using Tachyon) with dynamicAllocation enabled.
> Following the available documentation (the programming guide for Spark 1.5.2) 
> I was expecting data to be cached in Tachyon for the lifetime of the 
> application (driver instance) or until unpersist() is called. This belief was 
> supported by the doc: "Cached data is not lost if individual executors 
> crash.", where under "crash" I also include Graceful Decommission. 
> Furthermore, the GD description in the job-scheduling document also hints at 
> cached-data preservation through off-heap storage.
> Seeing how Tachyon is now in a state where these promises of a better future 
> are well within reach, I consider it a bug that upon graceful decommission of 
> an executor the off-heap data is deleted (presumably as part of the cleanup 
> phase).
> Needless to say, preserving the off-heap persisted data after graceful 
> decommission under dynamic allocation would yield significant improvements in 
> resource allocation, especially on YARN, where executors use up compute 
> "slots" even when idle. After a long, expensive computation where we take 
> advantage of the dynamically scaled executors, the rest of the Spark jobs can 
> use the cached data while releasing the compute resources for other cluster 
> tasks.






[jira] [Created] (SPARK-12147) Off heap storage and dynamicAllocation operation

2015-12-04 Thread Rares Mirica (JIRA)
Rares Mirica created SPARK-12147:


 Summary: Off heap storage and dynamicAllocation operation
 Key: SPARK-12147
 URL: https://issues.apache.org/jira/browse/SPARK-12147
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.5.2
 Environment: Cloudera Hadoop 2.6.0-cdh5.4.8
Tachyon 0.7.1
Yarn
Reporter: Rares Mirica


For the purpose of increasing computation density and efficiency I set out to 
test off-heap storage (using Tachyon) with dynamicAllocation enabled.

Following the available documentation (the programming guide for Spark 1.5.2) I 
was expecting data to be cached in Tachyon for the lifetime of the application 
(driver instance) or until unpersist() is called. This belief was supported by 
the doc: "Cached data is not lost if individual executors crash.", where under 
"crash" I also include Graceful Decommission. Furthermore, the GD description 
in the job-scheduling document also hints at cached-data preservation through 
off-heap storage.

Seeing how Tachyon is now in a state where these promises of a better future 
are well within reach, I consider it a bug that upon graceful decommission of 
an executor the off-heap data is deleted (presumably as part of the cleanup 
phase).

Needless to say, preserving the off-heap persisted data after graceful 
decommission under dynamic allocation would yield significant improvements in 
resource allocation, especially on YARN, where executors use up compute "slots" 
even when idle. After a long, expensive computation where we take advantage of 
the dynamically scaled executors, the rest of the Spark jobs can use the cached 
data while releasing the compute resources for other cluster tasks.
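For context, the setup under test looks roughly like the following (values are 
illustrative, assuming the Spark 1.5-era external block store key names; the 
actual settings are in the attached spark-defaults.conf):

import org.apache.spark.SparkConf

// Dynamic allocation plus Tachyon as the external block store backing
// StorageLevel.OFF_HEAP; hosts, ports and paths here are placeholders.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")   // required for dynamic allocation on YARN
  .set("spark.externalBlockStore.url", "tachyon://tachyon-master:19998")
  .set("spark.externalBlockStore.baseDir", "/spark")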






[jira] [Commented] (SPARK-12072) python dataframe ._jdf.schema().json() breaks on large metadata dataframes

2015-12-02 Thread Rares Mirica (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15035823#comment-15035823
 ] 

Rares Mirica commented on SPARK-12072:
--

My set is in the millions of parameters. I believe you are right: the schema 
should be accessible in a round-about way with minimal serialisation. I realise 
this would be one of those "add another layer of abstraction" solutions that 
might not be a good idea, but the current state means that DataFrames combined 
with some of the transformers in the pipeline API simply don't scale, for 
Python at least.

> python dataframe ._jdf.schema().json() breaks on large metadata dataframes
> --
>
> Key: SPARK-12072
> URL: https://issues.apache.org/jira/browse/SPARK-12072
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.2
>Reporter: Rares Mirica
>
> When a DataFrame contains a column with a large number of values in ml_attr, 
> schema evaluation will routinely fail on getting the schema as JSON. This 
> will, in turn, cause a bunch of problems with, e.g., calling UDFs, because 
> accessing the columns relies on 
> _parse_datatype_json_string(self._jdf.schema().json()).






[jira] [Commented] (SPARK-11352) codegen.GeneratePredicate fails due to unquoted comment

2015-12-01 Thread Rares Mirica (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033480#comment-15033480
 ] 

Rares Mirica commented on SPARK-11352:
--

Hi, I am sorry, I am no longer able to find the original problem. Is it 
possible to simply parse the strings at the point where the comment is added 
(though I don't see why you would even leave a comment there in the generated 
production code) and strip any problematic characters?
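Something along these lines, as a sketch of the stripping step suggested above 
(not the actual codegen code):

// Escape any "*/" in the expression text before embedding it inside a
// /* ... */ comment in the generated source, so Janino never sees a stray
// comment terminator.
def toSafeComment(text: String): String = text.replace("*/", "*\\/")

// e.g. a header value such as "text/html,*/*;q=0.8" is emitted with its "*/"
// escaped, and the generated predicate then compiles.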

> codegen.GeneratePredicate fails due to unquoted comment
> ---
>
> Key: SPARK-11352
> URL: https://issues.apache.org/jira/browse/SPARK-11352
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Rares Mirica
>
> Somehow the generated code ends up containing comments with unescaped 
> comment terminators, e.g.:
> /* ((input[35, StringType] <= 
> text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8) && 
> (text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8 <= input[36, 
> StringType])) */
> with emphasis on the ... =0.9,*/ ... part.
> This leads to an org.codehaus.commons.compiler.CompileException.






[jira] [Created] (SPARK-12072) python dataframe ._jdf.schema().json() breaks on large metadata dataframes

2015-12-01 Thread Rares Mirica (JIRA)
Rares Mirica created SPARK-12072:


 Summary: python dataframe ._jdf.schema().json() breaks on large 
metadata dataframes
 Key: SPARK-12072
 URL: https://issues.apache.org/jira/browse/SPARK-12072
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.5.2
Reporter: Rares Mirica


When a DataFrame contains a column with a large number of values in ml_attr, 
schema evaluation will routinely fail on getting the schema as JSON. This will, 
in turn, cause a bunch of problems with, e.g., calling UDFs, because accessing 
the columns relies on _parse_datatype_json_string(self._jdf.schema().json()).







[jira] [Created] (SPARK-11352) codegen.GeneratePredicate fails due to unquoted comment

2015-10-27 Thread Rares Mirica (JIRA)
Rares Mirica created SPARK-11352:


 Summary: codegen.GeneratePredicate fails due to unquoted comment
 Key: SPARK-11352
 URL: https://issues.apache.org/jira/browse/SPARK-11352
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.1
Reporter: Rares Mirica


Somehow the generated code ends up containing comments with unescaped comment 
terminators, e.g.:

/* ((input[35, StringType] <= 
text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8) && 
(text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8 <= input[36, 
StringType])) */

with emphasis on the ... =0.9,*/ ... part.

This leads to an org.codehaus.commons.compiler.CompileException.


