[jira] [Created] (SPARK-22518) Make default cache storage level configurable
Rares Mirica created SPARK-22518:
------------------------------------

Summary: Make default cache storage level configurable
Key: SPARK-22518
URL: https://issues.apache.org/jira/browse/SPARK-22518
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 2.2.0
Reporter: Rares Mirica
Priority: Minor

Caching defaults to the hard-coded value MEMORY_ONLY, and since most users call the convenient .cache() method, this level is not configurable in a global way. Please make it configurable through a Spark config option.
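A minimal sketch of what such a global default could look like, assuming a hypothetical config key spark.cache.defaultStorageLevel (the key name is illustrative, not an existing Spark option); StorageLevel.fromString is an existing Spark API:

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD
    import org.apache.spark.storage.StorageLevel

    // Hypothetical: resolve the default cache level from the application conf
    // instead of hard-coding MEMORY_ONLY, falling back to today's default.
    def defaultCacheLevel(sc: SparkContext): StorageLevel =
      StorageLevel.fromString(
        sc.getConf.get("spark.cache.defaultStorageLevel", "MEMORY_ONLY"))

    // What a conf-aware cache() could then delegate to:
    def cacheWithDefault[T](rdd: RDD[T]): RDD[T] =
      rdd.persist(defaultCacheLevel(rdd.sparkContext))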
[jira] [Commented] (SPARK-20920) ForkJoinPool pools are leaked when writing hive tables with many partitions
[ https://issues.apache.org/jira/browse/SPARK-20920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16029393#comment-16029393 ]

Rares Mirica commented on SPARK-20920:
--------------------------------------

Yes, as per my comment, it's a related but distinct problem. I am hoping it's easily solvable by moving the ForkJoinPool into the companion object of the case class, so that a single instance is maintained.

> ForkJoinPool pools are leaked when writing hive tables with many partitions
> ----------------------------------------------------------------------------
>
> Key: SPARK-20920
> URL: https://issues.apache.org/jira/browse/SPARK-20920
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.1
> Reporter: Rares Mirica
>
> This bug is loosely related to SPARK-17396.
> In this case it happens when writing to a hive table with many, many partitions (my table is partitioned by hour and stores data it gets from kafka in a spark streaming application):
>
> df.repartition()
>   .write
>   .format("orc")
>   .option("path", s"$tablesStoragePath/$tableName")
>   .mode(SaveMode.Append)
>   .partitionBy("dt", "hh")
>   .saveAsTable(tableName)
>
> As this table grows beyond a certain size, ForkJoinPool pools start leaking. Upon examination (with a debugger) I found that the caller is AlterTableRecoverPartitionsCommand and the problem happens when `evalTaskSupport` is used (line 555). I have tried setting a very large threshold via `spark.rdd.parallelListingThreshold` and the problem went away.
> My assumption is that the problem happens in this case and not in the one in SPARK-17396 because AlterTableRecoverPartitionsCommand is a case class while UnionRDD is an object, so multiple instances are not possible and therefore there is no leak.
> Regards,
> Rares
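A minimal sketch of the companion-object approach suggested above (names and the pool size are simplified relative to the real AlterTableRecoverPartitionsCommand source; this is an illustration of the pattern, not the actual fix):

    import scala.collection.parallel.ForkJoinTaskSupport
    import scala.concurrent.forkjoin.ForkJoinPool

    case class AlterTableRecoverPartitionsCommand(tableName: String) {
      def run(): Unit = {
        val partitions = (1 to 100).par
        // Reuse the single shared pool instead of allocating a new
        // ForkJoinPool per command instance (the source of the leak).
        partitions.tasksupport = AlterTableRecoverPartitionsCommand.evalTaskSupport
        partitions.foreach(_ => ()) // stand-in for per-partition listing work
      }
    }

    object AlterTableRecoverPartitionsCommand {
      // One pool shared by all instances, created once and never leaked.
      val evalTaskSupport = new ForkJoinTaskSupport(new ForkJoinPool(8))
    }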
[jira] [Created] (SPARK-20920) ForkJoinPool pools are leaked when writing hive tables with many partitions
Rares Mirica created SPARK-20920:
------------------------------------

Summary: ForkJoinPool pools are leaked when writing hive tables with many partitions
Key: SPARK-20920
URL: https://issues.apache.org/jira/browse/SPARK-20920
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.1.1
Reporter: Rares Mirica

This bug is loosely related to SPARK-17396.

In this case it happens when writing to a hive table with many, many partitions (my table is partitioned by hour and stores data it gets from kafka in a spark streaming application):

df.repartition()
  .write
  .format("orc")
  .option("path", s"$tablesStoragePath/$tableName")
  .mode(SaveMode.Append)
  .partitionBy("dt", "hh")
  .saveAsTable(tableName)

As this table grows beyond a certain size, ForkJoinPool pools start leaking. Upon examination (with a debugger) I found that the caller is AlterTableRecoverPartitionsCommand and the problem happens when `evalTaskSupport` is used (line 555). I have tried setting a very large threshold via `spark.rdd.parallelListingThreshold` and the problem went away.

My assumption is that the problem happens in this case and not in the one in SPARK-17396 because AlterTableRecoverPartitionsCommand is a case class while UnionRDD is an object, so multiple instances are not possible and therefore there is no leak.

Regards,
Rares
[jira] [Commented] (SPARK-12072) python dataframe ._jdf.schema().json() breaks on large metadata dataframes
[ https://issues.apache.org/jira/browse/SPARK-12072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15203912#comment-15203912 ]

Rares Mirica commented on SPARK-12072:
--------------------------------------

Thank you for looking into this, testing ASAP.

> python dataframe ._jdf.schema().json() breaks on large metadata dataframes
> ---------------------------------------------------------------------------
>
> Key: SPARK-12072
> URL: https://issues.apache.org/jira/browse/SPARK-12072
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 1.5.2
> Reporter: Rares Mirica
>
> When a dataframe contains a column with a large number of values in ml_attr, schema evaluation will routinely fail on getting the schema as JSON. This will, in turn, cause a bunch of problems with, e.g., calling UDFs, because calling .columns relies on _parse_datatype_json_string(self._jdf.schema().json()).
[jira] [Comment Edited] (SPARK-10413) Model should support prediction on single instance
[ https://issues.apache.org/jira/browse/SPARK-10413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15138756#comment-15138756 ]

Rares Mirica edited comment on SPARK-10413 at 2/9/16 10:35 AM:
---------------------------------------------------------------

I don't know if I am reading this right, but processing pipelines often contain a relatively large number of stages. Supporting single instance on strong types means that the pipeline will need to be split into the column-manipulation stages (run over a dataframe, e.g. PolynomialExpansion creates a column that is then used as the feature for prediction) and the single-instance run (in this case the prediction on a model). Supporting a single Row instance would open the way for local execution of an entire pipeline (presumably loaded from storage), which opens up applications in the low-latency space (online prediction with a REST front-end, for example).

was (Author: mrares):
I don't know if I am reading this right, but processing pipelines often contain a relatively large number of stages. Supporting single instance on string types means that the pipeline will need to be split into the column-manipulation stages (run over a dataframe, e.g. PolynomialExpansion creates a column that is then used as the feature for prediction) and the single-instance run (in this case the prediction on a model). Supporting a single Row instance would open the way for local execution of an entire pipeline (presumably loaded from storage), which opens up applications in the low-latency space (online prediction with a REST front-end, for example).

> Model should support prediction on single instance
> ---------------------------------------------------
>
> Key: SPARK-10413
> URL: https://issues.apache.org/jira/browse/SPARK-10413
> Project: Spark
> Issue Type: Umbrella
> Components: ML
> Reporter: Xiangrui Meng
> Priority: Critical
>
> Currently models in the pipeline API only implement transform(DataFrame). It would be quite useful to support prediction on a single instance.
[jira] [Commented] (SPARK-10413) Model should support prediction on single instance
[ https://issues.apache.org/jira/browse/SPARK-10413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15138756#comment-15138756 ]

Rares Mirica commented on SPARK-10413:
--------------------------------------

I don't know if I am reading this right, but processing pipelines often contain a relatively large number of stages. Supporting single instance on string types means that the pipeline will need to be split into the column-manipulation stages (run over a dataframe, e.g. PolynomialExpansion creates a column that is then used as the feature for prediction) and the single-instance run (in this case the prediction on a model). Supporting a single Row instance would open the way for local execution of an entire pipeline (presumably loaded from storage), which opens up applications in the low-latency space (online prediction with a REST front-end, for example).

> Model should support prediction on single instance
> ---------------------------------------------------
>
> Key: SPARK-10413
> URL: https://issues.apache.org/jira/browse/SPARK-10413
> Project: Spark
> Issue Type: Umbrella
> Components: ML
> Reporter: Xiangrui Meng
> Priority: Critical
>
> Currently models in the pipeline API only implement transform(DataFrame). It would be quite useful to support prediction on a single instance.
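For context, a sketch of the one-row-DataFrame workaround this comment is arguing against: the whole pipeline's transform() has to run just to score a single instance (Spark 2.x-style APIs assumed for brevity; predictOne is an illustrative name, not a Spark method):

    import org.apache.spark.ml.PipelineModel
    import org.apache.spark.ml.linalg.Vector
    import org.apache.spark.sql.SparkSession

    // Today's only option: wrap the single instance in a one-row DataFrame and
    // run the full pipeline transform -- too heavyweight for online prediction.
    def predictOne(spark: SparkSession, model: PipelineModel, features: Vector): Double = {
      import spark.implicits._
      val df = Seq(Tuple1(features)).toDF("features")
      model.transform(df).select("prediction").as[Double].head()
    }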
[jira] [Commented] (SPARK-11888) Model export/import for spark.ml: DecisionTreeClassifier,Regressor
[ https://issues.apache.org/jira/browse/SPARK-11888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15085670#comment-15085670 ]

Rares Mirica commented on SPARK-11888:
--------------------------------------

Is there any chance this will be released in another minor version of the 1.6 branch?

> Model export/import for spark.ml: DecisionTreeClassifier,Regressor
> -------------------------------------------------------------------
>
> Key: SPARK-11888
> URL: https://issues.apache.org/jira/browse/SPARK-11888
> Project: Spark
> Issue Type: Sub-task
> Components: ML
> Reporter: Joseph K. Bradley
> Assignee: Joseph K. Bradley
>
> Partly done, but going to skip 1.6.
[jira] [Commented] (SPARK-12408) Spark 1.6 with tachyon 0.8.2 uses deprecated client
[ https://issues.apache.org/jira/browse/SPARK-12408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15073905#comment-15073905 ]

Rares Mirica commented on SPARK-12408:
--------------------------------------

Thank you for the reply, this makes sense.

> Spark 1.6 with tachyon 0.8.2 uses deprecated client
> ----------------------------------------------------
>
> Key: SPARK-12408
> URL: https://issues.apache.org/jira/browse/SPARK-12408
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 1.6.0
> Reporter: Rares Mirica
> Priority: Minor
>
> The upcoming Spark release, 1.6, uses the deprecated TachyonFS API; IMHO this should be avoided. I don't know whether this should fall on the Spark backlog or Tachyon's, but that is up to you.
> Related: https://tachyon.atlassian.net/browse/TACHYON-1429
[jira] [Created] (SPARK-12408) Spark 1.6 with tachyon 0.8.2 uses deprecated client
Rares Mirica created SPARK-12408:
------------------------------------

Summary: Spark 1.6 with tachyon 0.8.2 uses deprecated client
Key: SPARK-12408
URL: https://issues.apache.org/jira/browse/SPARK-12408
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 1.6.0
Reporter: Rares Mirica

The upcoming Spark release, 1.6, uses the deprecated TachyonFS API; IMHO this should be avoided. I don't know whether this should fall on the Spark backlog or Tachyon's, but that is up to you.

Related: https://tachyon.atlassian.net/browse/TACHYON-1429
[jira] [Commented] (SPARK-12147) Off heap storage and dynamicAllocation operation
[ https://issues.apache.org/jira/browse/SPARK-12147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15041789#comment-15041789 ]

Rares Mirica commented on SPARK-12147:
--------------------------------------

Yes, I am talking about the executor stopping as part of scaling down under dynamic allocation. I am observing this in an actual test; I was reading the docs just to test my assumption.

> Off heap storage and dynamicAllocation operation
> -------------------------------------------------
>
> Key: SPARK-12147
> URL: https://issues.apache.org/jira/browse/SPARK-12147
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 1.5.2
> Environment: Cloudera Hadoop 2.6.0-cdh5.4.8
> Tachyon 0.7.1
> Yarn
> Reporter: Rares Mirica
> Priority: Minor
> Attachments: spark-defaults.conf
>
> For the purpose of increasing computation density and efficiency, I set out to test off-heap storage (using Tachyon) with dynamicAllocation enabled.
> Following the available documentation (programming-guide for Spark 1.5.2), I was expecting data to be cached in Tachyon for the lifetime of the application (driver instance) or until unpersist() is called. This belief was supported by the doc: "Cached data is not lost if individual executors crash.", where I take "crash" to include Graceful Decommission as well. Furthermore, the description of Graceful Decommission in the job-scheduling document also hints at cached-data preservation through off-heap storage.
> Seeing how Tachyon is now in a state where these promises of a better future are well within reach, I consider it a bug that upon graceful decommission of an executor the off-heap data is deleted (presumably as part of the cleanup phase).
> Needless to say, preserving off-heap persisted data after graceful decommission under dynamic allocation would yield significant improvements in resource allocation, especially on YARN, where executors use up compute "slots" even if idle. After a long, expensive computation where we take advantage of the dynamically scaled executors, the rest of the Spark jobs can use the cached data while releasing the compute resources for other cluster tasks.
[jira] [Commented] (SPARK-12147) Off heap storage and dynamicAllocation operation
[ https://issues.apache.org/jira/browse/SPARK-12147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15041683#comment-15041683 ]

Rares Mirica commented on SPARK-12147:
--------------------------------------

I would also like to object to setting this as Minor: this is a massive improvement to the usability of Spark in multi-tenant or interactive-use environments, where a large number of executors is needed to prepare an RDD for later use (e.g. exploratory research) and caching is needed to avoid wasting resources. The only alternative is to permanently persist the RDD, the API for which is quite a bit more complicated; it also puts the responsibility of cleaning and maintaining the data on the shoulders of the user (instead of treating the data as ephemeral and only available for the lifetime of the current application).

> Off heap storage and dynamicAllocation operation
> -------------------------------------------------
>
> Key: SPARK-12147
> URL: https://issues.apache.org/jira/browse/SPARK-12147
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 1.5.2
> Environment: Cloudera Hadoop 2.6.0-cdh5.4.8
> Tachyon 0.7.1
> Yarn
> Reporter: Rares Mirica
> Priority: Minor
> Attachments: spark-defaults.conf
>
> For the purpose of increasing computation density and efficiency, I set out to test off-heap storage (using Tachyon) with dynamicAllocation enabled.
> Following the available documentation (programming-guide for Spark 1.5.2), I was expecting data to be cached in Tachyon for the lifetime of the application (driver instance) or until unpersist() is called. This belief was supported by the doc: "Cached data is not lost if individual executors crash.", where I take "crash" to include Graceful Decommission as well. Furthermore, the description of Graceful Decommission in the job-scheduling document also hints at cached-data preservation through off-heap storage.
> Seeing how Tachyon is now in a state where these promises of a better future are well within reach, I consider it a bug that upon graceful decommission of an executor the off-heap data is deleted (presumably as part of the cleanup phase).
> Needless to say, preserving off-heap persisted data after graceful decommission under dynamic allocation would yield significant improvements in resource allocation, especially on YARN, where executors use up compute "slots" even if idle. After a long, expensive computation where we take advantage of the dynamically scaled executors, the rest of the Spark jobs can use the cached data while releasing the compute resources for other cluster tasks.
[jira] [Commented] (SPARK-12147) Off heap storage and dynamicAllocation operation
[ https://issues.apache.org/jira/browse/SPARK-12147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15041679#comment-15041679 ]

Rares Mirica commented on SPARK-12147:
--------------------------------------

Sorry, I wasn't specific enough about the use-case and how to trigger/take advantage of this. There is no need to cache data in the traditional sense (by calling .cache() on the RDD), so no on-heap space is required; one only needs to append .persist(OFF_HEAP) after the computation to take advantage of this. All of the data should therefore reside in off-heap storage (for the time being this is Tachyon). There is no alternative off-heap implementation, so Tachyon is required to take advantage of this; the only alternative would be to serialise the result of the expensive computation to disk (through a .saveX call) and then re-load the RDD through sparkContext.textFile (or equivalent, using parquet or java-serialised objects). The data should only live in one place, Tachyon, and should be considered persisted (as it would be through serialising and saving to HDFS) for the lifetime of the application. If this were the case, the death or decommission of an executor would be completely decoupled from the data originating in that executor and "cached" in Tachyon.

> Off heap storage and dynamicAllocation operation
> -------------------------------------------------
>
> Key: SPARK-12147
> URL: https://issues.apache.org/jira/browse/SPARK-12147
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 1.5.2
> Environment: Cloudera Hadoop 2.6.0-cdh5.4.8
> Tachyon 0.7.1
> Yarn
> Reporter: Rares Mirica
> Priority: Minor
> Attachments: spark-defaults.conf
>
> For the purpose of increasing computation density and efficiency, I set out to test off-heap storage (using Tachyon) with dynamicAllocation enabled.
> Following the available documentation (programming-guide for Spark 1.5.2), I was expecting data to be cached in Tachyon for the lifetime of the application (driver instance) or until unpersist() is called. This belief was supported by the doc: "Cached data is not lost if individual executors crash.", where I take "crash" to include Graceful Decommission as well. Furthermore, the description of Graceful Decommission in the job-scheduling document also hints at cached-data preservation through off-heap storage.
> Seeing how Tachyon is now in a state where these promises of a better future are well within reach, I consider it a bug that upon graceful decommission of an executor the off-heap data is deleted (presumably as part of the cleanup phase).
> Needless to say, preserving off-heap persisted data after graceful decommission under dynamic allocation would yield significant improvements in resource allocation, especially on YARN, where executors use up compute "slots" even if idle. After a long, expensive computation where we take advantage of the dynamically scaled executors, the rest of the Spark jobs can use the cached data while releasing the compute resources for other cluster tasks.
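A minimal, self-contained sketch of the usage pattern described in the comment (StorageLevel.OFF_HEAP is a real storage level, Tachyon-backed in Spark 1.5.x; the computation is a stand-in for the long, expensive one described above):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    object OffHeapCacheExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("off-heap-cache"))

        // Stand-in for the expensive computation whose result should outlive
        // individual executors.
        val prepared = sc.parallelize(1 to 1000000)
          .map(x => x.toLong * x)
          .persist(StorageLevel.OFF_HEAP) // off-heap, no on-heap copy kept

        prepared.count() // first action materialises the blocks off-heap

        // Any later job in the same application reuses `prepared` without
        // recomputation; the ticket asks that dynamic-allocation decommission
        // of idle executors not delete these off-heap blocks.
        println(prepared.take(5).mkString(", "))

        sc.stop()
      }
    }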
[jira] [Updated] (SPARK-12147) Off heap storage and dynamicAllocation operation
[ https://issues.apache.org/jira/browse/SPARK-12147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rares Mirica updated SPARK-12147:
---------------------------------
Attachment: spark-defaults.conf

> Off heap storage and dynamicAllocation operation
> -------------------------------------------------
>
> Key: SPARK-12147
> URL: https://issues.apache.org/jira/browse/SPARK-12147
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.5.2
> Environment: Cloudera Hadoop 2.6.0-cdh5.4.8
> Tachyon 0.7.1
> Yarn
> Reporter: Rares Mirica
> Attachments: spark-defaults.conf
>
> For the purpose of increasing computation density and efficiency, I set out to test off-heap storage (using Tachyon) with dynamicAllocation enabled.
> Following the available documentation (programming-guide for Spark 1.5.2), I was expecting data to be cached in Tachyon for the lifetime of the application (driver instance) or until unpersist() is called. This belief was supported by the doc: "Cached data is not lost if individual executors crash.", where I take "crash" to include Graceful Decommission as well. Furthermore, the description of Graceful Decommission in the job-scheduling document also hints at cached-data preservation through off-heap storage.
> Seeing how Tachyon is now in a state where these promises of a better future are well within reach, I consider it a bug that upon graceful decommission of an executor the off-heap data is deleted (presumably as part of the cleanup phase).
> Needless to say, preserving off-heap persisted data after graceful decommission under dynamic allocation would yield significant improvements in resource allocation, especially on YARN, where executors use up compute "slots" even if idle. After a long, expensive computation where we take advantage of the dynamically scaled executors, the rest of the Spark jobs can use the cached data while releasing the compute resources for other cluster tasks.
[jira] [Created] (SPARK-12147) Off heap storage and dynamicAllocation operation
Rares Mirica created SPARK-12147:
------------------------------------

Summary: Off heap storage and dynamicAllocation operation
Key: SPARK-12147
URL: https://issues.apache.org/jira/browse/SPARK-12147
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.5.2
Environment: Cloudera Hadoop 2.6.0-cdh5.4.8
Tachyon 0.7.1
Yarn
Reporter: Rares Mirica

For the purpose of increasing computation density and efficiency, I set out to test off-heap storage (using Tachyon) with dynamicAllocation enabled.

Following the available documentation (programming-guide for Spark 1.5.2), I was expecting data to be cached in Tachyon for the lifetime of the application (driver instance) or until unpersist() is called. This belief was supported by the doc: "Cached data is not lost if individual executors crash.", where I take "crash" to include Graceful Decommission as well. Furthermore, the description of Graceful Decommission in the job-scheduling document also hints at cached-data preservation through off-heap storage.

Seeing how Tachyon is now in a state where these promises of a better future are well within reach, I consider it a bug that upon graceful decommission of an executor the off-heap data is deleted (presumably as part of the cleanup phase).

Needless to say, preserving off-heap persisted data after graceful decommission under dynamic allocation would yield significant improvements in resource allocation, especially on YARN, where executors use up compute "slots" even if idle. After a long, expensive computation where we take advantage of the dynamically scaled executors, the rest of the Spark jobs can use the cached data while releasing the compute resources for other cluster tasks.
[jira] [Commented] (SPARK-12072) python dataframe ._jdf.schema().json() breaks on large metadata dataframes
[ https://issues.apache.org/jira/browse/SPARK-12072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15035823#comment-15035823 ]

Rares Mirica commented on SPARK-12072:
--------------------------------------

My set is in the millions of parameters. I believe you are right: the schema should be accessible in a round-about way with minimal serialisation. I realise this would be one of those "add another layer of abstraction" solutions that might not be a good idea, but in the current state, DataFrames combined with some of the transformers in the pipeline API simply don't scale, for Python at least.

> python dataframe ._jdf.schema().json() breaks on large metadata dataframes
> ---------------------------------------------------------------------------
>
> Key: SPARK-12072
> URL: https://issues.apache.org/jira/browse/SPARK-12072
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 1.5.2
> Reporter: Rares Mirica
>
> When a dataframe contains a column with a large number of values in ml_attr, schema evaluation will routinely fail on getting the schema as JSON. This will, in turn, cause a bunch of problems with, e.g., calling UDFs, because calling .columns relies on _parse_datatype_json_string(self._jdf.schema().json()).
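To make the scaling problem concrete, a small Scala sketch of how the schema JSON grows with ml_attr metadata (the PySpark failure is in shipping this JSON string through Py4J; the feature count below is illustrative):

    import org.apache.spark.ml.attribute.{Attribute, AttributeGroup, NumericAttribute}
    import org.apache.spark.sql.types.StructType

    // Build a schema whose single vector column carries per-element ML
    // attribute metadata for a large number of features.
    val numFeatures = 100000
    val attrs: Array[Attribute] =
      Array.tabulate(numFeatures)(i => NumericAttribute.defaultAttr.withName(s"f$i"))
    val schema = StructType(Seq(new AttributeGroup("features", attrs).toStructField()))

    // The JSON PySpark must transfer grows linearly with the attribute count.
    println(schema.json.length)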
[jira] [Commented] (SPARK-11352) codegen.GeneratePredicate fails due to unquoted comment
[ https://issues.apache.org/jira/browse/SPARK-11352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033480#comment-15033480 ]

Rares Mirica commented on SPARK-11352:
--------------------------------------

Hi, I am sorry, I am no longer able to find the original problem. Is it possible to simply parse the strings at the point where the comment is added (though I don't see why you would even leave a comment there in the generated production code) and strip any problematic characters?

> codegen.GeneratePredicate fails due to unquoted comment
> --------------------------------------------------------
>
> Key: SPARK-11352
> URL: https://issues.apache.org/jira/browse/SPARK-11352
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5.1
> Reporter: Rares Mirica
>
> Somehow the code being generated ends up having comments with comment-terminators unquoted, e.g.:
>
> /* ((input[35, StringType] <= text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8) && (text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8 <= input[36, StringType])) */
>
> with emphasis on ... =0.9,*/ ...
> This leads to an org.codehaus.commons.compiler.CompileException.
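A sketch of the stripping/escaping suggested in the comment (the helper name is illustrative; Spark's actual fix may differ): break up any "*/" terminator before embedding a value in a generated comment.

    // Illustrative helper: make an arbitrary string safe to embed inside a
    // Java block comment by breaking up the "*/" terminator sequence.
    def toCommentSafeString(s: String): String =
      s.replace("*/", "*\\/")

    // The Accept-header value from the bug report no longer closes the
    // generated comment early:
    val v = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
    println(s"/* ${toCommentSafeString(v)} */")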
[jira] [Created] (SPARK-12072) python dataframe ._jdf.schema().json() breaks on large metadata dataframes
Rares Mirica created SPARK-12072:
------------------------------------

Summary: python dataframe ._jdf.schema().json() breaks on large metadata dataframes
Key: SPARK-12072
URL: https://issues.apache.org/jira/browse/SPARK-12072
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 1.5.2
Reporter: Rares Mirica

When a dataframe contains a column with a large number of values in ml_attr, schema evaluation will routinely fail on getting the schema as JSON. This will, in turn, cause a bunch of problems with, e.g., calling UDFs, because calling .columns relies on _parse_datatype_json_string(self._jdf.schema().json()).
[jira] [Created] (SPARK-11352) codegen.GeneratePredicate fails due to unquoted comment
Rares Mirica created SPARK-11352:
------------------------------------

Summary: codegen.GeneratePredicate fails due to unquoted comment
Key: SPARK-11352
URL: https://issues.apache.org/jira/browse/SPARK-11352
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.5.1
Reporter: Rares Mirica

Somehow the code being generated ends up having comments with comment-terminators unquoted, e.g.:

/* ((input[35, StringType] <= text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8) && (text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8 <= input[36, StringType])) */

with emphasis on ... =0.9,*/ ...

This leads to an org.codehaus.commons.compiler.CompileException.