[jira] [Closed] (SPARK-6363) Switch to Scala 2.11 for default build
[ https://issues.apache.org/jira/browse/SPARK-6363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] antonkulaga closed SPARK-6363. -- Resolved > Switch to Scala 2.11 for default build > -- > > Key: SPARK-6363 > URL: https://issues.apache.org/jira/browse/SPARK-6363 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: antonkulaga >Assignee: Josh Rosen >Priority: Minor > Labels: releasenotes > Fix For: 2.0.0 > > > Most libraries have already moved to 2.11, and many are starting to drop 2.10 > support, so it would be better if Spark binaries were built with Scala > 2.11 by default. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28547) Make it work for wide (> 10K columns data)
[ https://issues.apache.org/jira/browse/SPARK-28547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949230#comment-16949230 ] antonkulaga commented on SPARK-28547: - >I bet there is room for improvement, but, ten thousand columns is just >inherently slow given how metadata, query plans, etc are handled. >You'd at least need to help narrow down where the slow down is and why, and >even better if you can propose a class of fix. As it is I'd close this. [~srowen] I am not a Spark developer, I am a Spark user, so I cannot say where the bottleneck is. When even super-simple tasks, such as describe or a simple transformation (like taking the log of each gene expression value), fail, I report it as a performance problem. As I am a bioinformatician and most of my work deals with gene expressions (thousands of samples * tens of thousands of genes), this makes Spark unusable for most of my use cases. If operations that take seconds in a pandas dataframe (without any Spark involved) take many hours or freeze in a Spark dataframe, there is something inherently wrong with how Spark dataframes handle the data, and something you should investigate for Spark 3.0. If you want to narrow it down, can it be "make dataframe.describe work for a 15K * 15K dataframe and take less than 20 minutes to complete"? > Make it work for wide (> 10K columns data) > -- > > Key: SPARK-28547 > URL: https://issues.apache.org/jira/browse/SPARK-28547 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 > Environment: Ubuntu server, Spark 2.4.3 Scala with >64GB RAM per > node, 32 cores (tried different configurations of executors) >Reporter: antonkulaga >Priority: Critical > > Spark is super-slow for all wide data (when there are >15k columns and >15k > rows). Most genomics/transcriptomics data is wide because the number of > genes is usually >20k, and the number of samples as well. The very popular GTEX > dataset is a good example (see for instance the RNA-Seq data at > https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data, where gct is > just a .tsv file with two comment lines at the beginning). Everything done on wide > tables (even a simple "describe" applied to all the gene columns) > either takes hours or freezes (because of lost executors) irrespective of > memory and number of cores, while the same operations work fast (minutes) > with pure pandas (without any Spark involved).
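The slowdown the reporter describes can be sketched without the GTEX data itself. The following is a scaled-down, self-contained reproduction using a synthetic wide DataFrame of random doubles; `numCols`/`numRows` are placeholder values (push `numCols` towards 15000 to approximate the reported case), and the object/column names are illustrative, not from the ticket:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

import scala.util.Random

object WideDescribeRepro extends App {
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("wide-describe-repro")
    .getOrCreate()

  // Scaled-down stand-in for a genes-as-columns expression matrix.
  val numCols = 2000
  val numRows = 1000

  val schema = StructType((0 until numCols).map(i => StructField(s"gene_$i", DoubleType)))
  val rows = spark.sparkContext.parallelize(
    Seq.fill(numRows)(Row.fromSeq(Seq.fill(numCols)(Random.nextDouble()))))
  val df = spark.createDataFrame(rows, schema)

  // describe() computes count/mean/stddev/min/max for every column; the
  // analysis, planning, and codegen overhead grows with the column count.
  val t0 = System.nanoTime()
  df.describe().collect()
  println(s"describe over $numCols columns took ${(System.nanoTime() - t0) / 1e9} s")

  spark.stop()
}
```

Timing this sketch at increasing `numCols` would be one way to narrow down where the per-column overhead dominates.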
[jira] [Comment Edited] (SPARK-28547) Make it work for wide (> 10K columns data)
[ https://issues.apache.org/jira/browse/SPARK-28547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948897#comment-16948897 ] antonkulaga edited comment on SPARK-28547 at 10/10/19 7:24 PM: --- [~hyukjin.kwon] what is not clear to you? I think it is really clear that Spark performs miserably (freezing, or taking hours or days to compute even the simplest operations like per-column statistics) whenever the data frame has 10-20K or more columns. I gave the GTEX dataset as an example (though any gene or transcript expression dataset will do to demonstrate it). In many fields (like a big part of bioinformatics) wide data frames are common; right now Spark is totally useless there. was (Author: antonkulaga): [~hyukjin.kwon] what is not clear to you? I think it is really clear that Spark performs miserably (freezing or taking many hours) whenever the data frame has 10-20K or more columns. I gave the GTEX dataset as an example (though any gene or transcript expression dataset will do to demonstrate it). In many fields (like a big part of bioinformatics) wide data frames are common; right now Spark is totally useless there. > Make it work for wide (> 10K columns data) > -- > > Key: SPARK-28547 > URL: https://issues.apache.org/jira/browse/SPARK-28547
[jira] [Commented] (SPARK-28547) Make it work for wide (> 10K columns data)
[ https://issues.apache.org/jira/browse/SPARK-28547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948897#comment-16948897 ] antonkulaga commented on SPARK-28547: - [~hyukjin.kwon] what is not clear to you? I think it is really clear that Spark performs miserably (freezing or taking many hours) whenever the data frame has 10-20K or more columns. I gave the GTEX dataset as an example (though any gene or transcript expression dataset will do to demonstrate it). In many fields (like a big part of bioinformatics) wide data frames are common; right now Spark is totally useless there. > Make it work for wide (> 10K columns data) > -- > > Key: SPARK-28547 > URL: https://issues.apache.org/jira/browse/SPARK-28547
[jira] [Reopened] (SPARK-28547) Make it work for wide (> 10K columns data)
[ https://issues.apache.org/jira/browse/SPARK-28547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] antonkulaga reopened SPARK-28547: - I did not see any solution. > Make it work for wide (> 10K columns data) > -- > > Key: SPARK-28547 > URL: https://issues.apache.org/jira/browse/SPARK-28547 > Affects Versions: 2.4.4, 2.4.3
[jira] [Commented] (SPARK-28547) Make it work for wide (> 10K columns data)
[ https://issues.apache.org/jira/browse/SPARK-28547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16897194#comment-16897194 ] antonkulaga commented on SPARK-28547: - [~maropu] I think I was quite clear: even describe is slow as hell, so the easiest way to reproduce is just to run describe on all numeric columns in GTEX. > Make it work for wide (> 10K columns data) > -- > > Key: SPARK-28547 > URL: https://issues.apache.org/jira/browse/SPARK-28547
[jira] [Updated] (SPARK-28547) Make it work for wide (> 10K columns data)
[ https://issues.apache.org/jira/browse/SPARK-28547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] antonkulaga updated SPARK-28547: Description: Spark is super-slow for all wide data (when there are >15k columns and >15k rows). Most genomics/transcriptomics data is wide because the number of genes is usually >20k, and the number of samples as well. The very popular GTEX dataset is a good example (see for instance the RNA-Seq data at https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data, where gct is just a .tsv file with two comment lines at the beginning). Everything done on wide tables (even a simple "describe" applied to all the gene columns) either takes hours or freezes (because of lost executors) irrespective of memory and number of cores, while the same operations work well with pure pandas (without any Spark involved). was: Spark is super-slow for all wide data (when there are >15k columns and >15k rows). Most genomics/transcriptomics data is wide because the number of genes is usually >20k, and the number of samples as well. The very popular GTEX dataset is a good example (see for instance the RNA-Seq data at https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data, where gct is just a .tsv file with two comment lines at the beginning). Everything done on wide tables either takes hours or freezes (because of lost executors) irrespective of memory and number of cores, while the same operations work well with pure pandas (without any Spark involved). > Make it work for wide (> 10K columns data) > -- > > Key: SPARK-28547 > URL: https://issues.apache.org/jira/browse/SPARK-28547
[jira] [Updated] (SPARK-28547) Make it work for wide (> 10K columns data)
[ https://issues.apache.org/jira/browse/SPARK-28547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] antonkulaga updated SPARK-28547: Description: Spark is super-slow for all wide data (when there are >15k columns and >15k rows). Most genomics/transcriptomics data is wide because the number of genes is usually >20k, and the number of samples as well. The very popular GTEX dataset is a good example (see for instance the RNA-Seq data at https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data, where gct is just a .tsv file with two comment lines at the beginning). Everything done on wide tables (even a simple "describe" applied to all the gene columns) either takes hours or freezes (because of lost executors) irrespective of memory and number of cores, while the same operations work fast (minutes) with pure pandas (without any Spark involved). was: Spark is super-slow for all wide data (when there are >15k columns and >15k rows). Most genomics/transcriptomics data is wide because the number of genes is usually >20k, and the number of samples as well. The very popular GTEX dataset is a good example (see for instance the RNA-Seq data at https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data, where gct is just a .tsv file with two comment lines at the beginning). Everything done on wide tables (even a simple "describe" applied to all the gene columns) either takes hours or freezes (because of lost executors) irrespective of memory and number of cores, while the same operations work well with pure pandas (without any Spark involved). > Make it work for wide (> 10K columns data) > -- > > Key: SPARK-28547 > URL: https://issues.apache.org/jira/browse/SPARK-28547
[jira] [Created] (SPARK-28547) Make it work for wide (> 10K columns data)
antonkulaga created SPARK-28547: --- Summary: Make it work for wide (> 10K columns data) Key: SPARK-28547 URL: https://issues.apache.org/jira/browse/SPARK-28547 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.4.3, 2.4.4 Environment: Ubuntu server, Spark 2.4.3 Scala with >64GB RAM per node, 32 cores (tried different configurations of executors) Reporter: antonkulaga Spark is super-slow for all wide data (when there are >15k columns and >15k rows). Most genomics/transcriptomics data is wide because the number of genes is usually >20k, and the number of samples as well. The very popular GTEX dataset is a good example (see for instance the RNA-Seq data at https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data, where gct is just a .tsv file with two comment lines at the beginning). Everything done on wide tables either takes hours or freezes (because of lost executors) irrespective of memory and number of cores, while the same operations work well with pure pandas (without any Spark involved).
[jira] [Issue Comment Deleted] (SPARK-14220) Build and test Spark against Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-14220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] antonkulaga updated SPARK-14220: Comment: was deleted (was: I suggest using Spark 2.4.1, since Scala 2.12 is no longer experimental there) > Build and test Spark against Scala 2.12 > --- > > Key: SPARK-14220 > URL: https://issues.apache.org/jira/browse/SPARK-14220 > Project: Spark > Issue Type: Umbrella > Components: Build, Project Infra >Affects Versions: 2.1.0 >Reporter: Josh Rosen >Assignee: Sean Owen >Priority: Blocker > Labels: release-notes > Fix For: 2.4.0 > > > This umbrella JIRA tracks the requirements for building and testing Spark > against the current Scala 2.12 milestone.
[jira] [Commented] (SPARK-14220) Build and test Spark against Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-14220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16811726#comment-16811726 ] antonkulaga commented on SPARK-14220: - I suggest using Spark 2.4.1, since Scala 2.12 is no longer experimental there. > Build and test Spark against Scala 2.12 > --- > > Key: SPARK-14220 > URL: https://issues.apache.org/jira/browse/SPARK-14220
[jira] [Commented] (SPARK-25588) SchemaParseException: Can't redefine: list when reading from Parquet
[ https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16674066#comment-16674066 ] antonkulaga commented on SPARK-25588: - Any updates on this? This bug blocks the ADAM library and hence most bioinformaticians using Spark. > SchemaParseException: Can't redefine: list when reading from Parquet > > > Key: SPARK-25588 > URL: https://issues.apache.org/jira/browse/SPARK-25588 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2, 2.4.0 > Environment: Spark version 2.3.2 >Reporter: Michael Heuer >Priority: Major > > In ADAM, a library downstream of Spark, we use Avro to define a schema, > generate Java classes from the Avro schema using the avro-maven-plugin, and > generate Scala Products from the Avro schema using our own code generation > library. > In the code path demonstrated by the following unit test, we write out to > Parquet and read back in using an RDD of Avro-generated Java classes, and then > write out to Parquet and read back in using a Dataset of Avro-generated Scala > Products.
> {code:scala} > sparkTest("transform reads to variant rdd") { > val reads = sc.loadAlignments(testFile("small.sam")) > def checkSave(variants: VariantRDD) { > val tempPath = tmpLocation(".adam") > variants.saveAsParquet(tempPath) > assert(sc.loadVariants(tempPath).rdd.count === 20) > } > val variants: VariantRDD = reads.transmute[Variant, VariantProduct, > VariantRDD]( > (rdd: RDD[AlignmentRecord]) => { > rdd.map(AlignmentRecordRDDSuite.varFn) > }) > checkSave(variants) > val sqlContext = SQLContext.getOrCreate(sc) > import sqlContext.implicits._ > val variantsDs: VariantRDD = reads.transmuteDataset[Variant, > VariantProduct, VariantRDD]( > (ds: Dataset[AlignmentRecordProduct]) => { > ds.map(r => { > VariantProduct.fromAvro( > AlignmentRecordRDDSuite.varFn(r.toAvro)) > }) > }) > checkSave(variantsDs) > } > {code} > https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/test/scala/org/bdgenomics/adam/rdd/read/AlignmentRecordRDDSuite.scala#L1540 > Note the schema in Parquet are different: > RDD code path > {noformat} > $ parquet-tools schema > /var/folders/m6/4yqn_4q129lbth_dq3qzj_8hgn/T/TempSuite3400691035694870641.adam/part-r-0.gz.parquet > message org.bdgenomics.formats.avro.Variant { > optional binary contigName (UTF8); > optional int64 start; > optional int64 end; > required group names (LIST) { > repeated binary array (UTF8); > } > optional boolean splitFromMultiAllelic; > optional binary referenceAllele (UTF8); > optional binary alternateAllele (UTF8); > optional double quality; > optional boolean filtersApplied; > optional boolean filtersPassed; > required group filtersFailed (LIST) { > repeated binary array (UTF8); > } > optional group annotation { > optional binary ancestralAllele (UTF8); > optional int32 alleleCount; > optional int32 readDepth; > optional int32 forwardReadDepth; > optional int32 reverseReadDepth; > optional int32 referenceReadDepth; > optional int32 referenceForwardReadDepth; > optional int32 referenceReverseReadDepth; > 
optional float alleleFrequency; > optional binary cigar (UTF8); > optional boolean dbSnp; > optional boolean hapMap2; > optional boolean hapMap3; > optional boolean validated; > optional boolean thousandGenomes; > optional boolean somatic; > required group transcriptEffects (LIST) { > repeated group array { > optional binary alternateAllele (UTF8); > required group effects (LIST) { > repeated binary array (UTF8); > } > optional binary geneName (UTF8); > optional binary geneId (UTF8); > optional binary featureType (UTF8); > optional binary featureId (UTF8); > optional binary biotype (UTF8); > optional int32 rank; > optional int32 total; > optional binary genomicHgvs (UTF8); > optional binary transcriptHgvs (UTF8); > optional binary proteinHgvs (UTF8); > optional int32 cdnaPosition; > optional int32 cdnaLength; > optional int32 cdsPosition; > optional int32 cdsLength; > optional int32 proteinPosition; > optional int32 proteinLength; > optional int32 distance; > required group messages (LIST) { > repeated binary array (ENUM); > } > } > } > required group attributes (MAP) { > repeated group map (MAP_KEY_VALUE) { > required binary key (UTF8); > required binary value (UTF8); >
[jira] [Created] (SPARK-25198) org.apache.spark.sql.catalyst.parser.ParseException: DataType json is not supported.
antonkulaga created SPARK-25198: --- Summary: org.apache.spark.sql.catalyst.parser.ParseException: DataType json is not supported. Key: SPARK-25198 URL: https://issues.apache.org/jira/browse/SPARK-25198 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.1 Environment: Ubuntu 18.04, Spark 2.3.1, org.postgresql:postgresql:42.2.4 Reporter: antonkulaga Whenever I try to save a dataframe that contains a column with a JSON string to the latest Postgres, I get org.apache.spark.sql.catalyst.parser.ParseException: DataType json is not supported. Since Postgres supports JSON well and I use the latest postgresql client, I expect it to work. Here is an example of the code that crashes: val columnTypes = """id integer, parameters json, title text, gsm text, gse text, organism text, characteristics text, molecule text, model text, description text, treatment_protocol text, extract_protocol text, source_name text, data_processing text, submission_date text, last_update_date text, status text, type text, contact text, gpl text""" myDataframe.write.format("jdbc").option("url", "jdbc:postgresql://db/sequencing").option("customSchema", columnTypes).option("dbtable", "test").option("user", "postgres").option("password", "changeme").save()
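Spark's DDL parser has no json type, which is what the ParseException points at. A commonly suggested workaround (a sketch, not verified against this exact setup) is to keep the column as a plain string on the Spark side and add the PostgreSQL JDBC driver's stringtype=unspecified connection parameter, so that string parameters are sent with an unspecified type and the server coerces them into the json column of a pre-created table:

```scala
// Sketch of a possible workaround. Assumptions: the target table "test" was
// created in Postgres beforehand with a "parameters json" column, and
// myDataframe holds the JSON as an ordinary string column.
myDataframe.write
  .format("jdbc")
  // stringtype=unspecified lets Postgres cast string parameters to json.
  .option("url", "jdbc:postgresql://db/sequencing?stringtype=unspecified")
  .option("dbtable", "test")
  .option("user", "postgres")
  .option("password", "changeme")
  .mode("append") // append into the pre-created table instead of letting Spark emit DDL
  .save()
```

The key design point is to bypass Spark's type parser entirely: Spark only ever sees a string, and the json interpretation happens inside Postgres.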
[jira] [Commented] (SPARK-16406) Reference resolution for large number of columns should be faster
[ https://issues.apache.org/jira/browse/SPARK-16406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16580444#comment-16580444 ] antonkulaga commented on SPARK-16406: - Are you going to backport it to 2.3.2 as well? > Reference resolution for large number of columns should be faster > - > > Key: SPARK-16406 > URL: https://issues.apache.org/jira/browse/SPARK-16406 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Herman van Hovell >Assignee: Herman van Hovell >Priority: Major > Fix For: 2.4.0 > > > Resolving columns in a LogicalPlan on average takes n / 2 (n being the number > of columns). This gets problematic as soon as you try to resolve a large > number of columns (m) on a large table: O(m * n / 2) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
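The O(m * n / 2) cost described in SPARK-16406 can be illustrated outside of Spark with a toy sketch (this is not Spark's actual resolution code; names and sizes are illustrative): resolving m column references by linear scan over n attributes versus a one-time name-to-attribute map.

```scala
// Toy illustration of the complexity argument: resolving m names against
// n attributes by linear scan is O(m * n); a prebuilt index is O(m + n).
object ResolutionCost extends App {
  val n = 20000
  val attributes: Seq[String] = (0 until n).map(i => s"gene_$i")
  val toResolve: Seq[String] = attributes // resolve every column, so m == n

  // Linear scan, as in the slow path: ~n/2 comparisons per lookup on average.
  def resolveLinear(name: String): Option[String] = attributes.find(_ == name)

  // Indexed lookup, as in the batched/hashed fix: near-constant time per lookup.
  val index: Map[String, String] = attributes.map(a => a -> a).toMap
  def resolveIndexed(name: String): Option[String] = index.get(name)

  def millis[A](body: => A): Double = {
    val t0 = System.nanoTime(); body; (System.nanoTime() - t0) / 1e6
  }

  println(f"linear scan: ${millis(toResolve.foreach(resolveLinear))}%.1f ms")
  println(f"indexed:     ${millis(toResolve.foreach(resolveIndexed))}%.1f ms")
}
```

At 20K columns the linear variant performs on the order of 2 * 10^8 comparisons, which is why wide schemas made analysis so slow before the fix.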
[jira] [Commented] (SPARK-4820) Spark build encounters "File name too long" on some encrypted filesystems
[ https://issues.apache.org/jira/browse/SPARK-4820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16100966#comment-16100966 ] antonkulaga commented on SPARK-4820: This issue is valid for Spark 2.2.0 on Ubuntu 16.04 and it is a BLOCKER! I am blocked in some projects because I cannot overcome this stupid error > Spark build encounters "File name too long" on some encrypted filesystems > - > > Key: SPARK-4820 > URL: https://issues.apache.org/jira/browse/SPARK-4820 > Project: Spark > Issue Type: Improvement > Components: Documentation >Reporter: Patrick Wendell >Assignee: Theodore Vasiloudis >Priority: Minor > Fix For: 1.4.0 > > > This was reported by Luchesar Cekov on github along with a proposed fix. The > fix has some potential downstream issues (it will modify the classnames) so > until we understand better how many users are affected we aren't going to > merge it. However, I'd like to include the issue and workaround here. If you > encounter this issue please comment on the JIRA so we can assess the > frequency. > The issue produces this error: > {code} > [error] == Expanded type of tree == > [error] > [error] ConstantType(value = Constant(Throwable)) > [error] > [error] uncaught exception during compilation: java.io.IOException > [error] File name too long > [error] two errors found > {code} > The workaround is in maven under the compile options add: > {code} > + -Xmax-classfile-name > + 128 > {code} > In SBT add: > {code} > +scalacOptions in Compile ++= Seq("-Xmax-classfile-name", "128"), > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21531) CLONE - Spark build encounters "File name too long" on some encrypted filesystems
[ https://issues.apache.org/jira/browse/SPARK-21531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] antonkulaga updated SPARK-21531: Description: Originally this issue was discovered in Spark 1.x and fixed there, but it is still valid in 2.x. This was reported by Luchesar Cekov on github along with a proposed fix. The fix has some potential downstream issues (it will modify the classnames) so until we understand better how many users are affected we aren't going to merge it. However, I'd like to include the issue and workaround here. If you encounter this issue please comment on the JIRA so we can assess the frequency. The issue produces this error: {code} [error] == Expanded type of tree == [error] [error] ConstantType(value = Constant(Throwable)) [error] [error] uncaught exception during compilation: java.io.IOException [error] File name too long [error] two errors found {code} The workaround is in maven under the compile options add: {code} + -Xmax-classfile-name + 128 {code} In SBT add: {code} +scalacOptions in Compile ++= Seq("-Xmax-classfile-name", "128"), {code} was: This was reported by Luchesar Cekov on github along with a proposed fix. The fix has some potential downstream issues (it will modify the classnames) so until we understand better how many users are affected we aren't going to merge it. However, I'd like to include the issue and workaround here. If you encounter this issue please comment on the JIRA so we can assess the frequency.
The issue produces this error: {code} [error] == Expanded type of tree == [error] [error] ConstantType(value = Constant(Throwable)) [error] [error] uncaught exception during compilation: java.io.IOException [error] File name too long [error] two errors found {code} The workaround is in maven under the compile options add: {code} + -Xmax-classfile-name + 128 {code} In SBT add: {code} +scalacOptions in Compile ++= Seq("-Xmax-classfile-name", "128"), {code} > CLONE - Spark build encounters "File name too long" on some encrypted > filesystems > - > > Key: SPARK-21531 > URL: https://issues.apache.org/jira/browse/SPARK-21531 > Project: Spark > Issue Type: Improvement > Components: Documentation >Reporter: antonkulaga >Assignee: Theodore Vasiloudis > Fix For: 1.4.0 > > > Originally this issue was discovered in Spark 1.x and fixed there, but it > is still valid in 2.x. > This was reported by Luchesar Cekov on github along with a proposed fix. The > fix has some potential downstream issues (it will modify the classnames) so > until we understand better how many users are affected we aren't going to > merge it. However, I'd like to include the issue and workaround here. If you > encounter this issue please comment on the JIRA so we can assess the > frequency. > The issue produces this error: > {code} > [error] == Expanded type of tree == > [error] > [error] ConstantType(value = Constant(Throwable)) > [error] > [error] uncaught exception during compilation: java.io.IOException > [error] File name too long > [error] two errors found > {code} > The workaround is in maven under the compile options add: > {code} > + -Xmax-classfile-name > + 128 > {code} > In SBT add: > {code} > +scalacOptions in Compile ++= Seq("-Xmax-classfile-name", "128"), > {code}
[jira] [Updated] (SPARK-21531) CLONE - Spark build encounters "File name too long" on some encrypted filesystems
[ https://issues.apache.org/jira/browse/SPARK-21531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] antonkulaga updated SPARK-21531: Description: Originally this issue was discovered in Spark 1.x and fixed there, but it is still valid in 2.x. ORIGINAL description: This was reported by Luchesar Cekov on GitHub along with a proposed fix. The fix has some potential downstream issues (it will modify the class names), so until we understand better how many users are affected we aren't going to merge it. However, I'd like to include the issue and workaround here. If you encounter this issue, please comment on the JIRA so we can assess the frequency. The issue produces this error: {code} [error] == Expanded type of tree == [error] [error] ConstantType(value = Constant(Throwable)) [error] [error] uncaught exception during compilation: java.io.IOException [error] File name too long [error] two errors found {code} The workaround: in Maven, under the compile options, add: {code} -Xmax-classfile-name 128 {code} In SBT, add: {code} scalacOptions in Compile ++= Seq("-Xmax-classfile-name", "128"), {code} was: Originally this issue was discovered in Spark 1.x and fixed there, but it is still valid in 2.x. This was reported by Luchesar Cekov on GitHub along with a proposed fix. The fix has some potential downstream issues (it will modify the class names), so until we understand better how many users are affected we aren't going to merge it. However, I'd like to include the issue and workaround here. If you encounter this issue, please comment on the JIRA so we can assess the frequency. 
The issue produces this error: {code} [error] == Expanded type of tree == [error] [error] ConstantType(value = Constant(Throwable)) [error] [error] uncaught exception during compilation: java.io.IOException [error] File name too long [error] two errors found {code} The workaround: in Maven, under the compile options, add: {code} -Xmax-classfile-name 128 {code} In SBT, add: {code} scalacOptions in Compile ++= Seq("-Xmax-classfile-name", "128"), {code} > CLONE - Spark build encounters "File name too long" on some encrypted > filesystems > - > > Key: SPARK-21531 > URL: https://issues.apache.org/jira/browse/SPARK-21531 > Project: Spark > Issue Type: Improvement > Components: Documentation >Reporter: antonkulaga >Assignee: Theodore Vasiloudis > Fix For: 1.4.0 > > > Originally this issue was discovered in Spark 1.x and fixed there, but it > is still valid in 2.x. > ORIGINAL description: > This was reported by Luchesar Cekov on GitHub along with a proposed fix. The > fix has some potential downstream issues (it will modify the class names), so > until we understand better how many users are affected we aren't going to > merge it. However, I'd like to include the issue and workaround here. If you > encounter this issue, please comment on the JIRA so we can assess the > frequency. > The issue produces this error: > {code} > [error] == Expanded type of tree == > [error] > [error] ConstantType(value = Constant(Throwable)) > [error] > [error] uncaught exception during compilation: java.io.IOException > [error] File name too long > [error] two errors found > {code} > The workaround: in Maven, under the compile options, add: > {code} > -Xmax-classfile-name > 128 > {code} > In SBT, add: > {code} > scalacOptions in Compile ++= Seq("-Xmax-classfile-name", "128"), > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21531) CLONE - Spark build encounters "File name too long" on some encrypted filesystems
[ https://issues.apache.org/jira/browse/SPARK-21531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] antonkulaga updated SPARK-21531: Priority: Major (was: Minor) > CLONE - Spark build encounters "File name too long" on some encrypted > filesystems > - > > Key: SPARK-21531 > URL: https://issues.apache.org/jira/browse/SPARK-21531 > Project: Spark > Issue Type: Improvement > Components: Documentation >Reporter: antonkulaga >Assignee: Theodore Vasiloudis > Fix For: 1.4.0 > > > This was reported by Luchesar Cekov on GitHub along with a proposed fix. The > fix has some potential downstream issues (it will modify the class names), so > until we understand better how many users are affected we aren't going to > merge it. However, I'd like to include the issue and workaround here. If you > encounter this issue, please comment on the JIRA so we can assess the > frequency. > The issue produces this error: > {code} > [error] == Expanded type of tree == > [error] > [error] ConstantType(value = Constant(Throwable)) > [error] > [error] uncaught exception during compilation: java.io.IOException > [error] File name too long > [error] two errors found > {code} > The workaround: in Maven, under the compile options, add: > {code} > -Xmax-classfile-name > 128 > {code} > In SBT, add: > {code} > scalacOptions in Compile ++= Seq("-Xmax-classfile-name", "128"), > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21531) CLONE - Spark build encounters "File name too long" on some encrypted filesystems
antonkulaga created SPARK-21531: --- Summary: CLONE - Spark build encounters "File name too long" on some encrypted filesystems Key: SPARK-21531 URL: https://issues.apache.org/jira/browse/SPARK-21531 Project: Spark Issue Type: Improvement Components: Documentation Reporter: antonkulaga Assignee: Theodore Vasiloudis Priority: Minor Fix For: 1.4.0 This was reported by Luchesar Cekov on GitHub along with a proposed fix. The fix has some potential downstream issues (it will modify the class names), so until we understand better how many users are affected we aren't going to merge it. However, I'd like to include the issue and workaround here. If you encounter this issue, please comment on the JIRA so we can assess the frequency. The issue produces this error: {code} [error] == Expanded type of tree == [error] [error] ConstantType(value = Constant(Throwable)) [error] [error] uncaught exception during compilation: java.io.IOException [error] File name too long [error] two errors found {code} The workaround: in Maven, under the compile options, add: {code} -Xmax-classfile-name 128 {code} In SBT, add: {code} scalacOptions in Compile ++= Seq("-Xmax-classfile-name", "128"), {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
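The Maven fragment in the issue appears to be diff output with the surrounding XML stripped; here is a sketch of where the flag would likely live in a pom.xml, assuming the commonly used scala-maven-plugin (the plugin coordinates are an assumption, not from the issue):

```xml
<!-- pom.xml (sketch): pass -Xmax-classfile-name to scalac via the compiler args -->
<plugin>
  <groupId>net.alchim31.maven</groupId>
  <artifactId>scala-maven-plugin</artifactId>
  <configuration>
    <args>
      <arg>-Xmax-classfile-name</arg>
      <arg>128</arg>
    </args>
  </configuration>
</plugin>
```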
[jira] [Commented] (SPARK-14220) Build and test Spark against Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-14220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15937923#comment-15937923 ] antonkulaga commented on SPARK-14220: - Any progress on this? > Build and test Spark against Scala 2.12 > --- > > Key: SPARK-14220 > URL: https://issues.apache.org/jira/browse/SPARK-14220 > Project: Spark > Issue Type: Umbrella > Components: Build, Project Infra >Affects Versions: 2.1.0 >Reporter: Josh Rosen >Priority: Blocker > > This umbrella JIRA tracks the requirements for building and testing Spark > against the current Scala 2.12 milestone. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6363) make scala 2.11 default language
[ https://issues.apache.org/jira/browse/SPARK-6363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14368846#comment-14368846 ] antonkulaga commented on SPARK-6363: > is already cross-built for 2.10 and 2.11, and published separately for both I mean the Spark downloads, where only 2.10 builds are provided and users are asked to build 2.11 from source. I think 2.11 should be the default there. > make scala 2.11 default language > > > Key: SPARK-6363 > URL: https://issues.apache.org/jira/browse/SPARK-6363 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: antonkulaga >Priority: Minor > Labels: scala > > Most libraries have already moved to 2.11, and many are starting to drop 2.10 > support, so it would be better if the Spark binaries were built with Scala > 2.11 by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6363) make scala 2.11 default language
antonkulaga created SPARK-6363: -- Summary: make scala 2.11 default language Key: SPARK-6363 URL: https://issues.apache.org/jira/browse/SPARK-6363 Project: Spark Issue Type: Improvement Components: Build Reporter: antonkulaga Priority: Minor Most libraries have already moved to 2.11, and many are starting to drop 2.10 support, so it would be better if the Spark binaries were built with Scala 2.11 by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
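For library authors caught between the two binary-incompatible Scala versions during such a transition, the usual mitigation is cross-building in sbt; a sketch under assumed version numbers (not something proposed in the issue):

```scala
// build.sbt (sketch): publish artifacts for both Scala binary versions
// while 2.10 remains the Spark default; version numbers are illustrative.
scalaVersion := "2.11.8"
crossScalaVersions := Seq("2.10.6", "2.11.8")
// `sbt +publishLocal` then builds and publishes once per listed version.
```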