[jira] [Created] (SPARK-20332) Avro/Parquet GenericFixed decimal is not read into Spark correctly
Justin Pihony created SPARK-20332:
-------------------------------------

             Summary: Avro/Parquet GenericFixed decimal is not read into Spark correctly
                 Key: SPARK-20332
                 URL: https://issues.apache.org/jira/browse/SPARK-20332
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.6.0
            Reporter: Justin Pihony
            Priority: Minor

Take the following code:

{code}
spark-shell --packages org.apache.avro:avro:1.8.1

import org.apache.avro.{Conversions, LogicalTypes, Schema}
import java.math.BigDecimal

val dc = new Conversions.DecimalConversion()
val javaBD = BigDecimal.valueOf(643.85924958)
val schema = Schema.parse("{\"type\":\"record\",\"name\":\"Header\",\"namespace\":\"org.apache.avro.file\",\"fields\":[" +
  "{\"name\":\"COLUMN\",\"type\":[\"null\",{\"type\":\"fixed\",\"name\":\"COLUMN\"," +
  "\"size\":19,\"precision\":17,\"scale\":8,\"logicalType\":\"decimal\"}]}]}")
val schemaDec = schema.getField("COLUMN").schema()
val fieldSchema = if (schemaDec.getType() == Schema.Type.UNION) schemaDec.getTypes.get(1) else schemaDec
val converted = dc.toFixed(javaBD, fieldSchema, LogicalTypes.decimal(javaBD.precision, javaBD.scale))
sqlContext.createDataFrame(List(("value", converted)))
{code}

and you'll get this error:

bq. java.lang.UnsupportedOperationException: Schema for type org.apache.avro.generic.GenericFixed is not supported

However, if you write out a Parquet file using the AvroParquetWriter and the above GenericFixed value (converted), and then read it back in via the DataFrameReader, the decimal value that is retrieved is not accurate (i.e. 643... above comes back as -0.5...).

Even if this type is not supported, is there any way to at least have the file-read path throw an UnsupportedOperationException, as it does when you try to create the DataFrame directly?

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
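The mis-read value is consistent with a reader mishandling the fixed field's two's-complement encoding. As a hedged illustration (plain Java, not Spark's or Avro's actual code; `DecimalFixedSketch` and its methods are hypothetical names), the Avro spec stores a fixed-backed decimal as the unscaled value in big-endian two's-complement form, sign-extended to the fixed size, so a correct decode must honor the sign bit across all 19 bytes:

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.util.Arrays;

public class DecimalFixedSketch {
    // Sign-extend the unscaled value into a fixed-size, big-endian,
    // two's-complement buffer, as the Avro spec describes for
    // decimal-backed fixed types (assumes the unscaled value fits in `size` bytes).
    public static byte[] toFixedBytes(BigDecimal value, int size) {
        byte[] unscaled = value.unscaledValue().toByteArray();
        byte[] out = new byte[size];
        byte pad = (byte) (value.signum() < 0 ? 0xFF : 0x00); // sign extension
        Arrays.fill(out, pad);
        System.arraycopy(unscaled, 0, out, size - unscaled.length, unscaled.length);
        return out;
    }

    // The inverse: interpret all bytes as one signed big-endian integer,
    // then re-apply the scale.
    public static BigDecimal fromFixedBytes(byte[] bytes, int scale) {
        return new BigDecimal(new BigInteger(bytes), scale);
    }

    public static void main(String[] args) {
        BigDecimal original = new BigDecimal("643.85924958"); // scale 8, as in the repro
        byte[] fixed = toFixedBytes(original, 19);            // size 19, as in the schema
        System.out.println(fromFixedBytes(fixed, 8));         // prints 643.85924958
    }
}
```

A reader that instead treats the buffer as unsigned, or decodes from the wrong offset within the 19 bytes, would produce values with the wrong sign or magnitude, which matches the -0.5... symptom described above.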
[jira] [Commented] (SPARK-14525) DataFrameWriter's save method should delegate to jdbc for jdbc datasource
[ https://issues.apache.org/jira/browse/SPARK-14525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15366259#comment-15366259 ]

Justin Pihony commented on SPARK-14525:
---------------------------------------

[~rxin] Given the bug found in SPARK-16401, the CreatableRelationProvider is not strictly necessary. However, it might be nice to have now that I've already implemented it. I can reduce the code by removing the CreatableRelationProvider aspect, so I would love your feedback on my PR.

> DataFrameWriter's save method should delegate to jdbc for jdbc datasource
> -------------------------------------------------------------------------
>
>                 Key: SPARK-14525
>                 URL: https://issues.apache.org/jira/browse/SPARK-14525
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.6.1
>            Reporter: Justin Pihony
>            Priority: Minor
>
> If you call {code}df.write.format("jdbc")...save(){code} then you get an error:
> bq. org.apache.spark.sql.execution.datasources.jdbc.DefaultSource does not allow create table as select
> save is a more intuitive guess at the appropriate method to call, so the user should not be punished for not knowing about the jdbc method.
> Obviously, this will require the caller to have set up the correct parameters for jdbc to work :)
[jira] [Commented] (SPARK-14525) DataFrameWriter's save method should delegate to jdbc for jdbc datasource
[ https://issues.apache.org/jira/browse/SPARK-14525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15317846#comment-15317846 ]

Justin Pihony commented on SPARK-14525:
---------------------------------------

[~rxin] I have pushed my changes, which now include implementing the CreatableRelationProvider.
[jira] [Commented] (SPARK-14525) DataFrameWriter's save method should delegate to jdbc for jdbc datasource
[ https://issues.apache.org/jira/browse/SPARK-14525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15255136#comment-15255136 ]

Justin Pihony commented on SPARK-14525:
---------------------------------------

If I am to update the jdbc.DefaultSource to be a CreatableRelationProvider, then that also means I have to make it a SchemaRelationProvider. This would require a change to the JDBCRelation class so that it can optionally accept a user-specified schema. This is all possible, and I see it as a change from:

{code}
override val schema: StructType = JDBCRDD.resolveTable(url, table, properties)
{code}

to:

{code}
override val schema: StructType = {
  val resolvedSchema = JDBCRDD.resolveTable(url, table, properties)
  providedSchemaOption match {
    case Some(providedSchema) =>
      if (providedSchema == resolvedSchema) resolvedSchema
      else sys.error("User-specified schema does not match the actual schema")
    case None => resolvedSchema
  }
}
{code}

Or, do the checking on initialization, which would not be lazy. Thoughts/preferences? Should I just skip making it a CreatableRelationProvider if none of the above work?
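The lazy option above can be illustrated with generic Java (not Spark's API; `memoize` and the resolver lambda are hypothetical stand-ins for `JDBCRDD.resolveTable`). The expensive resolution runs only on first access, so a mismatching user-specified schema would only fail once the schema is actually needed:

```java
import java.util.function.Supplier;

public class LazySchemaSketch {
    // Hypothetical helper: cache the result of an expensive computation
    // so it runs at most once, on first access.
    public static <T> Supplier<T> memoize(Supplier<T> inner) {
        return new Supplier<T>() {
            private T value;
            private boolean computed;
            @Override public synchronized T get() {
                if (!computed) { value = inner.get(); computed = true; }
                return value;
            }
        };
    }

    public static void main(String[] args) {
        final int[] resolutions = {0};
        // Stand-in for JDBCRDD.resolveTable(url, table, properties):
        Supplier<String> resolvedSchema = memoize(() -> {
            resolutions[0]++;
            return "id INT, name VARCHAR"; // hypothetical resolved schema
        });
        System.out.println(resolutions[0]); // prints 0: nothing resolved yet (lazy)
        resolvedSchema.get();
        resolvedSchema.get();
        System.out.println(resolutions[0]); // prints 1: resolved exactly once
    }
}
```

In Scala this is simply `lazy val schema = ...`. Eager checking would instead call the resolver in the constructor, surfacing a mismatch immediately but paying the database round-trip even when the schema is never used.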
[jira] [Commented] (SPARK-14525) DataFrameWriter's save method should delegate to jdbc for jdbc datasource
[ https://issues.apache.org/jira/browse/SPARK-14525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15255130#comment-15255130 ]

Justin Pihony commented on SPARK-14525:
---------------------------------------

To address any concerns about converting from Properties to a {code}Map[String,String]{code}, please refer to this [StackOverflow question|https://stackoverflow.com/questions/873510/why-does-java-util-properties-implement-mapobject-object-and-not-mapstring-st].
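For readers who don't follow the link, the core problem is easy to show in a few lines: Properties extends Hashtable<Object,Object>, so non-String entries slip in through put and are then invisible to getProperty, while a Map<String,String> rules them out at compile time. A small sketch (class name hypothetical):

```java
import java.util.Properties;

public class PropertiesPitfallSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Compiles fine: put(...) is inherited from Hashtable<Object,Object>.
        props.put("timeout", 30);
        // But getProperty(...) only returns String values, so the Integer
        // entry is silently invisible here:
        System.out.println(props.getProperty("timeout")); // prints null
        // With a Map<String,String>, map.put("timeout", 30) would not compile.
    }
}
```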
[jira] [Commented] (SPARK-14525) DataFrameWriter's save method should delegate to jdbc for jdbc datasource
[ https://issues.apache.org/jira/browse/SPARK-14525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252020#comment-15252020 ]

Justin Pihony commented on SPARK-14525:
---------------------------------------

I am actually going to work on this today. I just got busy and was waiting for any further comments. I have what I need now and will push a PR today :)
[jira] [Commented] (SPARK-14525) DataFrameWriter's save method should delegate to jdbc for jdbc datasource
[ https://issues.apache.org/jira/browse/SPARK-14525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15238693#comment-15238693 ]

Justin Pihony commented on SPARK-14525:
---------------------------------------

I don't see why not, since they're just key/values anyway. Here's the code I put together before realizing I didn't like the first option:

{code}
dataSource.providingClass.newInstance() match {
  case x: org.apache.spark.sql.execution.datasources.jdbc.DefaultSource => {
    val url = extraOptions.getOrElse("url",
      sys.error("Saving jdbc source requires url to be set. (i.e. df.option(\"url\", \"ACTUAL_URL\"))"))
    val table = extraOptions.getOrElse("dbtable",
      extraOptions.getOrElse("table",
        sys.error("Saving jdbc source requires dbtable to be set. (i.e. df.option(\"dbtable\", \"ACTUAL_DB_TABLE\"))")))
    // Rely on the impl of jdbc, which puts the user and password into the properties from extraOptions anyway?
    jdbc(url, table, new java.util.Properties)
  }
  case _ => dataSource.write(mode, df)
}
{code}
[jira] [Issue Comment Deleted] (SPARK-14525) DataFrameWriter's save method should delegate to jdbc for jdbc datasource
[ https://issues.apache.org/jira/browse/SPARK-14525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Justin Pihony updated SPARK-14525:
----------------------------------
    Comment: was deleted (was: a verbatim duplicate of comment 15238693)
[jira] [Commented] (SPARK-14525) DataFrameWriter's save method should delegate to jdbc for jdbc datasource
[ https://issues.apache.org/jira/browse/SPARK-14525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15238690#comment-15238690 ]

Justin Pihony commented on SPARK-14525:
---------------------------------------

I don't see why not, since they're just key/values anyway. Here's the code I put together before realizing I didn't like the first option:

{code}
dataSource.providingClass.newInstance() match {
  case x: org.apache.spark.sql.execution.datasources.jdbc.DefaultSource => {
    val url = extraOptions.getOrElse("url",
      sys.error("Saving jdbc source requires url to be set. (i.e. df.option(\"url\", \"ACTUAL_URL\"))"))
    val table = extraOptions.getOrElse("dbtable",
      extraOptions.getOrElse("table",
        sys.error("Saving jdbc source requires dbtable to be set. (i.e. df.option(\"dbtable\", \"ACTUAL_DB_TABLE\"))")))
    // Rely on the impl of jdbc, which puts the user and password into the properties from extraOptions anyway?
    jdbc(url, table, new java.util.Properties)
  }
  case _ => dataSource.write(mode, df)
}
{code}
[jira] [Commented] (SPARK-14525) DataFrameWriter's save method should delegate to jdbc for jdbc datasource
[ https://issues.apache.org/jira/browse/SPARK-14525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15238621#comment-15238621 ]

Justin Pihony commented on SPARK-14525:
---------------------------------------

I don't mind putting together a PR for this; however, I am curious whether there is an opinion on the implementation. I see two options: have the save method redirect to the jdbc method, or move the logic from the jdbc method into the jdbc.DefaultSource, so that the DataFrameWriter does not have to be responsible; jdbc would delegate to save, which would delegate to DataSource.write, which would delegate to a new method in the jdbc.DefaultSource. After mulling over the seemingly unclean choice of having save redirect to jdbc, I am leaning towards the second option. I think it's the better design choice.
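The second option can be sketched with toy types (all names are hypothetical, not Spark's actual interfaces): the writer's save is source-agnostic and simply dispatches through a generic write path, while the JDBC-specific option validation lives inside the JDBC source itself rather than in the writer:

```java
import java.util.Map;

// Hypothetical stand-in for a data source's write-side interface
// (analogous in spirit to CreatableRelationProvider, but not Spark's API).
interface WritableSource {
    void write(Map<String, String> options, String data);
}

// The JDBC-specific option validation lives in the source, not the writer.
class JdbcSource implements WritableSource {
    @Override
    public void write(Map<String, String> options, String data) {
        String url = options.get("url");
        if (url == null) {
            throw new IllegalArgumentException("Saving a jdbc source requires url to be set");
        }
        String table = options.getOrDefault("dbtable", options.get("table"));
        if (table == null) {
            throw new IllegalArgumentException("Saving a jdbc source requires dbtable to be set");
        }
        System.out.println("writing '" + data + "' to " + table + " at " + url);
    }
}

public class DelegationSketch {
    // save() knows nothing about JDBC: it delegates to the source's own write path.
    public static void save(WritableSource source, Map<String, String> options, String data) {
        source.write(options, data);
    }

    public static void main(String[] args) {
        save(new JdbcSource(), Map.of("url", "jdbc:h2:mem:test", "dbtable", "people"), "row1");
    }
}
```

With this shape, adding another writable source never touches the writer, which is the maintainability argument for the second option.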
[jira] [Created] (SPARK-14525) DataFrameWriter's save method should delegate to jdbc for jdbc datasource
Justin Pihony created SPARK-14525:
-------------------------------------

             Summary: DataFrameWriter's save method should delegate to jdbc for jdbc datasource
                 Key: SPARK-14525
                 URL: https://issues.apache.org/jira/browse/SPARK-14525
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 1.6.1
            Reporter: Justin Pihony
            Priority: Minor

If you call {code}df.write.format("jdbc")...save(){code} then you get an error:

bq. org.apache.spark.sql.execution.datasources.jdbc.DefaultSource does not allow create table as select

save is a more intuitive guess at the appropriate method to call, so the user should not be punished for not knowing about the jdbc method.

Obviously, this will require the caller to have set up the correct parameters for jdbc to work :)
[jira] [Updated] (SPARK-13744) Dataframe RDD caching increases the input size for subsequent stages
[ https://issues.apache.org/jira/browse/SPARK-13744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Justin Pihony updated SPARK-13744:
----------------------------------
    Attachment: Screen Shot 2016-03-08 at 10.35.51 AM.png

I am using the Spark UI for the input readings; please see the attached image.

> Dataframe RDD caching increases the input size for subsequent stages
> --------------------------------------------------------------------
>
>                 Key: SPARK-13744
>                 URL: https://issues.apache.org/jira/browse/SPARK-13744
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0
>         Environment: OSX
>            Reporter: Justin Pihony
>            Priority: Minor
>         Attachments: Screen Shot 2016-03-08 at 10.35.51 AM.png
>
> Given the code below, you will see that the first run of count shows up as ~90KB of input, and even the next run, with cache set, results in the same input size. However, every subsequent run results in an input size that is MUCH larger (500MB is listed as 38% for a default run). This size discrepancy seems to be a bug in the caching of a DataFrame's RDD, as far as I can see.
> {code}
> import sqlContext.implicits._
> case class Person(name: String = "Test", number: Double = 1000.2)
> val people = sc.parallelize(1 to 1000, 50).map { p => Person() }.toDF
> people.write.parquet("people.parquet")
> val parquetFile = sqlContext.read.parquet("people.parquet")
> parquetFile.rdd.count()
> parquetFile.rdd.cache()
> parquetFile.rdd.count()
> parquetFile.rdd.count()
> {code}
[jira] [Created] (SPARK-13744) Dataframe RDD caching increases the input size for subsequent stages
Justin Pihony created SPARK-13744:
-------------------------------------

             Summary: Dataframe RDD caching increases the input size for subsequent stages
                 Key: SPARK-13744
                 URL: https://issues.apache.org/jira/browse/SPARK-13744
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.6.0
         Environment: OSX
            Reporter: Justin Pihony
            Priority: Minor

Given the code below, you will see that the first run of count shows up as ~90KB of input, and even the next run, with cache set, results in the same input size. However, every subsequent run results in an input size that is MUCH larger (500MB is listed as 38% for a default run). This size discrepancy seems to be a bug in the caching of a DataFrame's RDD, as far as I can see.

{code}
import sqlContext.implicits._

case class Person(name: String = "Test", number: Double = 1000.2)

val people = sc.parallelize(1 to 1000, 50).map { p => Person() }.toDF
people.write.parquet("people.parquet")

val parquetFile = sqlContext.read.parquet("people.parquet")
parquetFile.rdd.count()
parquetFile.rdd.cache()
parquetFile.rdd.count()
parquetFile.rdd.count()
{code}
[jira] [Created] (SPARK-13127) Upgrade Parquet to 1.9 (Fixes parquet sorting)
Justin Pihony created SPARK-13127:
-------------------------------------

             Summary: Upgrade Parquet to 1.9 (Fixes parquet sorting)
                 Key: SPARK-13127
                 URL: https://issues.apache.org/jira/browse/SPARK-13127
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.6.0
            Reporter: Justin Pihony
            Priority: Minor

Currently, when you write a sorted DataFrame to Parquet, the data read back out is not sorted by default. [This is due to a bug in Parquet|https://issues.apache.org/jira/browse/PARQUET-241] that was fixed in 1.9. There is a workaround: read the file back in using a file glob (filepath/*).