[jira] [Assigned] (SPARK-25259) Left/Right join support push down during-join predicates
[ https://issues.apache.org/jira/browse/SPARK-25259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25259: Assignee: (was: Apache Spark)

Left/Right join support push down during-join predicates

Key: SPARK-25259
URL: https://issues.apache.org/jira/browse/SPARK-25259
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 2.4.0
Reporter: Yuming Wang
Priority: Major

For example:
{code:sql}
create temporary view EMPLOYEE as select * from values
  ("10", "HAAS", "A00"),
  ("10", "THOMPSON", "B01"),
  ("30", "KWAN", "C01"),
  ("000110", "LUCCHESSI", "A00"),
  ("000120", "O'CONNELL", "A))"),
  ("000130", "QUINTANA", "C01")
as EMPLOYEE(EMPNO, LASTNAME, WORKDEPT);

create temporary view DEPARTMENT as select * from values
  ("A00", "SPIFFY COMPUTER SERVICE DIV.", "10"),
  ("B01", "PLANNING", "20"),
  ("C01", "INFORMATION CENTER", "30"),
  ("D01", "DEVELOPMENT CENTER", null)
as DEPARTMENT(DEPTNO, DEPTNAME, MGRNO);

create temporary view PROJECT as select * from values
  ("AD3100", "ADMIN SERVICES", "D01"),
  ("IF1000", "QUERY SERVICES", "C01"),
  ("IF2000", "USER EDUCATION", "E01"),
  ("MA2100", "WELD LINE AUTOMATION", "D01"),
  ("PL2100", "WELD LINE PLANNING", "01")
as PROJECT(PROJNO, PROJNAME, DEPTNO);
{code}
The SQL below:
{code:sql}
SELECT PROJNO, PROJNAME, P.DEPTNO, DEPTNAME
FROM PROJECT P LEFT OUTER JOIN DEPARTMENT D
ON P.DEPTNO = D.DEPTNO
AND P.DEPTNO = 'E01';
{code}
can be optimized to:
{code:sql}
SELECT PROJNO, PROJNAME, P.DEPTNO, DEPTNAME
FROM PROJECT P LEFT OUTER JOIN (SELECT * FROM DEPARTMENT WHERE DEPTNO = 'E01') D
ON P.DEPTNO = D.DEPTNO
AND P.DEPTNO = 'E01';
{code}

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25259) Left/Right join support push down during-join predicates
[ https://issues.apache.org/jira/browse/SPARK-25259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25259: Assignee: Apache Spark
[jira] [Commented] (SPARK-25259) Left/Right join support push down during-join predicates
[ https://issues.apache.org/jira/browse/SPARK-25259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16594606#comment-16594606 ] Apache Spark commented on SPARK-25259: User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/22250
[jira] [Commented] (SPARK-25259) Left/Right join support push down during-join predicates
[ https://issues.apache.org/jira/browse/SPARK-25259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16594563#comment-16594563 ] Yuming Wang commented on SPARK-25259: I'm working on it.
[jira] [Created] (SPARK-25259) Left/Right join support push down during-join predicates
Yuming Wang created SPARK-25259:

Summary: Left/Right join support push down during-join predicates
Key: SPARK-25259
URL: https://issues.apache.org/jira/browse/SPARK-25259
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 2.4.0
Reporter: Yuming Wang
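The proposed rewrite rests on an equivalence: for a LEFT OUTER JOIN, a during-join predicate that references only the null-supplying (right) side can be pushed below the join as a pre-filter on that side without changing the result. A minimal plain-Python sketch (not Spark, and not the actual optimizer rule) of this equivalence, using the sample data from the description:

```python
# Plain-Python check (not Spark) that pushing a right-side-only join
# predicate into a pre-filter on the right side preserves LEFT OUTER
# JOIN semantics. Tables are modeled as lists of tuples.

PROJECT = [
    ("AD3100", "ADMIN SERVICES", "D01"),
    ("IF1000", "QUERY SERVICES", "C01"),
    ("IF2000", "USER EDUCATION", "E01"),
    ("MA2100", "WELD LINE AUTOMATION", "D01"),
    ("PL2100", "WELD LINE PLANNING", "01"),
]
DEPARTMENT = [
    ("A00", "SPIFFY COMPUTER SERVICE DIV.", "10"),
    ("B01", "PLANNING", "20"),
    ("C01", "INFORMATION CENTER", "30"),
    ("D01", "DEVELOPMENT CENTER", None),
]

def left_outer_join(left, right, on):
    """Naive nested-loop LEFT OUTER JOIN; unmatched left rows pair with None."""
    out = []
    for l in left:
        matches = [r for r in right if on(l, r)]
        if matches:
            out.extend((l, r) for r in matches)
        else:
            out.append((l, None))
    return out

# Original plan: the predicate is evaluated as part of the join condition.
plan_a = left_outer_join(
    PROJECT, DEPARTMENT,
    lambda p, d: p[2] == d[0] and p[2] == "E01")

# Rewritten plan: the predicate is pushed below the join as a filter
# on DEPARTMENT; the join itself keeps only the equality condition.
filtered = [d for d in DEPARTMENT if d[0] == "E01"]
plan_b = left_outer_join(PROJECT, filtered, lambda p, d: p[2] == d[0])

assert plan_a == plan_b  # same rows, every PROJECT row preserved
```

Since no department row has DEPTNO 'E01', both plans return all five PROJECT rows padded with NULLs, which is exactly what the left join requires.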
[jira] [Commented] (SPARK-25258) Upgrade kryo package to version 4.0.2+
[ https://issues.apache.org/jira/browse/SPARK-25258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16594557#comment-16594557 ] Yuming Wang commented on SPARK-25258: I have submitted a PR: https://github.com/apache/spark/pull/22179 cc [~srowen]

Upgrade kryo package to version 4.0.2+

Key: SPARK-25258
URL: https://issues.apache.org/jira/browse/SPARK-25258
Project: Spark
Issue Type: Wish
Components: Spark Core
Affects Versions: 2.1.0, 2.3.1
Reporter: liupengcheng
Priority: Major

Recently, we encountered a kryo performance issue in Spark 2.1.0. The issue affects all kryo versions below 4.0.2, so all Spark versions may encounter it.

Issue description:
In the shuffle write phase, or during some spilling operations, Spark uses the kryo serializer to serialize data if `spark.serializer` is set to `KryoSerializer`. However, when the data contains some extremely large records, KryoSerializer's MapReferenceResolver gets expanded, and its `reset` method then takes a long time to set every entry of the writtenObjects table back to null.

com.esotericsoftware.kryo.util.MapReferenceResolver
{code:java}
public void reset () {
  readObjects.clear();
  writtenObjects.clear();
}

public void clear () {
  K[] keyTable = this.keyTable;
  for (int i = capacity + stashSize; i-- > 0;)
    keyTable[i] = null;
  size = 0;
  stashSize = 0;
}
{code}

I checked the kryo project on GitHub, and this issue appears to be fixed in 4.0.2+:
https://github.com/EsotericSoftware/kryo/commit/77935c696ee4976963aa5c6ac53d53d9b40b8bdd#diff-215fa9846e1e4e54bbeede0500de1e28

I was wondering if we can upgrade Spark's kryo dependency to 4.0.2+ to fix this problem.
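The quoted `clear()` walks the whole backing array, so its cost is proportional to the map's *capacity*, not its current size. A toy Python illustration of that cost pattern (not kryo's actual implementation — `IdentityMap`, `grow`, and the slot counter are hypothetical names for this sketch):

```python
# Toy illustration of the cost pattern described above (NOT kryo's code):
# a map whose clear() walks the entire backing array pays for capacity,
# not size. One huge record grows the resolver's table once, and every
# reset() afterwards stays expensive, even for tiny batches.

class IdentityMap:
    def __init__(self, capacity=16):
        self.table = [None] * capacity
        self.size = 0

    def grow(self, capacity):
        # Simulates the expansion caused by serializing one record that
        # contains a very large number of object references.
        self.table.extend([None] * (capacity - len(self.table)))

    def clear(self):
        # Like MapReferenceResolver.reset() before kryo 4.0.2: touches
        # every slot regardless of how many entries are actually set.
        slots_touched = 0
        for i in range(len(self.table)):
            self.table[i] = None
            slots_touched += 1
        self.size = 0
        return slots_touched

m = IdentityMap()
assert m.clear() == 16        # small table: reset is cheap
m.grow(1 << 20)
assert m.clear() == 1 << 20   # after one large record: every reset is O(2^20)
```

This is why the issue surfaces only with extremely large records: the table never shrinks, so the one-time expansion taxes every subsequent serialization.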
[jira] [Commented] (SPARK-4502) Spark SQL reads unnecessary nested fields from Parquet
[ https://issues.apache.org/jira/browse/SPARK-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16594546#comment-16594546 ] Damian Momot commented on SPARK-4502: I can see that this ticket was closed, but looking at https://github.com/apache/spark/pull/21320, only a very basic scenario is supported, and the feature itself is disabled by default. Are there any follow-up tickets to track the full feature implementation?

Spark SQL reads unnecessary nested fields from Parquet

Key: SPARK-4502
URL: https://issues.apache.org/jira/browse/SPARK-4502
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 1.1.0
Reporter: Liwen Sun
Assignee: Michael Allman
Priority: Critical
Fix For: 2.4.0

When reading a field of a nested column from Parquet, Spark SQL reads and assembles all the fields of that nested column. This is unnecessary, as Parquet supports fine-grained field reads out of a nested column. This may degrade performance significantly when a nested column has many fields.

For example, I loaded JSON tweets data into Spark SQL and ran the following query:
{{SELECT User.contributors_enabled from Tweets;}}
User is a nested structure that has 38 primitive fields (for the Tweets schema, see: https://dev.twitter.com/overview/api/tweets). Here is the log message:
{{14/11/19 16:36:49 INFO InternalParquetRecordReader: Assembled and processed 385779 records from 38 columns in 3976 ms: 97.02691 rec/ms, 3687.0227 cell/ms}}
For comparison, I also ran:
{{SELECT User FROM Tweets;}}
And here is the log message:
{{14/11/19 16:45:40 INFO InternalParquetRecordReader: Assembled and processed 385779 records from 38 columns in 9461 ms: 40.77571 rec/ms, 1549.477 cell/ms}}
So both queries load 38 columns from Parquet, while the first query only needs 1 column. I also measured the bytes read within Parquet. In these two cases, the same number of bytes (99365194 bytes) were read.
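The core idea behind the requested feature is schema pruning: given the full nested schema and the set of field paths a query actually touches, request only the minimal sub-schema from the Parquet reader. A hypothetical sketch of that pruning step (this is an illustration of the concept, not Spark's implementation; `prune_schema` and the dict-based schema encoding are invented for this example):

```python
# Hypothetical sketch of nested-schema pruning (not Spark's code).
# A schema is a dict mapping field name -> nested dict (struct) or
# None (leaf/primitive). Given the paths a query references, return
# the minimal schema to ask the columnar reader for.

def prune_schema(schema, paths):
    """paths is a set of dotted paths, e.g. {"User.contributors_enabled"}."""
    pruned = {}
    for path in paths:
        head, _, rest = path.partition(".")
        if head not in schema:
            continue  # path not in this schema; ignore
        child = schema[head]
        if rest and isinstance(child, dict):
            # Recurse into the struct and merge with any sibling paths.
            pruned.setdefault(head, {}).update(prune_schema(child, {rest}))
        else:
            pruned[head] = child  # leaf (or whole-struct) reference
    return pruned

# Simplified stand-in for the Tweets schema from the description.
tweets = {
    "User": {"contributors_enabled": None, "name": None, "id": None},
    "text": None,
}

# SELECT User.contributors_enabled FROM Tweets needs just one leaf,
# not all of User's fields.
assert prune_schema(tweets, {"User.contributors_enabled"}) == \
    {"User": {"contributors_enabled": None}}
```

With pruning in place, the first query in the description would read 1 column instead of 38, which is exactly the gap the measured byte counts expose.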
[jira] [Updated] (SPARK-25256) Plan mismatch errors in Hive tests in 2.12
[ https://issues.apache.org/jira/browse/SPARK-25256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-25256: Shepherd: (was: Sean Owen)

Plan mismatch errors in Hive tests in 2.12

Key: SPARK-25256
URL: https://issues.apache.org/jira/browse/SPARK-25256
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 2.4.0
Reporter: Sean Owen
Priority: Major

In Hive tests, in the Scala 2.12 build, still seeing a few failures that seem to show mismatching schema inference. Not clear whether it's the same as SPARK-25044. Examples:
{code:java}
- SPARK-5775 read array from partitioned_parquet_with_key_and_complextypes *** FAILED ***
Results do not match for query:
Timezone: sun.util.calendar.ZoneInfo[id="America/Los_Angeles",offset=-2880,dstSavings=360,useDaylight=true,transitions=185,lastRule=java.util.SimpleTimeZone[id=America/Los_Angeles,offset=-2880,dstSavings=360,useDaylight=true,startYear=0,startMode=3,startMonth=2,startDay=8,startDayOfWeek=1,startTime=720,startTimeMode=0,endMode=3,endMonth=10,endDay=1,endDayOfWeek=1,endTime=720,endTimeMode=0]]
Timezone Env:

== Parsed Logical Plan ==
'Project ['arrayField, 'p]
+- 'Filter ('p = 1)
   +- 'UnresolvedRelation `partitioned_parquet_with_key_and_complextypes`

== Analyzed Logical Plan ==
arrayField: array, p: int
Project [arrayField#82569, p#82570]
+- Filter (p#82570 = 1)
   +- SubqueryAlias `default`.`partitioned_parquet_with_key_and_complextypes`
      +- Relation[intField#82566,stringField#82567,structField#82568,arrayField#82569,p#82570] parquet

== Optimized Logical Plan ==
Project [arrayField#82569, p#82570]
+- Filter (isnotnull(p#82570) && (p#82570 = 1))
   +- Relation[intField#82566,stringField#82567,structField#82568,arrayField#82569,p#82570] parquet

== Physical Plan ==
*(1) Project [arrayField#82569, p#82570]
+- *(1) FileScan parquet default.partitioned_parquet_with_key_and_complextypes[arrayField#82569,p#82570] Batched: false, Format: Parquet, Location: PrunedInMemoryFileIndex[file:/home/srowen/spark-2.12/sql/hive/target/tmp/spark-d8d87d74-33e7-4f22..., PartitionCount: 1, PartitionFilters: [isnotnull(p#82570), (p#82570 = 1)], PushedFilters: [], ReadSchema: struct

== Results ==
!== Correct Answer - 10 ==   == Spark Answer - 10 ==
!struct<>                    struct,p:int>
![Range 1 to 1,1]            [WrappedArray(1),1]
![Range 1 to 10,1]           [WrappedArray(1, 2),1]
![Range 1 to 2,1]            [WrappedArray(1, 2, 3),1]
![Range 1 to 3,1]            [WrappedArray(1, 2, 3, 4),1]
![Range 1 to 4,1]            [WrappedArray(1, 2, 3, 4, 5),1]
![Range 1 to 5,1]            [WrappedArray(1, 2, 3, 4, 5, 6),1]
![Range 1 to 6,1]            [WrappedArray(1, 2, 3, 4, 5, 6, 7),1]
![Range 1 to 7,1]            [WrappedArray(1, 2, 3, 4, 5, 6, 7, 8),1]
![Range 1 to 8,1]            [WrappedArray(1, 2, 3, 4, 5, 6, 7, 8, 9),1]
![Range 1 to 9,1]            [WrappedArray(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),1]
(QueryTest.scala:163)
{code}
{code:java}
- SPARK-2693 udaf aggregates test *** FAILED ***
Results do not match for query:
Timezone: sun.util.calendar.ZoneInfo[id="America/Los_Angeles",offset=-2880,dstSavings=360,useDaylight=true,transitions=185,lastRule=java.util.SimpleTimeZone[id=America/Los_Angeles,offset=-2880,dstSavings=360,useDaylight=true,startYear=0,startMode=3,startMonth=2,startDay=8,startDayOfWeek=1,startTime=720,startTimeMode=0,endMode=3,endMonth=10,endDay=1,endDayOfWeek=1,endTime=720,endTimeMode=0]]
Timezone Env:

== Parsed Logical Plan ==
'GlobalLimit 1
+- 'LocalLimit 1
   +- 'Project [unresolvedalias('percentile('key, 'array(1, 1)), None)]
      +- 'UnresolvedRelation `src`

== Analyzed Logical Plan ==
percentile(key, array(1, 1), 1): array
GlobalLimit 1
+- LocalLimit 1
   +- Aggregate [percentile(key#205098, cast(array(1, 1) as array), 1, 0, 0) AS percentile(key, array(1, 1), 1)#205101]
      +- SubqueryAlias `default`.`src`
         +- HiveTableRelation `default`.`src`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key#205098, value#205099]

== Optimized Logical Plan ==
GlobalLimit 1
+- LocalLimit 1
   +- Aggregate [percentile(key#205098, [1.0,1.0], 1, 0, 0) AS percentile(key, array(1, 1), 1)#205101]
      +- Project [key#205098]
         +- HiveTableRelation `default`.`src`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key#205098, value#205099]

== Physical Plan ==
CollectLimit 1
+- ObjectHashAggregate(keys=[], functions=[percentile(key#205098, [1.0,1.0], 1, 0, 0)], output=[percentile(key, array(1, 1), 1)#205101])
   +- Exchange SinglePartition
      +- ObjectHashAggregate(keys=[], functions=[partial_percentile(key#205098, [1.0,1.0], 1, 0, 0)], output=[buf#205104])
         +- Scan hive default.src [key#205098], HiveTableRelation `default`.`src`,
{code}
[jira] [Commented] (SPARK-25251) Make spark-csv's `quote` and `escape` options conform to RFC 4180
[ https://issues.apache.org/jira/browse/SPARK-25251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16594522#comment-16594522 ] Hyukjin Kwon commented on SPARK-25251: This is a duplicate of SPARK-22236.

Make spark-csv's `quote` and `escape` options conform to RFC 4180

Key: SPARK-25251
URL: https://issues.apache.org/jira/browse/SPARK-25251
Project: Spark
Issue Type: Bug
Components: Input/Output
Affects Versions: 2.3.0, 2.3.1, 2.4.0, 3.0.0
Reporter: Ruslan Dautkhanov
Priority: Major

As described in [RFC-4180|https://tools.ietf.org/html/rfc4180], page 2:
{noformat}
7. If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote
{noformat}
That's what Excel does by default, for example.

In Spark (as of Spark 2.1), however, escaping is done by default in a non-RFC way, using a backslash (\). To fix this, you have to explicitly tell Spark to use a double quote as the escape character:
{code}
.option('quote', '"')
.option('escape', '"')
{code}
This may explain cases where a comma character wasn't interpreted correctly when it appeared inside a quoted column.

So this is a request to make the spark-csv reader RFC-4180 compatible with regard to the default values of the `quote` and `escape` options (make both equal to "). Since this is a backward-incompatible change, Spark 3.0 might be a good release for it.

Some more background: https://stackoverflow.com/a/45138591/470583
[jira] [Resolved] (SPARK-25251) Make spark-csv's `quote` and `escape` options conform to RFC 4180
[ https://issues.apache.org/jira/browse/SPARK-25251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-25251. Resolution: Duplicate
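The RFC 4180 behavior the issue asks for — a double quote inside a quoted field escaped by doubling it, not by a backslash — can be seen in Python's standard csv module, whose defaults follow that convention:

```python
# Python's csv module defaults to RFC 4180-style quoting: a double
# quote inside a quoted field is written as "" (doubled), and commas
# inside quoted fields are not treated as delimiters. This is the
# default behavior the issue requests for Spark's CSV reader.
import csv
import io

line = '1,"a ""quoted"" word, with comma",3\r\n'
rows = list(csv.reader(io.StringIO(line)))
assert rows == [['1', 'a "quoted" word, with comma', '3']]

# Writing round-trips the same way: the embedded quote is doubled.
buf = io.StringIO()
csv.writer(buf).writerow(['he said "hi"'])
assert buf.getvalue().strip() == '"he said ""hi"""'
```

A backslash-escaping reader (Spark's pre-3.0 default) would fail to parse the first line the same way, which is the incompatibility with Excel and other RFC 4180 producers that the issue describes.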
[jira] [Commented] (SPARK-23255) Add user guide and examples for DataFrame image reading functions
[ https://issues.apache.org/jira/browse/SPARK-23255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16594514#comment-16594514 ] Hyukjin Kwon commented on SPARK-23255: BTW, please be concise and precise in the documentation and contents, since this is also one of the main features in Spark 2.3.

Add user guide and examples for DataFrame image reading functions

Key: SPARK-23255
URL: https://issues.apache.org/jira/browse/SPARK-23255
Project: Spark
Issue Type: Documentation
Components: ML, PySpark
Affects Versions: 2.3.0
Reporter: Nick Pentreath
Priority: Minor

SPARK-21866 added built-in support for reading image data into a DataFrame. This new functionality should be documented in the user guide, with example usage.
[jira] [Commented] (SPARK-23255) Add user guide and examples for DataFrame image reading functions
[ https://issues.apache.org/jira/browse/SPARK-23255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16594513#comment-16594513 ] Hyukjin Kwon commented on SPARK-23255: Yea, I think so. For instance, https://github.com/apache/spark/pull/22121
[jira] [Resolved] (SPARK-21232) New built-in SQL function - Data_Type
[ https://issues.apache.org/jira/browse/SPARK-21232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-21232. Resolution: Won't Fix. Resolving this per the discussion in the JIRA.

New built-in SQL function - Data_Type

Key: SPARK-21232
URL: https://issues.apache.org/jira/browse/SPARK-21232
Project: Spark
Issue Type: Improvement
Components: PySpark, SparkR, SQL
Affects Versions: 2.1.1
Reporter: Mario Molina
Priority: Minor

This function returns the data type of a given column.
{code:java}
data_type("a")
// returns string
{code}
[jira] [Commented] (SPARK-23255) Add user guide and examples for DataFrame image reading functions
[ https://issues.apache.org/jira/browse/SPARK-23255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16594488#comment-16594488 ] Divay Jindal commented on SPARK-23255: Hey, I have a very naive doubt: for this task, do I need to add a guide to the docs/ directory, or is it code-level documentation and examples?
[jira] [Commented] (SPARK-24391) from_json should support arrays of primitives, and more generally all JSON
[ https://issues.apache.org/jira/browse/SPARK-24391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16594479#comment-16594479 ] Hyukjin Kwon commented on SPARK-24391: For to_json, a separate JIRA was filed: SPARK-25252.

from_json should support arrays of primitives, and more generally all JSON

Key: SPARK-24391
URL: https://issues.apache.org/jira/browse/SPARK-24391
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.3.0
Reporter: Sam Kitajima-Kimbrel
Assignee: Maxim Gekk
Priority: Major
Fix For: 2.4.0

https://issues.apache.org/jira/browse/SPARK-19849 and https://issues.apache.org/jira/browse/SPARK-21513 brought support for more column types to functions.to_json/from_json, but I also have cases where I'd like to simply (de)serialize an array of primitives to/from JSON when outputting to certain destinations, which does not work:
{code:java}
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._

scala> import spark.implicits._
import spark.implicits._

scala> val df = Seq("[1, 2, 3]").toDF("a")
df: org.apache.spark.sql.DataFrame = [a: string]

scala> val schema = new ArrayType(IntegerType, false)
schema: org.apache.spark.sql.types.ArrayType = ArrayType(IntegerType,false)

scala> df.select(from_json($"a", schema))
org.apache.spark.sql.AnalysisException: cannot resolve 'jsontostructs(`a`)' due to data type mismatch: Input schema array must be a struct or an array of structs.;;
'Project [jsontostructs(ArrayType(IntegerType,false), a#3, Some(America/Los_Angeles)) AS jsontostructs(a)#10]

scala> val arrayDf = Seq(Array(1, 2, 3)).toDF("arr")
arrayDf: org.apache.spark.sql.DataFrame = [arr: array]

scala> arrayDf.select(to_json($"arr"))
org.apache.spark.sql.AnalysisException: cannot resolve 'structstojson(`arr`)' due to data type mismatch: Input type array must be a struct, array of structs or a map or array of map.;;
'Project [structstojson(arr#19, Some(America/Los_Angeles)) AS structstojson(arr)#26]
{code}
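The desired semantics — deserializing a JSON column holding a bare array of primitives, with no wrapping struct — can be illustrated with plain Python's json module (an illustration of what the issue asks from_json/to_json to do, not Spark's API):

```python
# What the issue asks from_json to support, shown with plain Python:
# a string column of JSON arrays of primitives should parse directly
# into arrays, without needing a struct wrapper.
import json

column = ["[1, 2, 3]", "[4, 5]"]          # a "string column" of JSON arrays
parsed = [json.loads(v) for v in column]  # from_json equivalent
assert parsed == [[1, 2, 3], [4, 5]]

# And the reverse direction (the to_json side, split out to SPARK-25252):
# an array-of-primitives column should serialize straight to JSON text.
assert [json.dumps(v, separators=(",", ":")) for v in parsed] == \
    ["[1,2,3]", "[4,5]"]
```

JSON itself allows any value (array, primitive, object) at the top level, which is why the issue title asks for "more generally all JSON" rather than only structs and arrays of structs.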
[jira] [Updated] (SPARK-24391) from_json should support arrays of primitives, and more generally all JSON
[ https://issues.apache.org/jira/browse/SPARK-24391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-24391: Summary: from_json should support arrays of primitives, and more generally all JSON (was: to_json/from_json should support arrays of primitives, and more generally all JSON)
[jira] [Resolved] (SPARK-25213) DataSourceV2 doesn't seem to produce unsafe rows
[ https://issues.apache.org/jira/browse/SPARK-25213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-25213. Resolution: Fixed; Assignee: Li Jin; Fix Version/s: 2.4.0

DataSourceV2 doesn't seem to produce unsafe rows

Key: SPARK-25213
URL: https://issues.apache.org/jira/browse/SPARK-25213
Project: Spark
Issue Type: Task
Components: SQL
Affects Versions: 2.4.0
Reporter: Li Jin
Assignee: Li Jin
Priority: Major
Fix For: 2.4.0

Reproduce (need to compile test-classes):
bin/pyspark --driver-class-path sql/core/target/scala-2.11/test-classes
{code:java}
datasource_v2_df = spark.read \
    .format("org.apache.spark.sql.sources.v2.SimpleDataSourceV2") \
    .load()
result = datasource_v2_df.withColumn('x', udf(lambda x: x, 'int')(datasource_v2_df['i']))
result.show()
{code}
The above code fails with:
{code:java}
Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast to org.apache.spark.sql.catalyst.expressions.UnsafeRow
	at org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1$$anonfun$5.apply(EvalPythonExec.scala:127)
	at org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1$$anonfun$5.apply(EvalPythonExec.scala:126)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
{code}
Seems like Data Source V2 doesn't produce UnsafeRows here.
[jira] [Assigned] (SPARK-24721) Failed to use PythonUDF with literal inputs in filter with data sources
[ https://issues.apache.org/jira/browse/SPARK-24721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-24721: --- Assignee: Li Jin > Failed to use PythonUDF with literal inputs in filter with data sources > --- > > Key: SPARK-24721 > URL: https://issues.apache.org/jira/browse/SPARK-24721 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.3.1 >Reporter: Xiao Li >Assignee: Li Jin >Priority: Major > Fix For: 2.4.0 > > > {code} > import random > from pyspark.sql.functions import * > from pyspark.sql.types import * > def random_probability(label): > if label == 1.0: > return random.uniform(0.5, 1.0) > else: > return random.uniform(0.0, 0.4999) > def randomize_label(ratio): > > if random.random() >= ratio: > return 1.0 > else: > return 0.0 > random_probability = udf(random_probability, DoubleType()) > randomize_label = udf(randomize_label, DoubleType()) > spark.range(10).write.mode("overwrite").format('csv').save("/tmp/tab3") > babydf = spark.read.csv("/tmp/tab3") > data_modified_label = babydf.withColumn( > 'random_label', randomize_label(lit(1 - 0.1)) > ) > data_modified_random = data_modified_label.withColumn( > 'random_probability', > random_probability(col('random_label')) > ) > data_modified_label.filter(col('random_label') == 0).show() > {code} > The above code will generate the following exception: > {code} > Py4JJavaError: An error occurred while calling o446.showString. > : java.lang.RuntimeException: Invalid PythonUDF randomize_label(0.9), > requires attributes from more than one child. 
> at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:166) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:165) > at scala.collection.immutable.Stream.foreach(Stream.scala:594) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:165) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:116) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:112) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:77) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307) > at
[jira] [Resolved] (SPARK-24721) Failed to use PythonUDF with literal inputs in filter with data sources
[ https://issues.apache.org/jira/browse/SPARK-24721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-24721. - Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 22104 [https://github.com/apache/spark/pull/22104] > Failed to use PythonUDF with literal inputs in filter with data sources > --- > > Key: SPARK-24721 > URL: https://issues.apache.org/jira/browse/SPARK-24721 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.3.1 >Reporter: Xiao Li >Assignee: Li Jin >Priority: Major > Fix For: 2.4.0 > > > {code} > import random > from pyspark.sql.functions import * > from pyspark.sql.types import * > def random_probability(label): > if label == 1.0: > return random.uniform(0.5, 1.0) > else: > return random.uniform(0.0, 0.4999) > def randomize_label(ratio): > > if random.random() >= ratio: > return 1.0 > else: > return 0.0 > random_probability = udf(random_probability, DoubleType()) > randomize_label = udf(randomize_label, DoubleType()) > spark.range(10).write.mode("overwrite").format('csv').save("/tmp/tab3") > babydf = spark.read.csv("/tmp/tab3") > data_modified_label = babydf.withColumn( > 'random_label', randomize_label(lit(1 - 0.1)) > ) > data_modified_random = data_modified_label.withColumn( > 'random_probability', > random_probability(col('random_label')) > ) > data_modified_label.filter(col('random_label') == 0).show() > {code} > The above code will generate the following exception: > {code} > Py4JJavaError: An error occurred while calling o446.showString. > : java.lang.RuntimeException: Invalid PythonUDF randomize_label(0.9), > requires attributes from more than one child. 
> at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:166) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:165) > at scala.collection.immutable.Stream.foreach(Stream.scala:594) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:165) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:116) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:112) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:77) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
[jira] [Created] (SPARK-25258) Upgrade kryo package to version 4.0.2+
liupengcheng created SPARK-25258: Summary: Upgrade kryo package to version 4.0.2+ Key: SPARK-25258 URL: https://issues.apache.org/jira/browse/SPARK-25258 Project: Spark Issue Type: Wish Components: Spark Core Affects Versions: 2.3.1, 2.1.0 Reporter: liupengcheng Recently, we encountered a Kryo performance issue in Spark 2.1.0. The issue affects all Kryo versions below 4.0.2, so it seems that all Spark versions might encounter it. Issue description: In the shuffle write phase or some spilling operations, Spark uses the Kryo serializer to serialize data if `spark.serializer` is set to `KryoSerializer`. However, when the data contains some extremely large records, KryoSerializer's MapReferenceResolver would be expanded, and its `reset` method will take a long time to reset all items in the writtenObjects table to null. com.esotericsoftware.kryo.util.MapReferenceResolver {code:java} public void reset () { readObjects.clear(); writtenObjects.clear(); } public void clear () { K[] keyTable = this.keyTable; for (int i = capacity + stashSize; i-- > 0;) keyTable[i] = null; size = 0; stashSize = 0; } {code} I checked the kryo project on GitHub, and this issue seems to be fixed in 4.0.2+: [https://github.com/EsotericSoftware/kryo/commit/77935c696ee4976963aa5c6ac53d53d9b40b8bdd#diff-215fa9846e1e4e54bbeede0500de1e28] I was wondering if we can upgrade Spark's kryo dependency to 4.0.2+ to fix this problem.
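A minimal sketch of the cost pattern described above, using a toy Python model (the class and method names here are hypothetical stand-ins, not the real Kryo code): because `clear` walks the whole backing table, reset cost is proportional to the table's *capacity*, not to the number of live entries, so one record with many references permanently inflates the cost of every later reset.

```python
# Toy model of the pre-4.0.2 MapReferenceResolver behaviour described
# above (hypothetical simplification, not the actual Kryo classes).

class ToyReferenceResolver:
    def __init__(self, capacity=64):
        self.key_table = [None] * capacity
        self.size = 0

    def put(self, obj):
        # Grow the backing table when needed; capacity never shrinks.
        if self.size >= len(self.key_table):
            self.key_table.extend([None] * len(self.key_table))
        self.key_table[self.size] = obj
        self.size += 1

    def reset(self):
        # Null out every slot up to capacity, even if only a few are
        # occupied -- this is the capacity-proportional cost.
        ops = 0
        for i in range(len(self.key_table)):
            self.key_table[i] = None
            ops += 1
        self.size = 0
        return ops

resolver = ToyReferenceResolver()
for _ in range(100_000):        # one record with many references...
    resolver.put(object())
resolver.reset()                # ...inflates the table for good

resolver.put(object())          # a tiny record afterwards
ops = resolver.reset()          # still pays for the large capacity
print(ops)                      # proportional to capacity, not to size 1
```

The 4.0.2 fix referenced in the commit above avoids paying this full-capacity cost on every reset.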
[jira] [Updated] (SPARK-23207) Shuffle+Repartition on an DataFrame could lead to incorrect answers
[ https://issues.apache.org/jira/browse/SPARK-23207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-23207: Fix Version/s: 2.1.4 > Shuffle+Repartition on an DataFrame could lead to incorrect answers > --- > > Key: SPARK-23207 > URL: https://issues.apache.org/jira/browse/SPARK-23207 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0, 2.3.0 >Reporter: Jiang Xingbo >Assignee: Jiang Xingbo >Priority: Blocker > Labels: correctness > Fix For: 2.1.4, 2.2.3, 2.3.0 > > > Currently shuffle repartition uses RoundRobinPartitioning; the generated > result is nondeterministic because the ordering of the input rows is not > determined. > The bug can be triggered when a repartition call follows a shuffle > (which leads to non-deterministic row ordering), as the pattern shows > below: > upstream stage -> repartition stage -> result stage > (-> indicates a shuffle) > When one of the executor processes goes down, some tasks on the repartition > stage will be retried and generate inconsistent ordering, and some tasks of > the result stage will be retried, generating different data. > The following code returns 931532 instead of 1000000: > {code} > import scala.sys.process._ > import org.apache.spark.TaskContext > val res = spark.range(0, 1000 * 1000, 1).repartition(200).map { x => > x > }.repartition(200).map { x => > if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) { > throw new Exception("pkill -f java".!!) > } > x > } > res.distinct().count() > {code}
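The order-sensitivity at the heart of this bug can be sketched in plain Python (not Spark code, just an illustration of round-robin assignment): the same rows arriving in a different order land in different partitions, so a retried upstream task that re-emits rows in another order changes what each downstream partition contains.

```python
# Round-robin partitioning assigns rows by arrival position, so
# partition contents depend on input order, not just input values.

def round_robin(rows, num_partitions):
    parts = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        parts[i % num_partitions].append(row)
    return parts

a = round_robin([10, 20, 30, 40], 2)  # original task ordering
b = round_robin([10, 30, 20, 40], 2)  # retried task, reordered rows

print(a)  # [[10, 30], [20, 40]]
print(b)  # [[10, 20], [30, 40]]
# Same rows overall, but partition 0's contents differ between runs --
# so a partially retried downstream stage can see inconsistent data.
```

This is why a retry of only *some* repartition-stage tasks, as in the repro above, can drop or duplicate rows in the final result.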
[jira] [Updated] (SPARK-25164) Parquet reader builds entire list of columns once for each column
[ https://issues.apache.org/jira/browse/SPARK-25164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25164: Fix Version/s: 2.3.2 2.2.3 > Parquet reader builds entire list of columns once for each column > - > > Key: SPARK-25164 > URL: https://issues.apache.org/jira/browse/SPARK-25164 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Bruce Robbins >Assignee: Bruce Robbins >Priority: Minor > Fix For: 2.2.3, 2.3.2, 2.4.0 > > > {{VectorizedParquetRecordReader.initializeInternal}} loops through each > column, and for each column it calls > {noformat} > requestedSchema.getColumns().get(i) > {noformat} > However, {{MessageType.getColumns}} will build the entire column list from > getPaths(0). > {noformat} > public List<ColumnDescriptor> getColumns() { > List<String[]> paths = this.getPaths(0); > List<ColumnDescriptor> columns = new > ArrayList<ColumnDescriptor>(paths.size()); > for (String[] path : paths) { > // TODO: optimize this > > PrimitiveType primitiveType = getType(path).asPrimitiveType(); > columns.add(new ColumnDescriptor( > path, > primitiveType, > getMaxRepetitionLevel(path), > getMaxDefinitionLevel(path))); > } > return columns; > } > {noformat} > This means that for each parquet file, this routine indirectly iterates > colCount*colCount times. > This is actually not particularly noticeable unless you have: > - many parquet files > - many columns > To verify that this is an issue, I created a 1 million record parquet table > with 6000 columns of type double and 67 files (so initializeInternal is > called 67 times). I ran the following query: > {noformat} > sql("select * from 6000_1m_double where id1 = 1").collect > {noformat} > I used Spark from the master branch. I had 8 executor threads. The filter > returns only a few thousand records. The query ran (on average) for 6.4 > minutes. 
> Then I cached the column list at the top of {{initializeInternal}} as follows: > {noformat} > List<ColumnDescriptor> columnCache = requestedSchema.getColumns(); > {noformat} > Then I changed {{initializeInternal}} to use {{columnCache}} rather than > {{requestedSchema.getColumns()}}. > With the column cache variable, the same query runs in 5 minutes. So with my > simple query, you save 22% of the time by not rebuilding the column list for each > column. > You get additional savings with a paths cache variable, now saving 34% in > total on the above query.
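The caching fix described above can be sketched as a toy Python model (hypothetical stand-ins for the Java reader, just to show the call-count difference): hoisting the expensive column-list build out of the per-column loop turns colCount rebuilds into one.

```python
# Count how many times the column list is rebuilt, with and without
# the cache described in the issue.

build_calls = 0

def get_columns(num_cols):
    # Stand-in for MessageType.getColumns(): rebuilds the whole
    # column list (O(num_cols) work) on every call.
    global build_calls
    build_calls += 1
    return [f"col_{i}" for i in range(num_cols)]

def init_slow(num_cols):
    # Pre-fix pattern: getColumns() invoked once per column.
    for i in range(num_cols):
        _ = get_columns(num_cols)[i]

def init_cached(num_cols):
    # Post-fix pattern: build the list once, index into the cache.
    cache = get_columns(num_cols)
    for i in range(num_cols):
        _ = cache[i]

build_calls = 0
init_slow(6000)
slow_calls = build_calls       # one rebuild per column

build_calls = 0
init_cached(6000)
cached_calls = build_calls     # a single rebuild

print(slow_calls, cached_calls)  # 6000 1
```

With 6000 columns this is 6000 O(n) rebuilds versus one, matching the colCount*colCount behavior the reporter measured.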
[jira] [Commented] (SPARK-25257) v2 MicroBatchReaders can't resume from checkpoints
[ https://issues.apache.org/jira/browse/SPARK-25257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16594298#comment-16594298 ] Seth Fitzsimmons commented on SPARK-25257: -- I traced this a bit further; {{OffsetSeq.toStreamProgress}} looks like it might be the right place to deserialize. {{deserialize.patch}} deserializes if the source is a {{MicroBatchReader}} (i.e. has a {{deserializeOffset}} method). {{runBatch}} seems partially right (it's specific to {{MicroBatchReader}}s), but {{available}} (a variable in scope) may be either a {{SerializedOffset}} or an {{Offset}}, depending on whether a checkpoint is being resumed or the stream is starting from a clean slate. > v2 MicroBatchReaders can't resume from checkpoints > -- > > Key: SPARK-25257 > URL: https://issues.apache.org/jira/browse/SPARK-25257 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.1 >Reporter: Seth Fitzsimmons >Priority: Major > Attachments: deserialize.patch > > > When resuming from a checkpoint: > {code:java} > writeStream.option("checkpointLocation", > "/tmp/checkpoint").format("console").start > {code} > The stream reader fails with: > {noformat} > osmesa.common.streaming.AugmentedDiffMicroBatchReader@59e19287 > at > org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189) > Caused by: java.lang.ClassCastException: > org.apache.spark.sql.execution.streaming.SerializedOffset cannot be cast to > org.apache.spark.sql.sources.v2.reader.streaming.Offset > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1$$anonfun$apply$9.apply(MicroBatchExecution.scala:405) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1$$anonfun$apply$9.apply(MicroBatchExecution.scala:390) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at scala.collection.Iterator$class.foreach(Iterator.scala:891) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1334) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at > org.apache.spark.sql.execution.streaming.StreamProgress.foreach(StreamProgress.scala:25) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at > org.apache.spark.sql.execution.streaming.StreamProgress.flatMap(StreamProgress.scala:25) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:390) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:390) > at > org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271) > at > org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:389) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:133) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121) > at > 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1$$anonfun$apply$9.apply(MicroBatchExecution.scala:390) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at scala.collection.Iterator$class.foreach(Iterator.scala:891) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1334) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at > org.apache.spark.sql.execution.streaming.StreamProgress.foreach(StreamProgress.scala:25) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at > org.apache.spark.sql.execution.streaming.StreamProgress.flatMap(StreamProgress.scala:25) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:390) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:390) > at > org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271) > at > org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:389) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:133) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121) > at > 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121) > at > org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271) > at > org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:121) > at > org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:117) >
[jira] [Updated] (SPARK-25257) v2 MicroBatchReaders can't resume from checkpoints
[ https://issues.apache.org/jira/browse/SPARK-25257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Seth Fitzsimmons updated SPARK-25257: - Description: When resuming from a checkpoint: {code:java} writeStream.option("checkpointLocation", "/tmp/checkpoint").format("console").start {code} The stream reader fails with: {noformat} osmesa.common.streaming.AugmentedDiffMicroBatchReader@59e19287 at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295) at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189) Caused by: java.lang.ClassCastException: org.apache.spark.sql.execution.streaming.SerializedOffset cannot be cast to org.apache.spark.sql.sources.v2.reader.streaming.Offset at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1$$anonfun$apply$9.apply(MicroBatchExecution.scala:405) at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1$$anonfun$apply$9.apply(MicroBatchExecution.scala:390) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) at scala.collection.Iterator$class.foreach(Iterator.scala:891) at scala.collection.AbstractIterator.foreach(Iterator.scala:1334) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at org.apache.spark.sql.execution.streaming.StreamProgress.foreach(StreamProgress.scala:25) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) at org.apache.spark.sql.execution.streaming.StreamProgress.flatMap(StreamProgress.scala:25) at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:390) at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:390) at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271) at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58) at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:389) at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:133) at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121) at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121) at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271) at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58) at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:121) at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56) at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:117) at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279) ... 
1 more {noformat} The root cause appears to be that the {{SerializedOffset}} (JSON, from disk) is never deserialized; I would expect to see something along the lines of {{reader.deserializeOffset(off.json)}} here (unless {{available}} is intended to be deserialized elsewhere): [https://github.com/apache/spark/blob/4dc82259d81102e0cb48f4cb2e8075f80d899ac4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L405] was: When resuming from a checkpoint: {code:java} writeStream.option("checkpointLocation", "/tmp/checkpoint").format("console").start {code} The stream reader fails with: {noformat} osmesa.common.streaming.AugmentedDiffMicroBatchReader@59e19287 at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295) at o
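The shape of the fix the reporter suggests can be sketched as a Python analogy (not the actual Scala internals; names like `ToyMicroBatchReader` and `resolve` are hypothetical): an offset restored from the checkpoint log is just a raw JSON wrapper and must go through the reader's deserialization hook before it can be used as a typed offset, while a fresh in-memory offset is already typed.

```python
import json

class SerializedOffset:
    """Raw JSON offset as read back from the checkpoint log."""
    def __init__(self, json_str):
        self.json = json_str

class TypedOffset:
    """Stand-in for a source-specific, in-memory v2 offset."""
    def __init__(self, sequence):
        self.sequence = sequence

class ToyMicroBatchReader:
    def deserialize_offset(self, json_str):
        # Source-specific parsing of the checkpointed JSON.
        return TypedOffset(json.loads(json_str)["sequence"])

def resolve(reader, available):
    # The guard the reporter expects: deserialize only raw offsets
    # from disk; pass already-typed offsets through unchanged.
    if isinstance(available, SerializedOffset):
        return reader.deserialize_offset(available.json)
    return available

reader = ToyMicroBatchReader()
restored = resolve(reader, SerializedOffset('{"sequence": 42}'))  # resume path
fresh = resolve(reader, TypedOffset(7))                           # clean-slate path
print(restored.sequence, fresh.sequence)  # 42 7
```

Without that guard, the resume path hands the raw wrapper straight to code expecting a typed offset, which is exactly the ClassCastException in the trace above.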
[jira] [Updated] (SPARK-25257) v2 MicroBatchReaders can't resume from checkpoints
[ https://issues.apache.org/jira/browse/SPARK-25257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Seth Fitzsimmons updated SPARK-25257: - Summary: v2 MicroBatchReaders can't resume from checkpoints (was: java.lang.ClassCastException: org.apache.spark.sql.execution.streaming.SerializedOffset cannot be cast to org.apache.spark.sql.sources.v2.reader.streaming.Offset) > v2 MicroBatchReaders can't resume from checkpoints > -- > > Key: SPARK-25257 > URL: https://issues.apache.org/jira/browse/SPARK-25257 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.1 >Reporter: Seth Fitzsimmons >Priority: Major > Attachments: deserialize.patch > > > When resuming from a checkpoint: > {code:java} > writeStream.option("checkpointLocation", > "/tmp/checkpoint").format("console").start > {code} > The stream reader fails with: > {noformat} > osmesa.common.streaming.AugmentedDiffMicroBatchReader@59e19287 > at > org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189) > Caused by: java.lang.ClassCastException: > org.apache.spark.sql.execution.streaming.SerializedOffset cannot be cast to > org.apache.spark.sql.sources.v2.reader.streaming.Offset > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1$$anonfun$apply$9.apply(MicroBatchExecution.scala:405) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1$$anonfun$apply$9.apply(MicroBatchExecution.scala:390) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at 
scala.collection.Iterator$class.foreach(Iterator.scala:891) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1334) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at > org.apache.spark.sql.execution.streaming.StreamProgress.foreach(StreamProgress.scala:25) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at > org.apache.spark.sql.execution.streaming.StreamProgress.flatMap(StreamProgress.scala:25) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:390) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:390) > at > org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271) > at > org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:389) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:133) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121) > at > org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271) > at > org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58) > at > 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:121) > at > org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:117) > at > org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279) > ... 1 more > {noformat} > The root cause appears to be that the `SerializedOffset` (JSON, from disk) is > never deserialized; I would expect to
[jira] [Updated] (SPARK-25257) java.lang.ClassCastException: org.apache.spark.sql.execution.streaming.SerializedOffset cannot be cast to org.apache.spark.sql.sources.v2.reader.streaming.Offset
[ https://issues.apache.org/jira/browse/SPARK-25257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Seth Fitzsimmons updated SPARK-25257: - Attachment: deserialize.patch > java.lang.ClassCastException: > org.apache.spark.sql.execution.streaming.SerializedOffset cannot be cast to > org.apache.spark.sql.sources.v2.reader.streaming.Offset > - > > Key: SPARK-25257 > URL: https://issues.apache.org/jira/browse/SPARK-25257 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.1 >Reporter: Seth Fitzsimmons >Priority: Major > Attachments: deserialize.patch > > > When resuming from a checkpoint: > {code:java} > writeStream.option("checkpointLocation", > "/tmp/checkpoint").format("console").start > {code} > The stream reader fails with: > {noformat} > osmesa.common.streaming.AugmentedDiffMicroBatchReader@59e19287 > at > org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189) > Caused by: java.lang.ClassCastException: > org.apache.spark.sql.execution.streaming.SerializedOffset cannot be cast to > org.apache.spark.sql.sources.v2.reader.streaming.Offset > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1$$anonfun$apply$9.apply(MicroBatchExecution.scala:405) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1$$anonfun$apply$9.apply(MicroBatchExecution.scala:390) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at scala.collection.Iterator$class.foreach(Iterator.scala:891) > at 
scala.collection.AbstractIterator.foreach(Iterator.scala:1334) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at > org.apache.spark.sql.execution.streaming.StreamProgress.foreach(StreamProgress.scala:25) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at > org.apache.spark.sql.execution.streaming.StreamProgress.flatMap(StreamProgress.scala:25) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:390) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:390) > at > org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271) > at > org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:389) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:133) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121) > at > org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271) > at > org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58) > at > 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:121) > at > org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:117) > at > org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279) > ... 1 more > {noformat} > The root cause appears to be that the `SerializedOffset` (JSON, from disk) is > never deserialized; I would expect to see something along the lines of > `reader.deserializeOffset(off.json)` here (unless `available` is intended to be > deserialized elsewhere): > https://github.com/apache/spark/blob/4dc82259d81102e0cb48f4cb2e8075f80d899ac4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L405
[jira] [Commented] (SPARK-25250) Race condition with tasks running when new attempt for same stage is created leads to other task in the next attempt running on the same partition id retry multiple times
[ https://issues.apache.org/jira/browse/SPARK-25250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16594281#comment-16594281 ] Thomas Graves commented on SPARK-25250: --- We are hitting a race condition here between the TaskSetManager and the DAGScheduler. There is code in the TaskSetManager that is supposed to mark all task attempts for a partition completed when any of them succeeds, but in this case the second attempt had not been created yet. There is also code that only starts tasks in the second attempt that have not yet finished, but again there is a race between when the TaskSetManager sends the message that a task has ended and when the DAGScheduler starts a new stage attempt. In the example given above, stage 4.0 hit a fetch failure, so the map stage was rerun. Task 9000 for partition 9000 in stage 4.0 then finished and sent a taskEnded message to the DAGScheduler. Before the DAGScheduler processed that task's completion, it calculated the tasks needed for stage 4.1, which included a task for partition 9000, so it ran a task for partition 9000 that always fails with commitDenied and keeps being rerun. When task 9000 for stage 4.0 finished, the TaskSetManager called into sched.markPartitionCompletedInAllTaskSets, but since stage attempt 4.1 had not been created yet, it did not mark task 2000 for that partition as completed; and since the DAGScheduler had not yet processed the taskEnd event, it started a task it did not need to when it launched stage 4.1. We need to figure out how to handle this race. 
> Race condition with tasks running when new attempt for same stage is created > leads to other task in the next attempt running on the same partition id > retry multiple times > -- > > Key: SPARK-25250 > URL: https://issues.apache.org/jira/browse/SPARK-25250 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 2.3.1 >Reporter: Parth Gandhi >Priority: Major > > We recently had a scenario where a race condition occurred when a task from > previous stage attempt just finished before new attempt for the same stage > was created due to fetch failure, so the new task created in the second > attempt on the same partition id was retrying multiple times due to > TaskCommitDenied Exception without realizing that the task in earlier attempt > was already successful. > For example, consider a task with partition id 9000 and index 9000 running in > stage 4.0. We see a fetch failure so thus, we spawn a new stage attempt 4.1. > Just within this timespan, the above task completes successfully, thus, > marking the partition id 9000 as complete for 4.0. However, as stage 4.1 has > not yet been created, the taskset info for that stage is not available to the > TaskScheduler so, naturally, the partition id 9000 has not been marked > completed for 4.1. Stage 4.1 now spawns task with index 2000 on the same > partition id 9000. This task fails due to CommitDeniedException and since, it > does not see the corresponding partition id as been marked successful, it > keeps retrying multiple times until the job finally succeeds. It doesn't > cause any job failures because the DAG scheduler is tracking the partitions > separate from the task set managers. > > Steps to Reproduce: > # Run any large job involving shuffle operation. > # When the ShuffleMap stage finishes and the ResultStage begins running, > cause this stage to throw a fetch failure exception(Try deleting certain > shuffle files on any host). 
> # Observe the task attempt numbers for the next stage attempt. Please note > that this issue is an intermittent one, so it might not happen all the time. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
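The race described in the comment and description above can be illustrated with a toy model (plain Python, not Spark code; class and method names are simplified stand-ins for TaskSetManager/DAGScheduler state):

```python
# Toy model of the race: stage attempt 4.1 is created from a stale view of
# pending partitions before the scheduler processes the success of partition
# 9000 in attempt 4.0, so 4.1 launches a task for an already-committed
# partition, and every commit attempt by that task is denied.

class TaskSetManager:
    def __init__(self, stage, attempt, partitions):
        self.stage, self.attempt = stage, attempt
        self.pending = set(partitions)

    def mark_partition_completed(self, pid):
        self.pending.discard(pid)

class Scheduler:
    def __init__(self):
        self.task_sets = []     # all task set attempts for the stage
        self.committed = set()  # partitions whose output has committed

    def task_succeeded(self, pid):
        self.committed.add(pid)
        # markPartitionCompletedInAllTaskSets analogue: only attempts that
        # already exist get updated -- this is the race window.
        for ts in self.task_sets:
            ts.mark_partition_completed(pid)

sched = Scheduler()
attempt0 = TaskSetManager(4, 0, {9000})
sched.task_sets.append(attempt0)

# Task for partition 9000 succeeds while only attempt 4.0 exists...
sched.task_succeeded(9000)

# ...then attempt 4.1 is created from the stale pending set, which still
# contains 9000 because the taskEnd event has not been processed.
attempt1 = TaskSetManager(4, 1, {9000})
sched.task_sets.append(attempt1)

# Attempt 4.1 now runs a task for a partition whose output already
# committed: every commit is denied, and the task retries repeatedly.
assert 9000 in attempt1.pending and 9000 in sched.committed
```

The model makes the fix's shape visible: either newly created attempts must consult the set of already-committed partitions, or the completion event must be applied to attempts created after it fired.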
[jira] [Commented] (SPARK-21097) Dynamic allocation will preserve cached data
[ https://issues.apache.org/jira/browse/SPARK-21097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16594271#comment-16594271 ] Erik Erlandson commented on SPARK-21097: I'm wondering if this is going to be subsumed by the Shuffle Service redesign proposal. cc [~mcheah] > Dynamic allocation will preserve cached data > > > Key: SPARK-21097 > URL: https://issues.apache.org/jira/browse/SPARK-21097 > Project: Spark > Issue Type: Improvement > Components: Block Manager, Scheduler, Spark Core >Affects Versions: 2.2.0, 2.3.0 >Reporter: Brad >Priority: Major > Attachments: Preserving Cached Data with Dynamic Allocation.pdf > > > We want to use dynamic allocation to distribute resources among many notebook > users on our spark clusters. One difficulty is that if a user has cached data > then we are either prevented from de-allocating any of their executors, or we > are forced to drop their cached data, which can lead to a bad user experience. > We propose adding a feature to preserve cached data by copying it to other > executors before de-allocation. This behavior would be enabled by a simple > spark config. Now when an executor reaches its configured idle timeout, > instead of just killing it on the spot, we will stop sending it new tasks, > replicate all of its rdd blocks onto other executors, and then kill it. If > there is an issue while we replicate the data, like an error, it takes too > long, or there isn't enough space, then we will fall back to the original > behavior and drop the data and kill the executor. > This feature should allow anyone with notebook users to use their cluster > resources more efficiently. Also since it will be completely opt-in it will > unlikely to cause problems for other use cases. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
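The proposed flow above (stop scheduling onto the idle executor, replicate its cached blocks, then kill it, falling back to dropping the data on any failure) can be sketched as a toy model. All names here are hypothetical; this is not Spark's BlockManager API:

```python
# Toy sketch of the proposed decommission-with-replication flow.
class Executor:
    def __init__(self, name, blocks=()):
        self.name = name
        self.blocks = set(blocks)      # cached RDD block ids
        self.accepting_tasks = True

class Cluster:
    def __init__(self, executors):
        self.executors = list(executors)

    def pick_target(self, exclude):
        # Naive placement: any other live executor.
        return next(e for e in self.executors if e is not exclude)

    def kill(self, executor):
        self.executors.remove(executor)

def decommission(executor, cluster):
    executor.accepting_tasks = False            # 1. stop sending new tasks
    try:
        for block in list(executor.blocks):     # 2. replicate cached blocks
            cluster.pick_target(exclude=executor).blocks.add(block)
    except Exception:
        pass  # fallback: replication failed, the data is simply dropped
    cluster.kill(executor)                      # 3. kill the idle executor

a = Executor("a", blocks={"rdd_0_0", "rdd_0_1"})
b = Executor("b")
cluster = Cluster([a, b])
decommission(a, cluster)
assert cluster.executors == [b]
assert b.blocks == {"rdd_0_0", "rdd_0_1"}      # cached data survived
```

The try/except mirrors the proposal's opt-in safety: any error, timeout, or lack of space degrades to today's behavior rather than blocking de-allocation.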
[jira] [Created] (SPARK-25257) java.lang.ClassCastException: org.apache.spark.sql.execution.streaming.SerializedOffset cannot be cast to org.apache.spark.sql.sources.v2.reader.streaming.Offset
Seth Fitzsimmons created SPARK-25257: Summary: java.lang.ClassCastException: org.apache.spark.sql.execution.streaming.SerializedOffset cannot be cast to org.apache.spark.sql.sources.v2.reader.streaming.Offset Key: SPARK-25257 URL: https://issues.apache.org/jira/browse/SPARK-25257 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.3.1 Reporter: Seth Fitzsimmons When resuming from a checkpoint: {code:java} writeStream.option("checkpointLocation", "/tmp/checkpoint").format("console").start {code} The stream reader fails with: {noformat} osmesa.common.streaming.AugmentedDiffMicroBatchReader@59e19287 at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295) at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189) Caused by: java.lang.ClassCastException: org.apache.spark.sql.execution.streaming.SerializedOffset cannot be cast to org.apache.spark.sql.sources.v2.reader.streaming.Offset at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1$$anonfun$apply$9.apply(MicroBatchExecution.scala:405) at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1$$anonfun$apply$9.apply(MicroBatchExecution.scala:390) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) at scala.collection.Iterator$class.foreach(Iterator.scala:891) at scala.collection.AbstractIterator.foreach(Iterator.scala:1334) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at org.apache.spark.sql.execution.streaming.StreamProgress.foreach(StreamProgress.scala:25) at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) at org.apache.spark.sql.execution.streaming.StreamProgress.flatMap(StreamProgress.scala:25) at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:390) at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:390) at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271) at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58) at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:389) at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:133) at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121) at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121) at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271) at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58) at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:121) at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56) at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:117) at 
org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279) ... 1 more {noformat} The root cause appears to be that the `SerializedOffset` (JSON, from disk) is never deserialized; I would expect to see something along the lines of `reader.deserializeOffset(off.json)` here (unless `available` is intended to be deserialized elsewhere): https://github.com/apache/spark/blob/4dc82259d81102e0cb48f4cb2e8075f80d899ac4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L405
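The idea behind the suggested fix can be illustrated with a toy model (Python, hypothetical names; the real code is Scala inside MicroBatchExecution): an offset restored from the checkpoint log is only raw JSON, so it must go through the reader's deserialize step before being used as a concrete offset, instead of being cast directly.

```python
import json

class SerializedOffset:
    """Offset as read back from the checkpoint log: just the JSON string."""
    def __init__(self, raw):
        self.json = raw

class LongOffset:
    """A reader-specific concrete offset type."""
    def __init__(self, value):
        self.value = value

class Reader:
    def deserialize_offset(self, raw):
        # Parse the checkpointed JSON into the concrete offset type --
        # the analogue of reader.deserializeOffset(off.json).
        return LongOffset(json.loads(raw))

def resolve(reader, available):
    # What the casting site should do: deserialize instead of casting.
    # Casting SerializedOffset directly is the reported ClassCastException.
    if isinstance(available, SerializedOffset):
        return reader.deserialize_offset(available.json)
    return available

off = resolve(Reader(), SerializedOffset("42"))
assert isinstance(off, LongOffset) and off.value == 42
```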
[jira] [Resolved] (SPARK-24090) Kubernetes Backend Hotlist for Spark 2.4
[ https://issues.apache.org/jira/browse/SPARK-24090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-24090. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 4 [https://github.com/apache/spark/pull/4] > Kubernetes Backend Hotlist for Spark 2.4 > > > Key: SPARK-24090 > URL: https://issues.apache.org/jira/browse/SPARK-24090 > Project: Spark > Issue Type: Umbrella > Components: Kubernetes, Scheduler >Affects Versions: 2.4.0 >Reporter: Anirudh Ramanathan >Assignee: Anirudh Ramanathan >Priority: Major > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25175) Case-insensitive field resolution when reading from ORC
[ https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16594213#comment-16594213 ] Dongjoon Hyun commented on SPARK-25175: --- Okay, thank you for the details, [~seancxmao]. BTW, for me, the higher purpose of this JIRA is the reverse of its title. What you are aiming at is to support `Case Sensitivity` in Spark, right? In general, `spark.sql.caseSensitive` has been `false` by default since Spark 2.0.0. And, by default, Spark prevents you from generating such data. (not only ORC.) {code} scala> data.write.format("orc").mode("overwrite").save("/tmp/orc") org.apache.spark.sql.AnalysisException: Found duplicate column(s) when inserting into file:/tmp/orc: `c`; {code} Thus, your use case is specifically *generating* data with `spark.sql.caseSensitive=true` and reading it back with `spark.sql.caseSensitive=false` (without problems). Do you really have *case-sensitive* data (schema)? Could you elaborate on why you hit this situation? > Case-insensitive field resolution when reading from ORC > --- > > Key: SPARK-25175 > URL: https://issues.apache.org/jira/browse/SPARK-25175 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Chenxiao Mao >Priority: Major > > SPARK-25132 adds support for case-insensitive field resolution when reading > from Parquet files. We found ORC files have similar issues. Since Spark has 2 > OrcFileFormat implementations, we should add support for both. > * Since SPARK-2883, Spark supports ORC inside the sql/hive module with a Hive > dependency. This hive OrcFileFormat always does case-insensitive field > resolution regardless of case sensitivity mode. When there is ambiguity, hive > OrcFileFormat always returns the lower-case field, rather than failing the > reading operation. > * SPARK-20682 adds a new ORC data source inside sql/core. This native > OrcFileFormat supports case-insensitive field resolution, however it cannot > handle duplicate fields. 
> Besides data source tables, hive serde tables also have issues. When there > are duplicate fields (e.g. c, C), we just can't read hive serde tables. If > there are no duplicate fields, hive serde tables always do case-insensitive > field resolution regardless of case sensitivity mode. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
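The resolution problem discussed above can be sketched in a few lines (a simplified assumption-laden model, not the actual OrcFileFormat logic): with case-insensitive resolution, a requested name can match several physical columns, and a safe reader should refuse to guess rather than silently pick one.

```python
def resolve_field(requested, physical, case_sensitive=False):
    """Return the physical column matching `requested`, or None."""
    if case_sensitive:
        return requested if requested in physical else None
    matches = [f for f in physical if f.lower() == requested.lower()]
    if len(matches) > 1:
        # Safe behaviour: fail on ambiguity. The hive OrcFileFormat instead
        # silently returns the lower-case field, per the description above.
        raise ValueError(f"ambiguous field {requested!r}: matches {matches}")
    return matches[0] if matches else None

assert resolve_field("C", ["c", "d"]) == "c"                    # insensitive hit
assert resolve_field("C", ["c", "d"], case_sensitive=True) is None
try:
    resolve_field("c", ["c", "C"])                              # duplicates: c, C
    raised = False
except ValueError:
    raised = True
assert raised
```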
[jira] [Created] (SPARK-25256) Plan mismatch errors in Hive tests in 2.12
Sean Owen created SPARK-25256: - Summary: Plan mismatch errors in Hive tests in 2.12 Key: SPARK-25256 URL: https://issues.apache.org/jira/browse/SPARK-25256 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.4.0 Reporter: Sean Owen In Hive tests, in the Scala 2.12 build, still seeing a few failures that seem to show mismatching schema inference. Not clear whether it's the same as SPARK-25044. Examples: {code:java} - SPARK-5775 read array from partitioned_parquet_with_key_and_complextypes *** FAILED *** Results do not match for query: Timezone: sun.util.calendar.ZoneInfo[id="America/Los_Angeles",offset=-2880,dstSavings=360,useDaylight=true,transitions=185,lastRule=java.util.SimpleTimeZone[id=America/Los_Angeles,offset=-2880,dstSavings=360,useDaylight=true,startYear=0,startMode=3,startMonth=2,startDay=8,startDayOfWeek=1,startTime=720,startTimeMode=0,endMode=3,endMonth=10,endDay=1,endDayOfWeek=1,endTime=720,endTimeMode=0]] Timezone Env: == Parsed Logical Plan == 'Project ['arrayField, 'p] +- 'Filter ('p = 1) +- 'UnresolvedRelation `partitioned_parquet_with_key_and_complextypes` == Analyzed Logical Plan == arrayField: array, p: int Project [arrayField#82569, p#82570] +- Filter (p#82570 = 1) +- SubqueryAlias `default`.`partitioned_parquet_with_key_and_complextypes` +- Relation[intField#82566,stringField#82567,structField#82568,arrayField#82569,p#82570] parquet == Optimized Logical Plan == Project [arrayField#82569, p#82570] +- Filter (isnotnull(p#82570) && (p#82570 = 1)) +- Relation[intField#82566,stringField#82567,structField#82568,arrayField#82569,p#82570] parquet == Physical Plan == *(1) Project [arrayField#82569, p#82570] +- *(1) FileScan parquet default.partitioned_parquet_with_key_and_complextypes[arrayField#82569,p#82570] Batched: false, Format: Parquet, Location: PrunedInMemoryFileIndex[file:/home/srowen/spark-2.12/sql/hive/target/tmp/spark-d8d87d74-33e7-4f22..., PartitionCount: 1, PartitionFilters: [isnotnull(p#82570), (p#82570 = 1)], 
PushedFilters: [], ReadSchema: struct> == Results == == Results == !== Correct Answer - 10 == == Spark Answer - 10 == !struct<> struct,p:int> ![Range 1 to 1,1] [WrappedArray(1),1] ![Range 1 to 10,1] [WrappedArray(1, 2),1] ![Range 1 to 2,1] [WrappedArray(1, 2, 3),1] ![Range 1 to 3,1] [WrappedArray(1, 2, 3, 4),1] ![Range 1 to 4,1] [WrappedArray(1, 2, 3, 4, 5),1] ![Range 1 to 5,1] [WrappedArray(1, 2, 3, 4, 5, 6),1] ![Range 1 to 6,1] [WrappedArray(1, 2, 3, 4, 5, 6, 7),1] ![Range 1 to 7,1] [WrappedArray(1, 2, 3, 4, 5, 6, 7, 8),1] ![Range 1 to 8,1] [WrappedArray(1, 2, 3, 4, 5, 6, 7, 8, 9),1] ![Range 1 to 9,1] [WrappedArray(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),1] (QueryTest.scala:163){code} {code:java} - SPARK-2693 udaf aggregates test *** FAILED *** Results do not match for query: Timezone: sun.util.calendar.ZoneInfo[id="America/Los_Angeles",offset=-2880,dstSavings=360,useDaylight=true,transitions=185,lastRule=java.util.SimpleTimeZone[id=America/Los_Angeles,offset=-2880,dstSavings=360,useDaylight=true,startYear=0,startMode=3,startMonth=2,startDay=8,startDayOfWeek=1,startTime=720,startTimeMode=0,endMode=3,endMonth=10,endDay=1,endDayOfWeek=1,endTime=720,endTimeMode=0]] Timezone Env: == Parsed Logical Plan == 'GlobalLimit 1 +- 'LocalLimit 1 +- 'Project [unresolvedalias('percentile('key, 'array(1, 1)), None)] +- 'UnresolvedRelation `src` == Analyzed Logical Plan == percentile(key, array(1, 1), 1): array GlobalLimit 1 +- LocalLimit 1 +- Aggregate [percentile(key#205098, cast(array(1, 1) as array), 1, 0, 0) AS percentile(key, array(1, 1), 1)#205101] +- SubqueryAlias `default`.`src` +- HiveTableRelation `default`.`src`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key#205098, value#205099] == Optimized Logical Plan == GlobalLimit 1 +- LocalLimit 1 +- Aggregate [percentile(key#205098, [1.0,1.0], 1, 0, 0) AS percentile(key, array(1, 1), 1)#205101] +- Project [key#205098] +- HiveTableRelation `default`.`src`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key#205098, 
value#205099] == Physical Plan == CollectLimit 1 +- ObjectHashAggregate(keys=[], functions=[percentile(key#205098, [1.0,1.0], 1, 0, 0)], output=[percentile(key, array(1, 1), 1)#205101]) +- Exchange SinglePartition +- ObjectHashAggregate(keys=[], functions=[partial_percentile(key#205098, [1.0,1.0], 1, 0, 0)], output=[buf#205104]) +- Scan hive default.src [key#205098], HiveTableRelation `default`.`src`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key#205098, value#205099] == Results == == Results == !== Correct Answer - 1 == == Spark Answer - 1 == !struct> struct> ![WrappedArray(498, 498)] [WrappedArray(498.0, 498.0)] (QueryTest.scala:163){code}
[jira] [Assigned] (SPARK-25235) Merge the REPL code in Scala 2.11 and 2.12 branches
[ https://issues.apache.org/jira/browse/SPARK-25235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai reassigned SPARK-25235: --- Assignee: DB Tsai > Merge the REPL code in Scala 2.11 and 2.12 branches > --- > > Key: SPARK-25235 > URL: https://issues.apache.org/jira/browse/SPARK-25235 > Project: Spark > Issue Type: Improvement > Components: Spark Shell >Affects Versions: 2.3.1 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Major > > Using some reflection tricks to merge Scala 2.11 and 2.12 codebase -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16281) Implement parse_url SQL function
[ https://issues.apache.org/jira/browse/SPARK-16281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16594169#comment-16594169 ] Apache Spark commented on SPARK-16281: -- User 'TomaszGaweda' has created a pull request for this issue: https://github.com/apache/spark/pull/22249 > Implement parse_url SQL function > > > Key: SPARK-16281 > URL: https://issues.apache.org/jira/browse/SPARK-16281 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Jian Wu >Priority: Major > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25255) Add getActiveSession to SparkSession in PySpark
[ https://issues.apache.org/jira/browse/SPARK-25255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk updated SPARK-25255: Labels: starter (was: ) > Add getActiveSession to SparkSession in PySpark > --- > > Key: SPARK-25255 > URL: https://issues.apache.org/jira/browse/SPARK-25255 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.0 >Reporter: holdenk >Priority: Trivial > Labels: starter > > Add getActiveSession to PySpark session API. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25255) Add getActiveSession to SparkSession in PySpark
holdenk created SPARK-25255: --- Summary: Add getActiveSession to SparkSession in PySpark Key: SPARK-25255 URL: https://issues.apache.org/jira/browse/SPARK-25255 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 2.4.0 Reporter: holdenk Add getActiveSession to PySpark session API. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
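For context, here is a pure-Python sketch of what "active session" tracking generally means (hypothetical code, not the PySpark implementation): the most recently activated session is kept in thread-local storage so that library code can retrieve it without passing it through every call.

```python
import threading

_active = threading.local()  # per-thread slot for the active session

class Session:
    def activate(self):
        # Record this session as the current thread's active one.
        _active.session = self
        return self

    @classmethod
    def get_active_session(cls):
        # Return the active session for this thread, or None if unset.
        return getattr(_active, "session", None)

assert Session.get_active_session() is None   # nothing active yet
s = Session().activate()
assert Session.get_active_session() is s      # now retrievable anywhere
```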
[jira] [Created] (SPARK-25254) docker-image-tool should allow not building images for R and Python
Chaoran Yu created SPARK-25254: -- Summary: docker-image-tool should allow not building images for R and Python Key: SPARK-25254 URL: https://issues.apache.org/jira/browse/SPARK-25254 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 2.4.0 Reporter: Chaoran Yu Suppose a user does not need R, so he builds Spark using make-distribution.sh without R to generate a runnable distribution. Then he runs docker-image-tool.sh to generate Docker images based off of the distribution just produced in the previous step. And then docker-image-tool.sh fails because the R image fails to build. This is counter-intuitive. docker-image-tool.sh builds three images: regular, R and Python. make-distribution.sh allows users to specify whether to build Spark with R or Python, so should docker-image-tool.sh. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
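A hedged sketch of what the proposed opt-out could look like (purely hypothetical flags and names; no such switches existed in docker-image-tool.sh when this was filed): the script could accept switches to skip the R and Python images, mirroring the choices make-distribution.sh offers.

```shell
# Hypothetical opt-out flags for image building: -R skips the R image,
# -P skips the Python image; the JVM-only image is always built.
build_images() {
  OPTIND=1
  build_r=true
  build_py=true
  while getopts "RP" opt "$@"; do
    case "$opt" in
      R) build_r=false ;;   # proposed: skip the R image
      P) build_py=false ;;  # proposed: skip the Python image
    esac
  done
  echo "built: spark"
  if "$build_py"; then echo "built: spark-py"; fi
  if "$build_r"; then echo "built: spark-r"; fi
}

build_images -R   # distribution was built without R
```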
[jira] [Assigned] (SPARK-25253) Refactor pyspark connection & authentication
[ https://issues.apache.org/jira/browse/SPARK-25253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25253: Assignee: (was: Apache Spark) > Refactor pyspark connection & authentication > > > Key: SPARK-25253 > URL: https://issues.apache.org/jira/browse/SPARK-25253 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Priority: Minor > > We've got a few places in pyspark that connect to local sockets, with varying > levels of ipv6 handling, graceful error handling, and lots of copy-and-paste. > should be pretty easy to clean this up. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25253) Refactor pyspark connection & authentication
[ https://issues.apache.org/jira/browse/SPARK-25253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25253: Assignee: Apache Spark > Refactor pyspark connection & authentication > > > Key: SPARK-25253 > URL: https://issues.apache.org/jira/browse/SPARK-25253 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Assignee: Apache Spark >Priority: Minor > > We've got a few places in pyspark that connect to local sockets, with varying > levels of ipv6 handling, graceful error handling, and lots of copy-and-paste. > should be pretty easy to clean this up. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25253) Refactor pyspark connection & authentication
[ https://issues.apache.org/jira/browse/SPARK-25253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16594055#comment-16594055 ] Apache Spark commented on SPARK-25253: -- User 'squito' has created a pull request for this issue: https://github.com/apache/spark/pull/22247 > Refactor pyspark connection & authentication > > > Key: SPARK-25253 > URL: https://issues.apache.org/jira/browse/SPARK-25253 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Priority: Minor > > We've got a few places in pyspark that connect to local sockets, with varying > levels of ipv6 handling, graceful error handling, and lots of copy-and-paste. > should be pretty easy to clean this up. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
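A minimal sketch of the kind of shared helper such a cleanup could introduce (function and parameter names are hypothetical, not PySpark's actual internals): one function that handles both IPv4 and IPv6 resolution plus an optional authentication step, replacing the per-call-site copies.

```python
import socket

def connect_local(port, auth_secret=None, timeout=15.0):
    """Connect to a local server socket, working on IPv4 or IPv6."""
    # socket.create_connection() walks every address getaddrinfo() returns
    # for "localhost", so IPv6-only and IPv4-only loopbacks both work
    # without any per-call-site special-casing.
    sock = socket.create_connection(("localhost", port), timeout=timeout)
    if auth_secret is not None:
        # Toy stand-in for the authentication handshake.
        sock.sendall(auth_secret.encode("utf-8"))
    return sock
```

Centralizing the connect/auth logic also gives one place to add graceful error handling (timeouts, refused connections) instead of each caller reinventing it.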
[jira] [Updated] (SPARK-25250) Race condition with tasks running when new attempt for same stage is created leads to other task in the next attempt running on the same partition id retry multiple times
[ https://issues.apache.org/jira/browse/SPARK-25250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-25250: -- Description: We recently had a scenario where a race condition occurred when a task from previous stage attempt just finished before new attempt for the same stage was created due to fetch failure, so the new task created in the second attempt on the same partition id was retrying multiple times due to TaskCommitDenied Exception without realizing that the task in earlier attempt was already successful. For example, consider a task with partition id 9000 and index 9000 running in stage 4.0. We see a fetch failure so thus, we spawn a new stage attempt 4.1. Just within this timespan, the above task completes successfully, thus, marking the partition id 9000 as complete for 4.0. However, as stage 4.1 has not yet been created, the taskset info for that stage is not available to the TaskScheduler so, naturally, the partition id 9000 has not been marked completed for 4.1. Stage 4.1 now spawns task with index 2000 on the same partition id 9000. This task fails due to CommitDeniedException and since, it does not see the corresponding partition id as been marked successful, it keeps retrying multiple times until the job finally succeeds. It doesn't cause any job failures because the DAG scheduler is tracking the partitions separate from the task set managers. Steps to Reproduce: # Run any large job involving shuffle operation. # When the ShuffleMap stage finishes and the ResultStage begins running, cause this stage to throw a fetch failure exception(Try deleting certain shuffle files on any host). # Observe the task attempt numbers for the next stage attempt. Please note that this issue is an intermittent one, so it might not happen all the time. 
was: We recently had a scenario where a race condition occurred when a task from previous stage attempt just finished before new attempt for the same stage was created due to fetch failure, so the new task created in the second attempt on the same partition id was retrying multiple times due to TaskCommitDenied Exception without realizing that the task in earlier attempt was already successful. For example, consider a task with partition id 9000 and index 9000 running in stage 4.0. We see a fetch failure so thus, we spawn a new stage attempt 4.1. Just within this timespan, the above task completes successfully, thus, marking the partition id 9000 as complete for 4.0. However, as stage 4.1 has not yet been created, the taskset info for that stage is not available to the TaskScheduler so, naturally, the partition id 9000 has not been marked completed for 4.1. Stage 4.1 now spawns task with index 2000 on the same partition id 9000. This task fails due to CommitDeniedException and since, it does not see the corresponding partition id as been marked successful, it keeps retrying multiple times. Steps to Reproduce: # Run any large job involving shuffle operation. # When the ShuffleMap stage finishes and the ResultStage begins running, cause this stage to throw a fetch failure exception(Try deleting certain shuffle files on any host). # Observe the task attempt numbers for the next stage attempt. Please note that this issue is an intermittent one, so it might not happen all the time. 
> Race condition with tasks running when new attempt for same stage is created > leads to other task in the next attempt running on the same partition id > retry multiple times > -- > > Key: SPARK-25250 > URL: https://issues.apache.org/jira/browse/SPARK-25250 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 2.3.1 >Reporter: Parth Gandhi >Priority: Major > > We recently had a scenario where a race condition occurred when a task from > previous stage attempt just finished before new attempt for the same stage > was created due to fetch failure, so the new task created in the second > attempt on the same partition id was retrying multiple times due to > TaskCommitDenied Exception without realizing that the task in earlier attempt > was already successful. > For example, consider a task with partition id 9000 and index 9000 running in > stage 4.0. We see a fetch failure so thus, we spawn a new stage attempt 4.1. > Just within this timespan, the above task completes successfully, thus, > marking the partition id 9000 as complete for 4.0. However, as stage 4.1 has > not yet been created, the taskset info for that stage is not available to the > TaskScheduler so, naturally, the partition id 9000 has not been marked > completed for 4.1. St
[jira] [Commented] (SPARK-24434) Support user-specified driver and executor pod templates
[ https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16594046#comment-16594046 ] Henry Robinson commented on SPARK-24434: Yeah, assignees are set after the PR is merged. I think the idea is to prevent someone from assigning an issue to themselves, effectively taking a lock, and never actually making progress. > Support user-specified driver and executor pod templates > > > Key: SPARK-24434 > URL: https://issues.apache.org/jira/browse/SPARK-24434 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Yinan Li >Priority: Major > > With more requests for customizing the driver and executor pods coming, the > current approach of adding new Spark configuration options has some serious > drawbacks: 1) it means more Kubernetes specific configuration options to > maintain, and 2) it widens the gap between the declarative model used by > Kubernetes and the configuration model used by Spark. We should start > designing a solution that allows users to specify pod templates as central > places for all customization needs for the driver and executor pods. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24434) Support user-specified driver and executor pod templates
[ https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16594045#comment-16594045 ] Stavros Kontopoulos commented on SPARK-24434: - Maybe it is added when the issue is resolved or something. > Support user-specified driver and executor pod templates > > > Key: SPARK-24434 > URL: https://issues.apache.org/jira/browse/SPARK-24434 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Yinan Li >Priority: Major > > With more requests for customizing the driver and executor pods coming, the > current approach of adding new Spark configuration options has some serious > drawbacks: 1) it means more Kubernetes specific configuration options to > maintain, and 2) it widens the gap between the declarative model used by > Kubernetes and the configuration model used by Spark. We should start > designing a solution that allows users to specify pod templates as central > places for all customization needs for the driver and executor pods. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25250) Race condition with tasks running when new attempt for same stage is created leads to other task in the next attempt running on the same partition id retry multiple time
[ https://issues.apache.org/jira/browse/SPARK-25250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-25250: -- Priority: Major (was: Minor) > Race condition with tasks running when new attempt for same stage is created > leads to other task in the next attempt running on the same partition id > retry multiple times > -- > > Key: SPARK-25250 > URL: https://issues.apache.org/jira/browse/SPARK-25250 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 2.3.1 >Reporter: Parth Gandhi >Priority: Major > > We recently had a scenario where a race condition occurred when a task from > the previous stage attempt finished just before a new attempt for the same stage > was created due to fetch failure, so the new task created in the second > attempt on the same partition id was retrying multiple times due to > TaskCommitDenied Exception without realizing that the task in the earlier attempt > was already successful. > For example, consider a task with partition id 9000 and index 9000 running in > stage 4.0. We see a fetch failure, so we spawn a new stage attempt 4.1. > Just within this timespan, the above task completes successfully, thus > marking partition id 9000 as complete for 4.0. However, as stage 4.1 has > not yet been created, the taskset info for that stage is not available to the > TaskScheduler so, naturally, partition id 9000 has not been marked > completed for 4.1. Stage 4.1 now spawns a task with index 2000 on the same > partition id 9000. This task fails due to CommitDeniedException and since it > does not see the corresponding partition id as having been marked successful, it > keeps retrying multiple times. > > Steps to Reproduce: > # Run any large job involving a shuffle operation. > # When the ShuffleMap stage finishes and the ResultStage begins running, > cause this stage to throw a fetch failure exception (try deleting certain > shuffle files on any host). 
> # Observe the task attempt numbers for the next stage attempt. Please note > that this issue is an intermittent one, so it might not happen all the time. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
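The race above can be modeled in a few lines of plain Python (an illustration with invented names, not Spark's actual TaskSetManager code): if the partitions already committed under attempt 4.0 are not carried into attempt 4.1's task set, the new attempt keeps retrying partition 9000.

```python
# Toy model (invented names, not Spark's scheduler code) of SPARK-25250:
# a partition committed under stage attempt 4.0 is invisible to attempt
# 4.1 unless completions are carried over when the new task set is built.

class StageAttempt:
    def __init__(self, completed=None):
        # Partitions already known to be committed for this attempt.
        self.completed = set(completed or ())

    def should_retry(self, partition_id):
        # A task denied the commit is retried only if the partition is
        # not already marked successful for this attempt.
        return partition_id not in self.completed

attempt_4_0 = StageAttempt()
attempt_4_0.completed.add(9000)  # the old task finishes during the window

buggy_4_1 = StageAttempt()                        # completions lost
fixed_4_1 = StageAttempt(attempt_4_0.completed)   # completions carried over

print(buggy_4_1.should_retry(9000))  # True  -> endless retries
print(fixed_4_1.should_retry(9000))  # False -> no retry needed
```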
[jira] [Commented] (SPARK-24434) Support user-specified driver and executor pod templates
[ https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16594038#comment-16594038 ] Yinan Li commented on SPARK-24434: -- It seemed I couldn't change the assignee. > Support user-specified driver and executor pod templates > > > Key: SPARK-24434 > URL: https://issues.apache.org/jira/browse/SPARK-24434 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Yinan Li >Priority: Major > > With more requests for customizing the driver and executor pods coming, the > current approach of adding new Spark configuration options has some serious > drawbacks: 1) it means more Kubernetes specific configuration options to > maintain, and 2) it widens the gap between the declarative model used by > Kubernetes and the configuration model used by Spark. We should start > designing a solution that allows users to specify pod templates as central > places for all customization needs for the driver and executor pods. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25235) Merge the REPL code in Scala 2.11 and 2.12 branches
[ https://issues.apache.org/jira/browse/SPARK-25235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16594036#comment-16594036 ] Apache Spark commented on SPARK-25235: -- User 'dbtsai' has created a pull request for this issue: https://github.com/apache/spark/pull/22246 > Merge the REPL code in Scala 2.11 and 2.12 branches > --- > > Key: SPARK-25235 > URL: https://issues.apache.org/jira/browse/SPARK-25235 > Project: Spark > Issue Type: Improvement > Components: Spark Shell >Affects Versions: 2.3.1 >Reporter: DB Tsai >Priority: Major > > Using some reflection tricks to merge Scala 2.11 and 2.12 codebase -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25235) Merge the REPL code in Scala 2.11 and 2.12 branches
[ https://issues.apache.org/jira/browse/SPARK-25235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25235: Assignee: (was: Apache Spark) > Merge the REPL code in Scala 2.11 and 2.12 branches > --- > > Key: SPARK-25235 > URL: https://issues.apache.org/jira/browse/SPARK-25235 > Project: Spark > Issue Type: Improvement > Components: Spark Shell >Affects Versions: 2.3.1 >Reporter: DB Tsai >Priority: Major > > Using some reflection tricks to merge Scala 2.11 and 2.12 codebase -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25235) Merge the REPL code in Scala 2.11 and 2.12 branches
[ https://issues.apache.org/jira/browse/SPARK-25235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25235: Assignee: Apache Spark > Merge the REPL code in Scala 2.11 and 2.12 branches > --- > > Key: SPARK-25235 > URL: https://issues.apache.org/jira/browse/SPARK-25235 > Project: Spark > Issue Type: Improvement > Components: Spark Shell >Affects Versions: 2.3.1 >Reporter: DB Tsai >Assignee: Apache Spark >Priority: Major > > Using some reflection tricks to merge Scala 2.11 and 2.12 codebase -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24434) Support user-specified driver and executor pod templates
[ https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16594024#comment-16594024 ] Stavros Kontopoulos commented on SPARK-24434: - Thanks [~liyinan926] I am reviewing the PR. Still this is not assigned though, pls assign to [~yifeih]. Anyway, this is not about who does what but about being fair. Never understood the intentions here. > Support user-specified driver and executor pod templates > > > Key: SPARK-24434 > URL: https://issues.apache.org/jira/browse/SPARK-24434 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Yinan Li >Priority: Major > > With more requests for customizing the driver and executor pods coming, the > current approach of adding new Spark configuration options has some serious > drawbacks: 1) it means more Kubernetes specific configuration options to > maintain, and 2) it widens the gap between the declarative model used by > Kubernetes and the configuration model used by Spark. We should start > designing a solution that allows users to specify pod templates as central > places for all customization needs for the driver and executor pods. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25253) Refactor pyspark connection & authentication
Imran Rashid created SPARK-25253: Summary: Refactor pyspark connection & authentication Key: SPARK-25253 URL: https://issues.apache.org/jira/browse/SPARK-25253 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 2.4.0 Reporter: Imran Rashid We've got a few places in pyspark that connect to local sockets, with varying levels of ipv6 handling, graceful error handling, and lots of copy-and-paste. should be pretty easy to clean this up. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
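The consolidation the ticket proposes could look roughly like this (a sketch with an invented helper name, not the actual patch): one shared function that resolves localhost over both IPv4 and IPv6 and reports connection failures uniformly, instead of copy-pasted connect logic at each call site.

```python
import socket

def connect_to_local_port(port, timeout=15.0):
    """Sketch of a shared local-connect helper (hypothetical name, not
    the real SPARK-25253 patch): try every address family getaddrinfo
    offers for localhost and fail with one consolidated error message."""
    errors = []
    # getaddrinfo returns candidate addresses for both IPv4 and IPv6.
    for family, socktype, proto, _, addr in socket.getaddrinfo(
            "localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM):
        sock = socket.socket(family, socktype, proto)
        sock.settimeout(timeout)
        try:
            sock.connect(addr)
            return sock
        except OSError as exc:
            sock.close()
            errors.append(str(exc))
    raise OSError("could not connect to localhost:%d: %s"
                  % (port, "; ".join(errors)))
```

Any pyspark call site that currently hand-rolls its socket setup could then call this one helper, which is the kind of cleanup the ticket describes.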
[jira] [Assigned] (SPARK-25252) Support arrays of any types in to_json
[ https://issues.apache.org/jira/browse/SPARK-25252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25252: Assignee: Apache Spark > Support arrays of any types in to_json > -- > > Key: SPARK-25252 > URL: https://issues.apache.org/jira/browse/SPARK-25252 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Minor > > Need to improve the to_json function and make it more consistent with > from_json by supporting arrays of any types (as root types). For now, it > supports only arrays of structs and arrays of maps. After the changes the > following code should work: > {code:scala} > select to_json(array('1','2','3')) > > ["1","2","3"] > select to_json(array(array(1,2,3),array(4))) > > [[1,2,3],[4]] > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25252) Support arrays of any types in to_json
[ https://issues.apache.org/jira/browse/SPARK-25252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25252: Assignee: (was: Apache Spark) > Support arrays of any types in to_json > -- > > Key: SPARK-25252 > URL: https://issues.apache.org/jira/browse/SPARK-25252 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Priority: Minor > > Need to improve the to_json function and make it more consistent with > from_json by supporting arrays of any types (as root types). For now, it > supports only arrays of structs and arrays of maps. After the changes the > following code should work: > {code:scala} > select to_json(array('1','2','3')) > > ["1","2","3"] > select to_json(array(array(1,2,3),array(4))) > > [[1,2,3],[4]] > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25252) Support arrays of any types in to_json
[ https://issues.apache.org/jira/browse/SPARK-25252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16594005#comment-16594005 ] Apache Spark commented on SPARK-25252: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/6 > Support arrays of any types in to_json > -- > > Key: SPARK-25252 > URL: https://issues.apache.org/jira/browse/SPARK-25252 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Priority: Minor > > Need to improve the to_json function and make it more consistent with > from_json by supporting arrays of any types (as root types). For now, it > supports only arrays of structs and arrays of maps. After the changes the > following code should work: > {code:scala} > select to_json(array('1','2','3')) > > ["1","2","3"] > select to_json(array(array(1,2,3),array(4))) > > [[1,2,3],[4]] > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25252) Support arrays of any types in to_json
Maxim Gekk created SPARK-25252: -- Summary: Support arrays of any types in to_json Key: SPARK-25252 URL: https://issues.apache.org/jira/browse/SPARK-25252 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.1 Reporter: Maxim Gekk Need to improve the to_json function and make it more consistent with from_json by supporting arrays of any types (as root types). For now, it supports only arrays of structs and arrays of maps. After the changes the following code should work: {code:scala} select to_json(array('1','2','3')) > ["1","2","3"] select to_json(array(array(1,2,3),array(4))) > [[1,2,3],[4]] {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
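The expected results quoted above can be mimicked with Python's standard json module (this only illustrates the target serialization shape; it is not Spark code):

```python
import json

# An array of strings as the root type serializes element-by-element,
# matching the first expected result in the ticket:
print(json.dumps(["1", "2", "3"], separators=(",", ":")))
# ["1","2","3"]

# Nested arrays as the root type, matching the second expected result:
print(json.dumps([[1, 2, 3], [4]], separators=(",", ":")))
# [[1,2,3],[4]]
```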
[jira] [Commented] (SPARK-24882) data source v2 API improvement
[ https://issues.apache.org/jira/browse/SPARK-24882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593951#comment-16593951 ] Apache Spark commented on SPARK-24882: -- User 'jose-torres' has created a pull request for this issue: https://github.com/apache/spark/pull/22245 > data source v2 API improvement > -- > > Key: SPARK-24882 > URL: https://issues.apache.org/jira/browse/SPARK-24882 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 2.4.0 > > > Data source V2 has been out for a while; see the SPIP > [here|https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit?usp=sharing]. > We have already migrated most of the built-in streaming data sources to the > V2 API, and the file source migration is in progress. During the migration, > we found several problems and want to address them before we stabilize the V2 > API. > To solve these problems, we need to separate responsibilities in the data > source v2 API, isolate the stateful part of the API, and think of better names > for some interfaces. For details, please see the attached Google doc: > https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?usp=sharing -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25249) Add a unit test for OpenHashMap
[ https://issues.apache.org/jira/browse/SPARK-25249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-25249: - Assignee: liuxian > Add a unit test for OpenHashMap > --- > > Key: SPARK-25249 > URL: https://issues.apache.org/jira/browse/SPARK-25249 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 2.4.0 >Reporter: liuxian >Assignee: liuxian >Priority: Trivial > Fix For: 2.4.0 > > > Adding a unit test for OpenHashMap; this can help developers distinguish > between the default values 0/0.0/0L and non-existent values. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25249) Add a unit test for OpenHashMap
[ https://issues.apache.org/jira/browse/SPARK-25249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-25249. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 22241 [https://github.com/apache/spark/pull/22241] > Add a unit test for OpenHashMap > --- > > Key: SPARK-25249 > URL: https://issues.apache.org/jira/browse/SPARK-25249 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 2.4.0 >Reporter: liuxian >Assignee: liuxian >Priority: Trivial > Fix For: 2.4.0 > > > Adding a unit test for OpenHashMap; this can help developers distinguish > between the default values 0/0.0/0L and non-existent values. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25249) Add a unit test for OpenHashMap
[ https://issues.apache.org/jira/browse/SPARK-25249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-25249: -- Priority: Trivial (was: Minor) I don't think this kind of thing needs a JIRA > Add a unit test for OpenHashMap > --- > > Key: SPARK-25249 > URL: https://issues.apache.org/jira/browse/SPARK-25249 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 2.4.0 >Reporter: liuxian >Priority: Trivial > > Adding a unit test for OpenHashMap; this can help developers distinguish > between the default values 0/0.0/0L and non-existent values. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25240) A deadlock in ALTER TABLE RECOVER PARTITIONS
[ https://issues.apache.org/jira/browse/SPARK-25240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25240: Target Version/s: 2.4.0 > A deadlock in ALTER TABLE RECOVER PARTITIONS > > > Key: SPARK-25240 > URL: https://issues.apache.org/jira/browse/SPARK-25240 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Blocker > > Recover Partitions in ALTER TABLE is performed in a recursive way by calling > the scanPartitions() method. scanPartitions() lists files sequentially, or in > parallel if the > [condition|https://github.com/apache/spark/blob/131ca146ed390cd0109cd6e8c95b61e418507080/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L685] > is true: > {code:scala} > partitionNames.length > 1 && statuses.length > threshold || > partitionNames.length > 2 > {code} > Parallel listing is executed on [the fixed thread > pool|https://github.com/apache/spark/blob/131ca146ed390cd0109cd6e8c95b61e418507080/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L622] > which can have 8 threads in total. A deadlock occurs when all 8 threads are > already occupied and a recursive call of scanPartitions() submits new > parallel file listing. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
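The pool-exhaustion scenario can be reproduced outside Spark with any fixed-size pool. This Python sketch (an illustration only, not the Scala code in ddl.scala) uses a 1-thread pool as a stand-in for the 8-thread pool, and a timeout in place of the indefinite hang:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Minimal reproduction of the hazard: every worker of a fixed-size pool
# blocks waiting on work that can only run on that same pool.
pool = ThreadPoolExecutor(max_workers=1)   # stand-in for the 8-thread pool

def scan(depth):
    if depth == 0:
        return 1
    # Recursive submission to the already-saturated pool: the child task
    # cannot start while its parent occupies the only worker.
    child = pool.submit(scan, depth - 1)
    return child.result(timeout=2)         # give up instead of hanging forever

outer = pool.submit(scan, 1)
try:
    outer.result(timeout=5)
    deadlocked = False
except FutureTimeout:
    deadlocked = True

pool.shutdown()
print(deadlocked)  # True: without the timeouts this would hang indefinitely
```

The usual fix for this shape of bug is either an unbounded/work-stealing pool or keeping the recursive step on the caller's thread, so recursion never consumes an extra pool slot.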
[jira] [Updated] (SPARK-25240) A deadlock in ALTER TABLE RECOVER PARTITIONS
[ https://issues.apache.org/jira/browse/SPARK-25240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25240: Priority: Blocker (was: Major) > A deadlock in ALTER TABLE RECOVER PARTITIONS > > > Key: SPARK-25240 > URL: https://issues.apache.org/jira/browse/SPARK-25240 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Blocker > > Recover Partitions in ALTER TABLE is performed in a recursive way by calling > the scanPartitions() method. scanPartitions() lists files sequentially, or in > parallel if the > [condition|https://github.com/apache/spark/blob/131ca146ed390cd0109cd6e8c95b61e418507080/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L685] > is true: > {code:scala} > partitionNames.length > 1 && statuses.length > threshold || > partitionNames.length > 2 > {code} > Parallel listing is executed on [the fixed thread > pool|https://github.com/apache/spark/blob/131ca146ed390cd0109cd6e8c95b61e418507080/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L622] > which can have 8 threads in total. A deadlock occurs when all 8 threads are > already occupied and a recursive call of scanPartitions() submits new > parallel file listing. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25251) Make spark-csv's `quote` and `escape` options conform to RFC 4180
Ruslan Dautkhanov created SPARK-25251: - Summary: Make spark-csv's `quote` and `escape` options conform to RFC 4180 Key: SPARK-25251 URL: https://issues.apache.org/jira/browse/SPARK-25251 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 2.3.1, 2.3.0, 2.4.0, 3.0.0 Reporter: Ruslan Dautkhanov As described in [RFC-4180|https://tools.ietf.org/html/rfc4180], page 2 - {noformat} 7. If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote {noformat} That's what Excel does, for example, by default. In Spark (as of Spark 2.1), however, escaping is done by default in a non-RFC way, using a backslash (\). To fix this you have to explicitly tell Spark to use the double quote as the escape character: {code} .option('quote', '"') .option('escape', '"') {code} This ensures that a comma character is not interpreted as a delimiter when it appears inside a quoted column. So this is a request to make the spark-csv reader RFC-4180 compatible with regard to the default option values for `quote` and `escape` (make both equal to " ). Since this is a backward-incompatible change, Spark 3.0 might be a good release for it. Some more background - https://stackoverflow.com/a/45138591/470583 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
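The RFC 4180 convention quoted above can be demonstrated with Python's standard csv module, whose default dialect escapes an embedded double quote by doubling it rather than with a backslash:

```python
import csv
import io

# Python's csv module follows the RFC 4180 convention the ticket asks
# for: a double quote inside a quoted field is escaped by doubling it.
buf = io.StringIO()
csv.writer(buf).writerow(['he said "hi", twice', 'plain'])
print(buf.getvalue().strip())
# "he said ""hi"", twice",plain

# Reading it back recovers both the embedded quote and the comma:
row = next(csv.reader(io.StringIO(buf.getvalue())))
print(row)
# ['he said "hi", twice', 'plain']
```

A backslash-escaping reader (Spark's current default) would not round-trip this output, which is the interoperability problem the ticket describes.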
[jira] [Commented] (SPARK-25091) UNCACHE TABLE, CLEAR CACHE, rdd.unpersist() does not clean up executor memory
[ https://issues.apache.org/jira/browse/SPARK-25091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593919#comment-16593919 ] Yunling Cai commented on SPARK-25091: - Thanks [~Chao Fang] for working on this! I have changed the ticket title. Quick question: does this mean this is just a UI issue where executor information was shown incorrectly? We saw the cached tables start spilling to disk even though we had uncached the previous copy. We also started seeing duplicate entries on the Storage tab for the same table, which is why we think the memory cleanup may have actual problems. Steps to reproduce: CACHE TABLE A UNCACHE TABLE A CACHE TABLE A REFRESH TABLE has a similar behavior. Thanks! > UNCACHE TABLE, CLEAR CACHE, rdd.unpersist() does not clean up executor memory > - > > Key: SPARK-25091 > URL: https://issues.apache.org/jira/browse/SPARK-25091 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Yunling Cai >Priority: Critical > > UNCACHE TABLE and CLEAR CACHE does not clean up executor memory. > Through Spark UI, although in Storage, we see the cached table removed. In > Executor, the executors continue to hold the RDD and the memory is not > cleared. This results in huge waste in executor memory usage. As we call > CACHE TABLE, we run into issues where the cached tables are spilled to disk > instead of reclaiming the memory storage. > Steps to reproduce: > CACHE TABLE test.test_cache; > UNCACHE TABLE test.test_cache; > == Storage shows table is not cached; Executor shows the executor storage > memory does not change == > CACHE TABLE test.test_cache; > CLEAR CACHE; > == Storage shows table is not cached; Executor shows the executor storage > memory does not change == > Similar behavior when using pyspark df.unpersist(). 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23874) Upgrade apache/arrow to 0.10.0
[ https://issues.apache.org/jira/browse/SPARK-23874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593913#comment-16593913 ] Xiao Li commented on SPARK-23874: - ping [~bryanc] again. > Upgrade apache/arrow to 0.10.0 > -- > > Key: SPARK-23874 > URL: https://issues.apache.org/jira/browse/SPARK-23874 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Bryan Cutler >Priority: Major > Fix For: 2.4.0 > > > Version 0.10.0 will allow for the following improvements and bug fixes: > * Allow for adding BinaryType support ARROW-2141 > * Bug fix related to array serialization ARROW-1973 > * Python2 str will be made into an Arrow string instead of bytes ARROW-2101 > * Python bytearrays are supported in as input to pyarrow ARROW-2141 > * Java has common interface for reset to cleanup complex vectors in Spark > ArrowWriter ARROW-1962 > * Cleanup pyarrow type equality checks ARROW-2423 > * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, > ARROW-2645 > * Improved low level handling of messages for RecordBatch ARROW-2704 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25091) UNCACHE TABLE, CLEAR CACHE, rdd.unpersist() does not clean up executor memory
[ https://issues.apache.org/jira/browse/SPARK-25091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yunling Cai updated SPARK-25091: Summary: UNCACHE TABLE, CLEAR CACHE, rdd.unpersist() does not clean up executor memory (was: Spark Thrift Server: UNCACHE TABLE and CLEAR CACHE does not clean up executor memory) > UNCACHE TABLE, CLEAR CACHE, rdd.unpersist() does not clean up executor memory > - > > Key: SPARK-25091 > URL: https://issues.apache.org/jira/browse/SPARK-25091 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Yunling Cai >Priority: Critical > > UNCACHE TABLE and CLEAR CACHE does not clean up executor memory. > Through Spark UI, although in Storage, we see the cached table removed. In > Executor, the executors continue to hold the RDD and the memory is not > cleared. This results in huge waste in executor memory usage. As we call > CACHE TABLE, we run into issues where the cached tables are spilled to disk > instead of reclaiming the memory storage. > Steps to reproduce: > CACHE TABLE test.test_cache; > UNCACHE TABLE test.test_cache; > == Storage shows table is not cached; Executor shows the executor storage > memory does not change == > CACHE TABLE test.test_cache; > CLEAR CACHE; > == Storage shows table is not cached; Executor shows the executor storage > memory does not change == > Similar behavior when using pyspark df.unpersist(). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25091) Spark Thrift Server: UNCACHE TABLE and CLEAR CACHE does not clean up executor memory
[ https://issues.apache.org/jira/browse/SPARK-25091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593908#comment-16593908 ] Chao Fang commented on SPARK-25091: --- Hi [~dongjoon], the AppStatusListener class uses LiveRDD and LiveExecutor to keep track of RDD and Executor info, respectively. Though LiveRDD tracks RDD info correctly whether we cache or unpersist the RDD, LiveExecutor fails to track the Executor info when we unpersist an RDD; that is why the Executor tab in the Spark UI does not change when we unpersist an RDD or UNCACHE a table. I will soon create a PR for this issue. But I don't know how to change the title. > Spark Thrift Server: UNCACHE TABLE and CLEAR CACHE does not clean up executor > memory > > > Key: SPARK-25091 > URL: https://issues.apache.org/jira/browse/SPARK-25091 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Yunling Cai >Priority: Critical > > UNCACHE TABLE and CLEAR CACHE does not clean up executor memory. > Through Spark UI, although in Storage, we see the cached table removed. In > Executor, the executors continue to hold the RDD and the memory is not > cleared. This results in huge waste in executor memory usage. As we call > CACHE TABLE, we run into issues where the cached tables are spilled to disk > instead of reclaiming the memory storage. > Steps to reproduce: > CACHE TABLE test.test_cache; > UNCACHE TABLE test.test_cache; > == Storage shows table is not cached; Executor shows the executor storage > memory does not change == > CACHE TABLE test.test_cache; > CLEAR CACHE; > == Storage shows table is not cached; Executor shows the executor storage > memory does not change == > Similar behavior when using pyspark df.unpersist(). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-25175) Case-insensitive field resolution when reading from ORC
[ https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593200#comment-16593200 ] Chenxiao Mao edited comment on SPARK-25175 at 8/27/18 3:53 PM: --- Also here is similar investigation I did for parquet tables. Just for your information: [https://github.com/apache/spark/pull/22184/files#r212405373] was (Author: seancxmao): Also here is the similar investigation I did for parquet tables. Just for your information: https://github.com/apache/spark/pull/22184/files#r212405373 > Case-insensitive field resolution when reading from ORC > --- > > Key: SPARK-25175 > URL: https://issues.apache.org/jira/browse/SPARK-25175 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Chenxiao Mao >Priority: Major > > SPARK-25132 adds support for case-insensitive field resolution when reading > from Parquet files. We found ORC files have similar issues. Since Spark has 2 > OrcFileFormat, we should add support for both. > * Since SPARK-2883, Spark supports ORC inside sql/hive module with Hive > dependency. This hive OrcFileFormat always do case-insensitive field > resolution regardless of case sensitivity mode. When there is ambiguity, hive > OrcFileFormat always returns the lower case field, rather than failing the > reading operation. > * SPARK-20682 adds a new ORC data source inside sql/core. This native > OrcFileFormat supports case-insensitive field resolution, however it cannot > handle duplicate fields. > Besides data source tables, hive serde tables also have issues. When there > are duplicate fields (e.g. c, C), we just can't read hive serde tables. If > there are no duplicate fields, hive serde tables always do case-insensitive > field resolution regardless of case sensitivity mode. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
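The resolution policy SPARK-25175 argues for can be sketched in plain Python (illustrative only, with an invented function name; not Spark's OrcFileFormat code): match field names case-insensitively, but fail on ambiguity instead of silently returning the lower-case field:

```python
# Sketch of case-insensitive field resolution that rejects ambiguity
# (hypothetical helper, not Spark's actual resolver).
def resolve_field(requested, file_fields, case_sensitive=False):
    if case_sensitive:
        return requested if requested in file_fields else None
    matches = [f for f in file_fields if f.lower() == requested.lower()]
    if len(matches) > 1:
        # Duplicate fields like c and C: fail loudly rather than pick one.
        raise ValueError("ambiguous reference %r: matches %s"
                         % (requested, matches))
    return matches[0] if matches else None

print(resolve_field("a", ["A", "b"]))  # 'A' under case-insensitive mode
# resolve_field("c", ["c", "C"]) raises ValueError
```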
[jira] [Commented] (SPARK-25213) DataSourceV2 doesn't seem to produce unsafe rows
[ https://issues.apache.org/jira/browse/SPARK-25213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593858#comment-16593858 ] Apache Spark commented on SPARK-25213: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/22244 > DataSourceV2 doesn't seem to produce unsafe rows > - > > Key: SPARK-25213 > URL: https://issues.apache.org/jira/browse/SPARK-25213 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Li Jin >Priority: Major > > Reproduce (Need to compile test-classes): > bin/pyspark --driver-class-path sql/core/target/scala-2.11/test-classes > {code:java} > datasource_v2_df = spark.read \ > .format("org.apache.spark.sql.sources.v2.SimpleDataSourceV2") > \ > .load() > result = datasource_v2_df.withColumn('x', udf(lambda x: x, > 'int')(datasource_v2_df['i'])) > result.show() > {code} > The above code fails with: > {code:java} > Caused by: java.lang.ClassCastException: > org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast > to org.apache.spark.sql.catalyst.expressions.UnsafeRow > at > org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1$$anonfun$5.apply(EvalPythonExec.scala:127) > at > org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1$$anonfun$5.apply(EvalPythonExec.scala:126) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:410) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:410) > {code} > Seems like Data Source V2 doesn't produce unsafeRows here. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24721) Failed to use PythonUDF with literal inputs in filter with data sources
[ https://issues.apache.org/jira/browse/SPARK-24721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593857#comment-16593857 ] Apache Spark commented on SPARK-24721: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/22244 > Failed to use PythonUDF with literal inputs in filter with data sources > --- > > Key: SPARK-24721 > URL: https://issues.apache.org/jira/browse/SPARK-24721 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.3.1 >Reporter: Xiao Li >Priority: Major > > {code} > import random > from pyspark.sql.functions import * > from pyspark.sql.types import * > def random_probability(label): > if label == 1.0: > return random.uniform(0.5, 1.0) > else: > return random.uniform(0.0, 0.4999) > def randomize_label(ratio): > > if random.random() >= ratio: > return 1.0 > else: > return 0.0 > random_probability = udf(random_probability, DoubleType()) > randomize_label = udf(randomize_label, DoubleType()) > spark.range(10).write.mode("overwrite").format('csv').save("/tmp/tab3") > babydf = spark.read.csv("/tmp/tab3") > data_modified_label = babydf.withColumn( > 'random_label', randomize_label(lit(1 - 0.1)) > ) > data_modified_random = data_modified_label.withColumn( > 'random_probability', > random_probability(col('random_label')) > ) > data_modified_label.filter(col('random_label') == 0).show() > {code} > The above code will generate the following exception: > {code} > Py4JJavaError: An error occurred while calling o446.showString. > : java.lang.RuntimeException: Invalid PythonUDF randomize_label(0.9), > requires attributes from more than one child. 
> at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:166) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:165) > at scala.collection.immutable.Stream.foreach(Stream.scala:594) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:165) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:116) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:112) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:77) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307) > at > org.apache.spark.sq
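The failing pattern above applies a Python UDF to a literal-only input (`randomize_label(lit(1 - 0.1))`). Until the extraction rule handles this, one hedged workaround is to avoid the UDF for such simple per-row logic — in PySpark, built-ins like `F.when(F.rand() >= ratio, 1.0).otherwise(0.0)` express the same idea (my suggestion, not code from the report). The per-row logic itself, in plain testable Python:

```python
import random

# Per-row logic of the reporter's randomize_label UDF, in plain Python.
# `draw` is injectable so the behavior can be checked deterministically.
def randomize_label(ratio, draw=random.random):
    # Returns 1.0 when a uniform [0, 1) draw is >= ratio, else 0.0,
    # mirroring the Python function in the report above.
    return 1.0 if draw() >= ratio else 0.0
```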
[jira] [Updated] (SPARK-25206) wrong records are returned when Hive metastore schema and parquet schema are in different letter cases
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25206: Target Version/s: 2.3.2, 2.4.0 (was: 2.3.2) > wrong records are returned when Hive metastore schema and parquet schema are > in different letter cases > -- > > Key: SPARK-25206 > URL: https://issues.apache.org/jira/browse/SPARK-25206 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.2, 2.3.1 >Reporter: yucai >Priority: Blocker > Labels: Parquet, correctness > Attachments: image-2018-08-24-18-05-23-485.png, > image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, > image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, > image-2018-08-25-10-04-21-901.png, pr22183.png > > > In current Spark 2.3.1, the below query returns wrong data silently. > {code:java} > spark.range(10).write.parquet("/tmp/data") > sql("DROP TABLE t") > sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") > scala> sql("select * from t where id > 0").show > +---+ > | ID| > +---+ > +---+ > {code} > > *Root Cause* > After a deep dive, there are two issues, both related to different letter > cases between the Hive metastore schema and the parquet schema. > 1. The wrong column is pushed down. > Spark pushes down FilterApi.gt(intColumn("ID"), 0: > Integer) into parquet, but ID does not exist in > /tmp/data (parquet is case sensitive; it actually has id), > so no records are returned. > Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore schema > to do the pushdown, which fixes this issue. > 2. Spark SQL returns NULL for a column whose Hive metastore schema and > Parquet schema are in different letter cases, even when spark.sql.caseSensitive > is set to false. > SPARK-25132 already addressed this issue. 
> > The biggest difference is that, in Spark 2.1, users get an exception for the same > query: > {code:java} > Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in > schema!{code} > So they will know about the issue and fix the query. > But in Spark 2.3, users get the wrong results silently. > > To make the above query work, we need both SPARK-25132 and -SPARK-24716.- > > [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?
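The first root cause — a filter on `ID` pushed into a file whose physical schema only has `id` — can be shown with a toy, case-sensitive "pushdown" in plain Python. This is purely illustrative, not Parquet or Spark code:

```python
# The file physically stores lower-case 'id'; the metastore schema says 'ID'.
rows = [{"id": i} for i in range(10)]

def pushdown_gt(rows, column, value):
    # Case-sensitive matching: a predicate on a column name absent from the
    # physical schema matches nothing, so rows silently vanish.
    return [r for r in rows if column in r and r[column] > value]
```

Here `pushdown_gt(rows, "ID", 0)` returns `[]` — the silent empty result the report describes — while `pushdown_gt(rows, "id", 0)` returns the nine expected rows.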
[jira] [Commented] (SPARK-25206) wrong records are returned when Hive metastore schema and parquet schema are in different letter cases
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593840#comment-16593840 ] Xiao Li commented on SPARK-25206: - Silently ignoring it is bad. Should we issue an exception like we did in the previous version? What do you think? > wrong records are returned when Hive metastore schema and parquet schema are > in different letter cases > -- > > Key: SPARK-25206 > URL: https://issues.apache.org/jira/browse/SPARK-25206 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.2, 2.3.1 >Reporter: yucai >Priority: Blocker > Labels: Parquet, correctness > Attachments: image-2018-08-24-18-05-23-485.png, > image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, > image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, > image-2018-08-25-10-04-21-901.png, pr22183.png > > > In current Spark 2.3.1, the below query returns wrong data silently. > {code:java} > spark.range(10).write.parquet("/tmp/data") > sql("DROP TABLE t") > sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") > scala> sql("select * from t where id > 0").show > +---+ > | ID| > +---+ > +---+ > {code} > > *Root Cause* > After a deep dive, there are two issues, both related to different letter > cases between the Hive metastore schema and the parquet schema. > 1. The wrong column is pushed down. > Spark pushes down FilterApi.gt(intColumn("ID"), 0: > Integer) into parquet, but ID does not exist in > /tmp/data (parquet is case sensitive; it actually has id), > so no records are returned. > Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore schema > to do the pushdown, which fixes this issue. > 2. Spark SQL returns NULL for a column whose Hive metastore schema and > Parquet schema are in different letter cases, even when spark.sql.caseSensitive > is set to false. > SPARK-25132 already addressed this issue. 
> > The biggest difference is that, in Spark 2.1, users get an exception for the same > query: > {code:java} > Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in > schema!{code} > So they will know about the issue and fix the query. > But in Spark 2.3, users get the wrong results silently. > > To make the above query work, we need both SPARK-25132 and -SPARK-24716.- > > [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?
[jira] [Resolved] (SPARK-25227) Extend functionality of to_json to support arrays of differently-typed elements
[ https://issues.apache.org/jira/browse/SPARK-25227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-25227. -- Resolution: Duplicate > Extend functionality of to_json to support arrays of differently-typed > elements > --- > > Key: SPARK-25227 > URL: https://issues.apache.org/jira/browse/SPARK-25227 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 2.3.1 >Reporter: Yuriy Davygora >Priority: Minor > > At the moment, the 'to_json' function only supports a STRUCT or an ARRAY of > STRUCTS as input. Support for ARRAY of primitives is, apparently, coming with > Spark 2.4, but it will only support arrays of elements of same data type. It > will not, for example, support JSON-arrays like > {noformat} > ["string_value", 0, true, null] > {noformat} > which is JSON-valid with schema > {noformat} > {"containsNull":true,"elementType":["string","integer","boolean"],"type":"array"} > {noformat} > We would like to kindly ask you to add support for different-typed element > arrays in the 'to_json' function. This will necessitate extending the > functionality of ArrayType or maybe adding a new type (refer to > [[SPARK-25225]]) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25226) Extend functionality of from_json to support arrays of differently-typed elements
[ https://issues.apache.org/jira/browse/SPARK-25226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593819#comment-16593819 ] Hyukjin Kwon commented on SPARK-25226: -- This is fixed in the current master:
{code}
>>> df = df.withColumn("parsed_data", F.from_json(F.col('data'),
...                                               ArrayType(StringType())))
>>> df.show()
+--------------------+---+--------------------+
|                data| id|         parsed_data|
+--------------------+---+--------------------+
|["string1", true,...|  1|    [string1, true,]|
|["string2", false...|  2|   [string2, false,]|
|["string3", true,...|  3|[string3, true, a...|
+--------------------+---+--------------------+
{code}
Catalog string is preferred over JSON string support, which should rather be deprecated. > Extend functionality of from_json to support arrays of differently-typed > elements > - > > Key: SPARK-25226 > URL: https://issues.apache.org/jira/browse/SPARK-25226 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 2.3.1 >Reporter: Yuriy Davygora >Priority: Minor > > At the moment, the 'from_json' function only supports a STRUCT or an ARRAY of > STRUCTS as input. Support for ARRAY of primitives is, apparently, coming with > Spark 2.4, but it will only support arrays of elements of same data type. It > will not, for example, support JSON-arrays like > {noformat} > ["string_value", 0, true, null] > {noformat} > which is JSON-valid with schema > {noformat} > {"containsNull":true,"elementType":["string","integer","boolean"],"type":"array"} > {noformat} > We would like to kindly ask you to add support for different-typed element > arrays in the 'from_json' function. This will necessitate extending the > functionality of ArrayType or maybe adding a new type (refer to > [[SPARK-25225]])
[jira] [Resolved] (SPARK-25226) Extend functionality of from_json to support arrays of differently-typed elements
[ https://issues.apache.org/jira/browse/SPARK-25226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-25226. -- Resolution: Won't Fix > Extend functionality of from_json to support arrays of differently-typed > elements > - > > Key: SPARK-25226 > URL: https://issues.apache.org/jira/browse/SPARK-25226 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 2.3.1 >Reporter: Yuriy Davygora >Priority: Minor > > At the moment, the 'from_json' function only supports a STRUCT or an ARRAY of > STRUCTS as input. Support for ARRAY of primitives is, apparently, coming with > Spark 2.4, but it will only support arrays of elements of same data type. It > will not, for example, support JSON-arrays like > {noformat} > ["string_value", 0, true, null] > {noformat} > which is JSON-valid with schema > {noformat} > {"containsNull":true,"elementType":["string","integer","boolean"],"type":"array"} > {noformat} > We would like to kindly ask you to add support for different-typed element > arrays in the 'from_json' function. This will necessitate extending the > functionality of ArrayType or maybe adding a new type (refer to > [[SPARK-25225]]) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25226) Extend functionality of from_json to support arrays of differently-typed elements
[ https://issues.apache.org/jira/browse/SPARK-25226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593816#comment-16593816 ] Hyukjin Kwon commented on SPARK-25226: -- can you use:
{code}
>>> df = df.withColumn("parsed_data", F.from_json(F.col('data'),
...                                               "array<string>"))
>>> df.show()
+--------------------+---+--------------------+
|                data| id|         parsed_data|
+--------------------+---+--------------------+
|["string1", true,...|  1|    [string1, true,]|
|["string2", false...|  2|   [string2, false,]|
|["string3", true,...|  3|[string3, true, a...|
+--------------------+---+--------------------+
{code}
instead? > Extend functionality of from_json to support arrays of differently-typed > elements > - > > Key: SPARK-25226 > URL: https://issues.apache.org/jira/browse/SPARK-25226 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 2.3.1 >Reporter: Yuriy Davygora >Priority: Minor > > At the moment, the 'from_json' function only supports a STRUCT or an ARRAY of > STRUCTS as input. Support for ARRAY of primitives is, apparently, coming with > Spark 2.4, but it will only support arrays of elements of same data type. It > will not, for example, support JSON-arrays like > {noformat} > ["string_value", 0, true, null] > {noformat} > which is JSON-valid with schema > {noformat} > {"containsNull":true,"elementType":["string","integer","boolean"],"type":"array"} > {noformat} > We would like to kindly ask you to add support for different-typed element > arrays in the 'from_json' function. This will necessitate extending the > functionality of ArrayType or maybe adding a new type (refer to > [[SPARK-25225]])
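Parsing a mixed-typed JSON array with an `array<string>` schema effectively coerces every element to its string form. A rough plain-Python model of that coercion — an illustration of the idea, not Spark's exact null and quoting behavior:

```python
import json

def parse_as_string_array(s):
    # Parse a JSON array and render each element as a string, keeping
    # nulls as None -- roughly what an array<string> schema yields.
    return [None if v is None else (v if isinstance(v, str) else json.dumps(v))
            for v in json.loads(s)]
```

For instance, `'["string1", true, null]'` becomes `["string1", "true", None]`, matching the shape of the `parsed_data` column shown above.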
[jira] [Commented] (SPARK-25225) Add support for "List"-Type columns
[ https://issues.apache.org/jira/browse/SPARK-25225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593810#comment-16593810 ] Hyukjin Kwon commented on SPARK-25225: -- {quote} I could cast it to string, but then later, when I convert it to JSON it will receive quotation marks around it, and I don't want that; I want whatever client software reads this JSON to read an integer and not a string. {quote} Would you mind clarifying this? > Add support for "List"-Type columns > --- > > Key: SPARK-25225 > URL: https://issues.apache.org/jira/browse/SPARK-25225 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 2.3.1 >Reporter: Yuriy Davygora >Priority: Minor > > At the moment, Spark Dataframe ArrayType-columns only support all elements of > the array being of the same data type. > At our company, we are currently rewriting old MapReduce code with Spark. One > of the frequent use-cases is aggregating data into timeseries: > Example input: > {noformat} > ID date data > 1 2017-01-01 data_1_1 > 1 2018-02-02 data_1_2 > 2 2017-03-03 data_2_1 > 2 2018-04-04 data_2_2 > ... > {noformat} > Expected output: > {noformat} > ID timeseries > 1 [[2017-01-01, data_1_1],[2018-02-02, data_1_2]] > 2 [[2017-03-03, data_2_1],[2018-04-04, data_2_2]] > ... > {noformat} > Here, the values in the data column of the input are, in most cases, not > primitive, but, for example, lists, dicts, nested lists, etc. Spark, however, > does not support creating an array column of a string column and a non-string > column. > We would like to kindly ask you to implement one of the following: > 1. Extend ArrayType to support elements of different data type > 2. Introduce a new container type (ListType?)
which would support elements of > different types > UPDATE: The background here is that I want to be able to parse JSON arrays > of differently-typed elements into Spark DataFrame columns, as well as create > JSON arrays from such columns. See also [[SPARK-25226]] and [[SPARK-25227]]
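Until heterogeneous arrays exist, one common workaround (a sketch of the idea, not an official Spark API) is to JSON-encode each element, so the column becomes a uniform array of strings that consumers can decode back to the original types:

```python
import json

def encode_mixed(values):
    # Turn a heterogeneous list into a uniform list of JSON strings,
    # suitable for storage in an array<string> column.
    return [json.dumps(v) for v in values]

def decode_mixed(encoded):
    # Recover the original typed values on the consumer side.
    return [json.loads(s) for s in encoded]
```

The trade-off is that clients must know to decode each element; it does not address the reporter's wish for a natively typed JSON array.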
[jira] [Created] (SPARK-25250) Race condition with tasks running when new attempt for same stage is created leads to other task in the next attempt running on the same partition id retry multiple times
Parth Gandhi created SPARK-25250: Summary: Race condition with tasks running when new attempt for same stage is created leads to other task in the next attempt running on the same partition id retry multiple times Key: SPARK-25250 URL: https://issues.apache.org/jira/browse/SPARK-25250 Project: Spark Issue Type: Bug Components: Scheduler, Spark Core Affects Versions: 2.3.1 Reporter: Parth Gandhi We recently had a scenario where a race condition occurred: a task from the previous stage attempt finished just before a new attempt for the same stage was created due to a fetch failure, so the new task created in the second attempt for the same partition id retried multiple times with a TaskCommitDenied exception, without realizing that the task in the earlier attempt had already succeeded. For example, consider a task with partition id 9000 and index 9000 running in stage 4.0. We see a fetch failure, so we spawn a new stage attempt 4.1. Within this timespan, the above task completes successfully, marking partition id 9000 as complete for 4.0. However, since stage 4.1 has not yet been created, the taskset info for that stage is not available to the TaskScheduler, so, naturally, partition id 9000 has not been marked completed for 4.1. Stage 4.1 now spawns a task with index 2000 on the same partition id 9000. This task fails with CommitDeniedException and, since it does not see the corresponding partition id marked successful, it keeps retrying. Steps to Reproduce: # Run any large job involving a shuffle operation. # When the ShuffleMap stage finishes and the ResultStage begins running, cause this stage to throw a fetch failure exception (try deleting certain shuffle files on any host). # Observe the task attempt numbers for the next stage attempt. Please note that this issue is an intermittent one, so it might not happen all the time. 
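The race can be modeled with a toy illustration in plain Python (not Spark's scheduler code): a completion recorded under attempt 0 is never propagated to attempt 1, which was created an instant later.

```python
# Toy model of the reported race; purely illustrative, not Spark code.
class StageAttempt:
    def __init__(self, partitions):
        self.partitions = set(partitions)
        self.completed = set()

# A task for partition 9000 finishes under stage attempt 4.0 ...
attempt0 = StageAttempt({9000})
attempt0.completed.add(9000)

# ... just before attempt 4.1 is created. Nothing copies the completion
# over, so attempt 4.1 re-runs partition 9000 and keeps hitting
# TaskCommitDenied, because the commit already belongs to the old task.
attempt1 = StageAttempt({9000})
needs_rerun = attempt1.partitions - attempt1.completed
```

In the toy model `needs_rerun` still contains partition 9000, mirroring how the new taskset keeps scheduling a partition that is already done.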
[jira] [Commented] (SPARK-24009) spark2.3.0 INSERT OVERWRITE LOCAL DIRECTORY '/home/spark/aaaaab'
[ https://issues.apache.org/jira/browse/SPARK-24009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593445#comment-16593445 ] Bang Xiao commented on SPARK-24009: --- any progress here? i met the same error sql: INSERT OVERWRITE LOCAL DIRECTORY '/search/odin/test' row format delimited FIELDS TERMINATED BY '\t' select vrid, query, url, loc_city from custom.common_wap_vr where logdate >= '2018073000' and logdate <= '2018073023' and vrid = '11000801' group by vrid,query, loc_city,url; spark command is : spark-sql --master yarn --deploy-mode client -f test.sql *hive.exec.scratchdir* : /user/hive/datadir-tmp *hive.exec.stagingdir* : /user/hive/datadir-tmp 18/08/27 17:16:21 ERROR util.Utils: Aborting task org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: Mkdirs failed to create file:/user/hive/datadir-tmp_hive_2018-08-27_17-14-45_908_2829491226961893146-1/-ext-1/_temporary/0/_temporary/attempt_20180827171619_0002_m_00_0 (exists=false, cwd=file:/search/hadoop09/yarn_local/usercache/ultraman/appcache/application_1535079600137_133521/container_e09_1535079600137_133521_01_51) at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:249) at org.apache.spark.sql.hive.execution.HiveOutputWriter.(HiveFileFormat.scala:123) at org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:103) at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:367) at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:378) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:269) at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:267) at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:109) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.io.IOException: Mkdirs failed to create file:/user/hive/datadir-tmp_hive_2018-08-27_17-14-45_908_2829491226961893146-1/-ext-1/_temporary/0/_temporary/attempt_20180827171619_0002_m_00_0 (exists=false, cwd=file:/search/hadoop09/yarn_local/usercache/ultraman/appcache/application_1535079600137_133521/container_e09_1535079600137_133521_01_51) at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:447) at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:433) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:925) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:818) at org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat.getHiveRecordWriter(HiveIgnoreKeyTextOutputFormat.java:80) at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getRecordWriter(HiveFileFormatUtils.java:261) at 
org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:246) > spark2.3.0 INSERT OVERWRITE LOCAL DIRECTORY '/home/spark/ab' > - > > Key: SPARK-24009 > URL: https://issues.apache.org/jira/browse/SPARK-24009 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: chris_j >Priority: Major > > local mode spark execute "INSERT OVERWRITE LOCAL DIRECTORY " successfully. > on yarn spark execute "INSERT OVERWRITE LOCAL DIRECTORY " failed, not > permission problem also > > 1.spark-sql -e "INSERT OVERWRITE LOCAL DIRECTORY '/home/spark/ab'row > format delimited FIELDS TERMINATED BY '\t' STORED AS TEXTFILE select * from > default.dim_date" write local directory successful > 2.spark-sql --master yarn -e "INSERT OVERWRITE DIRECTORY 'ab'row format > delimited FIELDS TERMINATED BY '\t' STORED AS TEXTFILE select * from > default.dim_da
[jira] [Commented] (SPARK-25227) Extend functionality of to_json to support arrays of differently-typed elements
[ https://issues.apache.org/jira/browse/SPARK-25227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593368#comment-16593368 ] Maxim Gekk commented on SPARK-25227: > I don't know about to_json. Maybe Maxim Gekk can comment more on that. Here is the PR for that: https://github.com/apache/spark/pull/6 . Please, review it. > Extend functionality of to_json to support arrays of differently-typed > elements > --- > > Key: SPARK-25227 > URL: https://issues.apache.org/jira/browse/SPARK-25227 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 2.3.1 >Reporter: Yuriy Davygora >Priority: Minor > > At the moment, the 'to_json' function only supports a STRUCT or an ARRAY of > STRUCTS as input. Support for ARRAY of primitives is, apparently, coming with > Spark 2.4, but it will only support arrays of elements of same data type. It > will not, for example, support JSON-arrays like > {noformat} > ["string_value", 0, true, null] > {noformat} > which is JSON-valid with schema > {noformat} > {"containsNull":true,"elementType":["string","integer","boolean"],"type":"array"} > {noformat} > We would like to kindly ask you to add support for different-typed element > arrays in the 'to_json' function. This will necessitate extending the > functionality of ArrayType or maybe adding a new type (refer to > [[SPARK-25225]]) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25227) Extend functionality of to_json to support arrays of differently-typed elements
[ https://issues.apache.org/jira/browse/SPARK-25227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593286#comment-16593286 ] Yuriy Davygora commented on SPARK-25227: [~hyukjin.kwon] I only know, that in the upcoming release, the from_json function will support arrays of primitive types. I don't know about to_json. Maybe [~maxgekk] can comment more on that. As for the code, I cannot provide you with a snippet, because I cannot even reach the stage where I would use the to_json function. Let's say I have a dataframe with two columns: "string" which is of type string and "int" which is of integer type. I cannot even do: {noformat} df = df.withColumn("new_column", F.array("string", "int")) {noformat} because Spark ArrayType does not support elements of different type. If this were otherwise, next step for me would have been smth. like {noformat} df.select(F.to_json("new_column")) {noformat} See also [[SPARK-25226]] > Extend functionality of to_json to support arrays of differently-typed > elements > --- > > Key: SPARK-25227 > URL: https://issues.apache.org/jira/browse/SPARK-25227 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 2.3.1 >Reporter: Yuriy Davygora >Priority: Minor > > At the moment, the 'to_json' function only supports a STRUCT or an ARRAY of > STRUCTS as input. Support for ARRAY of primitives is, apparently, coming with > Spark 2.4, but it will only support arrays of elements of same data type. It > will not, for example, support JSON-arrays like > {noformat} > ["string_value", 0, true, null] > {noformat} > which is JSON-valid with schema > {noformat} > {"containsNull":true,"elementType":["string","integer","boolean"],"type":"array"} > {noformat} > We would like to kindly ask you to add support for different-typed element > arrays in the 'to_json' function. 
This will necessitate extending the > functionality of ArrayType or maybe adding a new type (refer to > [[SPARK-25225]]) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25226) Extend functionality of from_json to support arrays of differently-typed elements
[ https://issues.apache.org/jira/browse/SPARK-25226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593283#comment-16593283 ] Yuriy Davygora commented on SPARK-25226:
----------------------------------------

[~hyukjin.kwon] Yes, sure. Here is some simple Python code to reproduce the problems (using pyspark 2.3.1 and pandas 0.23.4):
{noformat}
import pandas as pd
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import StringType, ArrayType

spark = SparkSession.builder.appName('test').getOrCreate()

data = {'id' : [1, 2, 3],
        'data' : ['["string1", true, null]',
                  '["string2", false, null]',
                  '["string3", true, "another_string3"]']}
pdf = pd.DataFrame.from_dict(data)
df = spark.createDataFrame(pdf)
df.show()

df = df.withColumn("parsed_data", F.from_json(F.col('data'), ArrayType(StringType())))  # Does not work: the schema is not a struct or an array of structs
df = df.withColumn("parsed_data", F.from_json(F.col('data'), '{"containsNull":true,"elementType":["string","integer","boolean"],"type":"array"}'))  # Does not work at all
{noformat}

> Extend functionality of from_json to support arrays of differently-typed elements
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-25226
>                 URL: https://issues.apache.org/jira/browse/SPARK-25226
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark, Spark Core
>    Affects Versions: 2.3.1
>            Reporter: Yuriy Davygora
>            Priority: Minor
>
> At the moment, the 'from_json' function only supports a STRUCT or an ARRAY of STRUCTS as input. Support for ARRAY of primitives is, apparently, coming with Spark 2.4, but it will only support arrays of elements of the same data type. It will not, for example, support JSON arrays like
> {noformat}
> ["string_value", 0, true, null]
> {noformat}
> which is valid JSON with schema
> {noformat}
> {"containsNull":true,"elementType":["string","integer","boolean"],"type":"array"}
> {noformat}
> We would like to kindly ask you to add support for arrays of differently-typed elements in the 'from_json' function. This will necessitate extending the functionality of ArrayType or maybe adding a new type (refer to [[SPARK-25225]])
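Until such support exists, one workaround sketch (hedged: the helper below is illustrative, not part of any Spark API) is to parse the JSON in plain Python and re-encode every element as a string, so the result fits an ArrayType(StringType()) column when wrapped in a UDF:

```python
import json

def parse_mixed_as_strings(s):
    """Parse a JSON array with mixed-type elements and re-encode each
    element as its own JSON string (None stays None), so the whole
    result is uniformly string-typed."""
    if s is None:
        return None
    return [None if e is None else json.dumps(e) for e in json.loads(s)]

# A row from the reproduction above:
print(parse_mixed_as_strings('["string1", true, null]'))
# ['"string1"', 'true', None]
```

Re-encoding each element with json.dumps keeps the type recoverable downstream (quoted strings vs. bare literals). In PySpark this could then be registered via F.udf(parse_mixed_as_strings, ArrayType(StringType())); alternatively, individual positions can be pulled out without any UDF using F.get_json_object(F.col('data'), '$[0]'), which also returns strings.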
[jira] [Commented] (SPARK-25225) Add support for "List"-Type columns
[ https://issues.apache.org/jira/browse/SPARK-25225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593282#comment-16593282 ] Yuriy Davygora commented on SPARK-25225:
----------------------------------------

[~maropu] Sorry, I was not quite clear on the ultimate purpose, which is to convert such columns to JSON arrays and also convert JSON arrays to Spark Dataframe columns.

[~hyukjin.kwon] Let's say that, in the above example, data is just an integer. I could cast it to string, but then later, when I convert it to JSON, it will receive quotation marks around it, and I don't want that: I want whatever client software reads this JSON to read an integer and not a string. Besides, in most cases, 'data' is not a primitive-typed column. For example, in the code that I am working on right now, it is an array of arrays of integers, which cannot be cast to a string that easily.

> Add support for "List"-Type columns
> -----------------------------------
>
>                 Key: SPARK-25225
>                 URL: https://issues.apache.org/jira/browse/SPARK-25225
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark, Spark Core
>    Affects Versions: 2.3.1
>            Reporter: Yuriy Davygora
>            Priority: Minor
>
> At the moment, Spark Dataframe ArrayType columns only support all elements of the array being of the same data type.
> At our company, we are currently rewriting old MapReduce code with Spark. One of the frequent use-cases is aggregating data into timeseries:
> Example input:
> {noformat}
> ID    date        data
> 1     2017-01-01  data_1_1
> 1     2018-02-02  data_1_2
> 2     2017-03-03  data_2_1
> 2     2018-04-04  data_2_2
> ...
> {noformat}
> Expected output:
> {noformat}
> ID    timeseries
> 1     [[2017-01-01, data_1_1], [2018-02-02, data_1_2]]
> 2     [[2017-03-03, data_2_1], [2018-04-04, data_2_2]]
> ...
> {noformat}
> Here, the values in the data column of the input are, in most cases, not primitive, but, for example, lists, dicts, nested lists, etc. Spark, however, does not support creating an array column out of a string column and a non-string column.
> We would like to kindly ask you to implement one of the following:
> 1. Extend ArrayType to support elements of different data types
> 2. Introduce a new container type (ListType?) which would support elements of different types
> UPDATE: The background here is that I want to be able to parse JSON arrays of differently-typed elements into Spark Dataframe columns, as well as create JSON arrays from such columns. See also [[SPARK-25226]] and [[SPARK-25227]]
[jira] [Updated] (SPARK-25175) Case-insensitive field resolution when reading from ORC
[ https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chenxiao Mao updated SPARK-25175:
---------------------------------
    Description:
SPARK-25132 adds support for case-insensitive field resolution when reading from Parquet files. We found that ORC files have similar issues. Since Spark has two OrcFileFormat implementations, we should add support to both.
* Since SPARK-2883, Spark supports ORC inside the sql/hive module with a Hive dependency. This hive OrcFileFormat always does case-insensitive field resolution, regardless of the case sensitivity mode. When there is ambiguity, the hive OrcFileFormat always returns the lower-case field rather than failing the read operation.
* SPARK-20682 adds a new ORC data source inside sql/core. This native OrcFileFormat supports case-insensitive field resolution; however, it cannot handle duplicate fields.
Besides data source tables, hive serde tables also have issues. When there are duplicate fields (e.g. c, C), we simply cannot read hive serde tables. If there are no duplicate fields, hive serde tables always do case-insensitive field resolution, regardless of the case sensitivity mode.

  was:
SPARK-25132 adds support for case-insensitive field resolution when reading from Parquet files. We found that ORC files have similar issues. Since Spark has two OrcFileFormat implementations, we should add support to both.
* Since SPARK-2883, Spark supports ORC inside the sql/hive module with a Hive dependency. This hive OrcFileFormat always does case-insensitive field resolution, regardless of the case sensitivity mode. When there is ambiguity, the hive OrcFileFormat always returns the lower-case field rather than failing the read operation.
* SPARK-20682 adds a new ORC data source inside sql/core. This native OrcFileFormat supports case-insensitive field resolution; however, it cannot handle duplicate fields.
Besides data source tables, hive serde tables also have issues. When there is ambiguity, we simply cannot read hive serde tables.

> Case-insensitive field resolution when reading from ORC
> -------------------------------------------------------
>
>                 Key: SPARK-25175
>                 URL: https://issues.apache.org/jira/browse/SPARK-25175
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: Chenxiao Mao
>            Priority: Major
>
> SPARK-25132 adds support for case-insensitive field resolution when reading from Parquet files. We found that ORC files have similar issues. Since Spark has two OrcFileFormat implementations, we should add support to both.
> * Since SPARK-2883, Spark supports ORC inside the sql/hive module with a Hive dependency. This hive OrcFileFormat always does case-insensitive field resolution, regardless of the case sensitivity mode. When there is ambiguity, the hive OrcFileFormat always returns the lower-case field rather than failing the read operation.
> * SPARK-20682 adds a new ORC data source inside sql/core. This native OrcFileFormat supports case-insensitive field resolution; however, it cannot handle duplicate fields.
> Besides data source tables, hive serde tables also have issues. When there are duplicate fields (e.g. c, C), we simply cannot read hive serde tables. If there are no duplicate fields, hive serde tables always do case-insensitive field resolution, regardless of the case sensitivity mode.
[jira] [Updated] (SPARK-25225) Add support for "List"-Type columns
[ https://issues.apache.org/jira/browse/SPARK-25225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuriy Davygora updated SPARK-25225: --- Description: At the moment, Spark Dataframe ArrayType-columns only support all elements of the array being of same data type. At our company, we are currently rewriting old MapReduce code with Spark. One of the frequent use-cases is aggregating data into timeseries: Example input: {noformat} ID datedata 1 2017-01-01 data_1_1 1 2018-02-02 data_1_2 2 2017-03-03 data_2_1 3 2018-04-04 data 2_2 ... {noformat} Expected outpus: {noformat} ID timeseries 1 [[2017-01-01, data_1_1],[2018-02-02, data1_2]] 2 [[2017-03-03, data_2_1],[2018-04-04, data2_2]] ... {noformat} Here, the values in the data column of the input are, in most cases, not primitive, but, for example, lists, dicts, nested lists, etc. Spark, however, does not support creating an array column of a string column and a non-string column. We would like to kindly ask you to implement one of the following: 1. Extend ArrayType to support elements of different data type 2. Introduce a new container type (ListType?) which would support elements of different type UPDATE: The background here is, that I want to be able to parse JSON-arrays of differently-typed elements into SPARK Dataframe columns, as well as create JSON arrays from such columns. was: At the moment, Spark Dataframe ArrayType-columns only support all elements of the array being of same data type. At our company, we are currently rewriting old MapReduce code with Spark. One of the frequent use-cases is aggregating data into timeseries: Example input: {noformat} ID datedata 1 2017-01-01 data_1_1 1 2018-02-02 data_1_2 2 2017-03-03 data_2_1 3 2018-04-04 data 2_2 ... {noformat} Expected outpus: {noformat} ID timeseries 1 [[2017-01-01, data_1_1],[2018-02-02, data1_2]] 2 [[2017-03-03, data_2_1],[2018-04-04, data2_2]] ... 
{noformat} Here, the values in the data column of the input are, in most cases, not primitive, but, for example, lists, dicts, nested lists, etc. Spark, however, does not support creating an array column of a string column and a non-string column. We would like to kindly ask you to implement one of the following: 1. Extend ArrayType to support elements of different data type 2. Introduce a new container type (ListType?) which would support elements of different type > Add support for "List"-Type columns > --- > > Key: SPARK-25225 > URL: https://issues.apache.org/jira/browse/SPARK-25225 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 2.3.1 >Reporter: Yuriy Davygora >Priority: Minor > > At the moment, Spark Dataframe ArrayType-columns only support all elements of > the array being of same data type. > At our company, we are currently rewriting old MapReduce code with Spark. One > of the frequent use-cases is aggregating data into timeseries: > Example input: > {noformat} > IDdatedata > 1 2017-01-01 data_1_1 > 1 2018-02-02 data_1_2 > 2 2017-03-03 data_2_1 > 3 2018-04-04 data 2_2 > ... > {noformat} > Expected outpus: > {noformat} > IDtimeseries > 1 [[2017-01-01, data_1_1],[2018-02-02, data1_2]] > 2 [[2017-03-03, data_2_1],[2018-04-04, data2_2]] > ... > {noformat} > Here, the values in the data column of the input are, in most cases, not > primitive, but, for example, lists, dicts, nested lists, etc. Spark, however, > does not support creating an array column of a string column and a non-string > column. > We would like to kindly ask you to implement one of the following: > 1. Extend ArrayType to support elements of different data type > 2. Introduce a new container type (ListType?) which would support elements of > different type > UPDATE: The background here is, that I want to be able to parse JSON-arrays > of differently-typed elements into SPARK Dataframe columns, as well as create > JSON arrays from such columns. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25225) Add support for "List"-Type columns
[ https://issues.apache.org/jira/browse/SPARK-25225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuriy Davygora updated SPARK-25225: --- Description: At the moment, Spark Dataframe ArrayType-columns only support all elements of the array being of same data type. At our company, we are currently rewriting old MapReduce code with Spark. One of the frequent use-cases is aggregating data into timeseries: Example input: {noformat} ID datedata 1 2017-01-01 data_1_1 1 2018-02-02 data_1_2 2 2017-03-03 data_2_1 3 2018-04-04 data 2_2 ... {noformat} Expected outpus: {noformat} ID timeseries 1 [[2017-01-01, data_1_1],[2018-02-02, data1_2]] 2 [[2017-03-03, data_2_1],[2018-04-04, data2_2]] ... {noformat} Here, the values in the data column of the input are, in most cases, not primitive, but, for example, lists, dicts, nested lists, etc. Spark, however, does not support creating an array column of a string column and a non-string column. We would like to kindly ask you to implement one of the following: 1. Extend ArrayType to support elements of different data type 2. Introduce a new container type (ListType?) which would support elements of different type UPDATE: The background here is, that I want to be able to parse JSON-arrays of differently-typed elements into SPARK Dataframe columns, as well as create JSON arrays from such columns. See also [[SPARK-25226]] and [[SPARK-25227]] was: At the moment, Spark Dataframe ArrayType-columns only support all elements of the array being of same data type. At our company, we are currently rewriting old MapReduce code with Spark. One of the frequent use-cases is aggregating data into timeseries: Example input: {noformat} ID datedata 1 2017-01-01 data_1_1 1 2018-02-02 data_1_2 2 2017-03-03 data_2_1 3 2018-04-04 data 2_2 ... {noformat} Expected outpus: {noformat} ID timeseries 1 [[2017-01-01, data_1_1],[2018-02-02, data1_2]] 2 [[2017-03-03, data_2_1],[2018-04-04, data2_2]] ... 
{noformat} Here, the values in the data column of the input are, in most cases, not primitive, but, for example, lists, dicts, nested lists, etc. Spark, however, does not support creating an array column of a string column and a non-string column. We would like to kindly ask you to implement one of the following: 1. Extend ArrayType to support elements of different data type 2. Introduce a new container type (ListType?) which would support elements of different type UPDATE: The background here is, that I want to be able to parse JSON-arrays of differently-typed elements into SPARK Dataframe columns, as well as create JSON arrays from such columns. > Add support for "List"-Type columns > --- > > Key: SPARK-25225 > URL: https://issues.apache.org/jira/browse/SPARK-25225 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 2.3.1 >Reporter: Yuriy Davygora >Priority: Minor > > At the moment, Spark Dataframe ArrayType-columns only support all elements of > the array being of same data type. > At our company, we are currently rewriting old MapReduce code with Spark. One > of the frequent use-cases is aggregating data into timeseries: > Example input: > {noformat} > IDdatedata > 1 2017-01-01 data_1_1 > 1 2018-02-02 data_1_2 > 2 2017-03-03 data_2_1 > 3 2018-04-04 data 2_2 > ... > {noformat} > Expected outpus: > {noformat} > IDtimeseries > 1 [[2017-01-01, data_1_1],[2018-02-02, data1_2]] > 2 [[2017-03-03, data_2_1],[2018-04-04, data2_2]] > ... > {noformat} > Here, the values in the data column of the input are, in most cases, not > primitive, but, for example, lists, dicts, nested lists, etc. Spark, however, > does not support creating an array column of a string column and a non-string > column. > We would like to kindly ask you to implement one of the following: > 1. Extend ArrayType to support elements of different data type > 2. Introduce a new container type (ListType?) 
which would support elements of > different type > UPDATE: The background here is, that I want to be able to parse JSON-arrays > of differently-typed elements into SPARK Dataframe columns, as well as create > JSON arrays from such columns. See also [[SPARK-25226]] and [[SPARK-25227]] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25227) Extend functionality of to_json to support arrays of differently-typed elements
[ https://issues.apache.org/jira/browse/SPARK-25227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuriy Davygora updated SPARK-25227: --- Summary: Extend functionality of to_json to support arrays of differently-typed elements (was: Extend functionality of to_json) > Extend functionality of to_json to support arrays of differently-typed > elements > --- > > Key: SPARK-25227 > URL: https://issues.apache.org/jira/browse/SPARK-25227 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 2.3.1 >Reporter: Yuriy Davygora >Priority: Minor > > At the moment, the 'to_json' function only supports a STRUCT or an ARRAY of > STRUCTS as input. Support for ARRAY of primitives is, apparently, coming with > Spark 2.4, but it will only support arrays of elements of same data type. It > will not, for example, support JSON-arrays like > {noformat} > ["string_value", 0, true, null] > {noformat} > which is JSON-valid with schema > {noformat} > {"containsNull":true,"elementType":["string","integer","boolean"],"type":"array"} > {noformat} > We would like to kindly ask you to add support for different-typed element > arrays in the 'to_json' function. This will necessitate extending the > functionality of ArrayType or maybe adding a new type (refer to > [[SPARK-25225]]) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25226) Extend functionality of from_json to support arrays of differently-typed elements
[ https://issues.apache.org/jira/browse/SPARK-25226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuriy Davygora updated SPARK-25226: --- Summary: Extend functionality of from_json to support arrays of differently-typed elements (was: Extend functionality of from_json) > Extend functionality of from_json to support arrays of differently-typed > elements > - > > Key: SPARK-25226 > URL: https://issues.apache.org/jira/browse/SPARK-25226 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 2.3.1 >Reporter: Yuriy Davygora >Priority: Minor > > At the moment, the 'from_json' function only supports a STRUCT or an ARRAY of > STRUCTS as input. Support for ARRAY of primitives is, apparently, coming with > Spark 2.4, but it will only support arrays of elements of same data type. It > will not, for example, support JSON-arrays like > {noformat} > ["string_value", 0, true, null] > {noformat} > which is JSON-valid with schema > {noformat} > {"containsNull":true,"elementType":["string","integer","boolean"],"type":"array"} > {noformat} > We would like to kindly ask you to add support for different-typed element > arrays in the 'from_json' function. This will necessitate extending the > functionality of ArrayType or maybe adding a new type (refer to > [[SPARK-25225]]) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-25175) Case-insensitive field resolution when reading from ORC
[ https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593194#comment-16593194 ] Chenxiao Mao edited comment on SPARK-25175 at 8/27/18 7:50 AM:
---------------------------------------------------------------
[~dongjoon] [~yucai] Here is a brief summary. We can see that:
* The data source tables with the hive impl always return a,B,c, no matter whether spark.sql.caseSensitive is set to true or false and no matter whether the metastore table schema is in lower case or upper case. It seems they always do case-insensitive field resolution, and if there is ambiguity they return the lower-case ones. Given that the ORC file schema is (a,B,c,C):
** Is it better to return null in scenarios 2 and 10?
** Is it better to return C in scenario 12?
** Is it better to fail due to ambiguity in scenarios 15, 18, 21, 24, rather than always return the lower-case one?
* The data source tables with the native impl, compared to the hive impl, handle scenarios 2, 10, 12 in a more reasonable way. However, they handle ambiguity in the same way as the hive impl.
* The hive serde tables always throw IndexOutOfBoundsException at runtime when there are duplicate fields (e.g. c, C). If there are no duplicate fields, it seems hive serde tables always do case-insensitive field resolution, just like the hive implementation of OrcFileFormat.
* Since in case-sensitive mode analysis should fail if a column name in the query and in the metastore schema are in different cases, all AnalysisException(s) meet our expectation.
Stacktrace of IndexOutOfBoundsException:
{code:java}
java.lang.IndexOutOfBoundsException: toIndex = 4
  at java.util.ArrayList.subListRangeCheck(ArrayList.java:1004)
  at java.util.ArrayList.subList(ArrayList.java:996)
  at org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.getSchemaOnRead(RecordReaderFactory.java:161)
  at org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.createTreeReader(RecordReaderFactory.java:66)
  at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.<init>(RecordReaderImpl.java:202)
  at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.rowsOptions(ReaderImpl.java:539)
  at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$ReaderPair.<init>(OrcRawRecordMerger.java:183)
  at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$OriginalReaderPair.<init>(OrcRawRecordMerger.java:226)
  at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger.<init>(OrcRawRecordMerger.java:437)
  at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getReader(OrcInputFormat.java:1215)
  at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1113)
  at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:257)
  at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:256)
  at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:214)
  at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:94)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
  at org.apache.spark.scheduler.Task.run(Task.scala:109)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:748)
{code}

was (Author: seancxmao):
[~dongjoon] [~yucai] Here is a brief summary
[jira] [Resolved] (SPARK-24978) Add spark.sql.fast.hash.aggregate.row.max.capacity to configure the capacity of fast aggregation.
[ https://issues.apache.org/jira/browse/SPARK-24978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-24978. - Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21931 [https://github.com/apache/spark/pull/21931] > Add spark.sql.fast.hash.aggregate.row.max.capacity to configure the capacity > of fast aggregation. > - > > Key: SPARK-24978 > URL: https://issues.apache.org/jira/browse/SPARK-24978 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: caoxuewen >Assignee: caoxuewen >Priority: Major > Fix For: 2.4.0 > > > this pr add a configuration parameter to configure the capacity of fast > aggregation. > Performance comparison: > /* > Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27 on Windows 7 6.1 > Intel64 Family 6 Model 94 Stepping 3, GenuineIntel > Aggregate w multiple keys: Best/Avg Time(ms) Rate(M/s) > Per Row(ns) Relative > > > fasthash = default 5612 / 5882 3.7 > 267.6 1.0X > fasthash = config 3586 / 3595 5.8 > 171.0 1.6X > */ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24978) Add spark.sql.fast.hash.aggregate.row.max.capacity to configure the capacity of fast aggregation.
[ https://issues.apache.org/jira/browse/SPARK-24978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-24978: --- Assignee: caoxuewen > Add spark.sql.fast.hash.aggregate.row.max.capacity to configure the capacity > of fast aggregation. > - > > Key: SPARK-24978 > URL: https://issues.apache.org/jira/browse/SPARK-24978 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: caoxuewen >Assignee: caoxuewen >Priority: Major > > this pr add a configuration parameter to configure the capacity of fast > aggregation. > Performance comparison: > /* > Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27 on Windows 7 6.1 > Intel64 Family 6 Model 94 Stepping 3, GenuineIntel > Aggregate w multiple keys: Best/Avg Time(ms) Rate(M/s) > Per Row(ns) Relative > > > fasthash = default 5612 / 5882 3.7 > 267.6 1.0X > fasthash = config 3586 / 3595 5.8 > 171.0 1.6X > */ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25249) Add a unit test for OpenHashMap
[ https://issues.apache.org/jira/browse/SPARK-25249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593268#comment-16593268 ] Apache Spark commented on SPARK-25249:
--------------------------------------
User '10110346' has created a pull request for this issue: https://github.com/apache/spark/pull/22241

> Add a unit test for OpenHashMap
> -------------------------------
>
>                 Key: SPARK-25249
>                 URL: https://issues.apache.org/jira/browse/SPARK-25249
>             Project: Spark
>          Issue Type: Test
>          Components: Tests
>    Affects Versions: 2.4.0
>            Reporter: liuxian
>            Priority: Minor
>
> Adding a unit test for OpenHashMap; this can help developers distinguish between the stored values 0/0.0/0L and a non-existent value.
[jira] [Assigned] (SPARK-25249) Add a unit test for OpenHashMap
[ https://issues.apache.org/jira/browse/SPARK-25249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25249: Assignee: (was: Apache Spark) > Add a unit test for OpenHashMap > --- > > Key: SPARK-25249 > URL: https://issues.apache.org/jira/browse/SPARK-25249 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 2.4.0 >Reporter: liuxian >Priority: Minor > > Adding a unit test for OpenHashMap , this can help developers to distinguish > between the 0/0.0/0L and non-exist value > {color:#629755} {color} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25249) Add a unit test for OpenHashMap
[ https://issues.apache.org/jira/browse/SPARK-25249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25249: Assignee: Apache Spark > Add a unit test for OpenHashMap > --- > > Key: SPARK-25249 > URL: https://issues.apache.org/jira/browse/SPARK-25249 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 2.4.0 >Reporter: liuxian >Assignee: Apache Spark >Priority: Minor > > Adding a unit test for OpenHashMap , this can help developers to distinguish > between the 0/0.0/0L and non-exist value > {color:#629755} {color} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-25175) Case-insensitive field resolution when reading from ORC
[ https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593185#comment-16593185 ]

Chenxiao Mao edited comment on SPARK-25175 at 8/27/18 7:33 AM:
---
A thorough investigation of ORC tables with duplicate fields (c and C).

{code:java}
val data = spark.range(5).selectExpr("id as a", "id * 2 as B", "id * 3 as c", "id * 4 as C")
spark.conf.set("spark.sql.caseSensitive", true)
data.write.format("orc").mode("overwrite").save("/user/hive/warehouse/orc_data")

$> hive --orcfiledump /user/hive/warehouse/orc_data/part-1-9716d241-9ad9-4d56-8de3-7bc482067614-c000.snappy.orc
Structure for /user/hive/warehouse/orc_data/part-1-9716d241-9ad9-4d56-8de3-7bc482067614-c000.snappy.orc
Type: struct

CREATE TABLE orc_data_source_lower (a LONG, b LONG, c LONG) USING orc LOCATION '/user/hive/warehouse/orc_data'
CREATE TABLE orc_data_source_upper (A LONG, B LONG, C LONG) USING orc LOCATION '/user/hive/warehouse/orc_data'
CREATE TABLE orc_hive_serde_lower (a LONG, b LONG, c LONG) STORED AS orc LOCATION '/user/hive/warehouse/orc_data'
CREATE TABLE orc_hive_serde_upper (A LONG, B LONG, C LONG) STORED AS orc LOCATION '/user/hive/warehouse/orc_data'

DESC EXTENDED orc_data_source_lower;
DESC EXTENDED orc_data_source_upper;
DESC EXTENDED orc_hive_serde_lower;
DESC EXTENDED orc_hive_serde_upper;

spark.conf.set("spark.sql.hive.convertMetastoreOrc", false)
{code}

||no.||caseSensitive||table columns||select column||orc column (select via data source table, hive impl)||orc column (select via data source table, native impl)||orc column (select via hive serde table)||
|1|true|a, b, c|a|a|a|IndexOutOfBoundsException|
|2| | |b|B|null|IndexOutOfBoundsException|
|3| | |c|c|c|IndexOutOfBoundsException|
|4| | |A|AnalysisException|AnalysisException|AnalysisException|
|5| | |B|AnalysisException|AnalysisException|AnalysisException|
|6| | |C|AnalysisException|AnalysisException|AnalysisException|
|7| |A, B, C|a|AnalysisException|AnalysisException|AnalysisException|
|8| | |b|AnalysisException|AnalysisException|AnalysisException|
|9| | |c|AnalysisException|AnalysisException|AnalysisException|
|10| | |A|a|null|IndexOutOfBoundsException|
|11| | |B|B|B|IndexOutOfBoundsException|
|12| | |C|c|C|IndexOutOfBoundsException|
|13|false|a, b, c|a|a|a|IndexOutOfBoundsException|
|14| | |b|B|B|IndexOutOfBoundsException|
|15| | |c|c|c|IndexOutOfBoundsException|
|16| | |A|a|a|IndexOutOfBoundsException|
|17| | |B|B|B|IndexOutOfBoundsException|
|18| | |C|c|c|IndexOutOfBoundsException|
|19| |A, B, C|a|a|a|IndexOutOfBoundsException|
|20| | |b|B|B|IndexOutOfBoundsException|
|21| | |c|c|c|IndexOutOfBoundsException|
|22| | |A|a|a|IndexOutOfBoundsException|
|23| | |B|B|B|IndexOutOfBoundsException|
|24| | |C|c|c|IndexOutOfBoundsException|

Follow-up tests that use ORC files with no duplicate fields (only a and B).

{code:java}
val data = spark.range(5).selectExpr("id as a", "id * 2 as B")
spark.conf.set("spark.sql.caseSensitive", true)
data.write.format("orc").mode("overwrite").save("/user/hive/warehouse/orc_data_nodup")

$> hive --orcfiledump /user/hive/warehouse/orc_data_nodup/part-1-4befd318-9ed5-4d77-b51b-09848d71d9cd-c000.snappy.orc
Structure for /user/hive/warehouse/orc_data_nodup/part-1-4befd318-9ed5-4d77-b51b-09848d71d9cd-c000.snappy.orc
Type: struct

CREATE TABLE orc_nodup_hive_serde_lower (a LONG, b LONG) STORED AS orc LOCATION '/user/hive/warehouse/orc_data_nodup'
CREATE TABLE orc_nodup_hive_serde_upper (A LONG, B LONG) STORED AS orc LOCATION '/user/hive/warehouse/orc_data_nodup'

DESC EXTENDED orc_nodup_hive_serde_lower;
DESC EXTENDED orc_nodup_hive_serde_upper;

spark.conf.set("spark.sql.hive.convertMetastoreOrc", false)
{code}

||no.||caseSensitive||table columns||select column||orc column (select via hive serde table)||
|1|true|a, b|a|a|
|2| | |b|B|
|4| | |A|AnalysisException|
|5| | |B|AnalysisException|
|7| |A, B|a|AnalysisException|
|8| | |b|AnalysisException|
|10| | |A|a|
|11| | |B|B|
|13|false|a, b|a|a|
|14| | |b|B|
|16| | |A|a|
|17| | |B|B|
|19| |A, B|a|a|
|20| | |b|B|
|22| | |A|a|
|23| | |B|B|
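The resolution outcomes tabulated above fall into three cases: exact match, unique case-insensitive match, and ambiguity. The stand-alone sketch below illustrates those three cases with a hypothetical resolver; it is written for illustration only, is not Spark's implementation, and its exact-match-first preference is just one plausible policy, so it does not reproduce every cell of the table.

```python
def resolve_field(schema, name, case_sensitive):
    """Hypothetical field resolver, for illustration only (not Spark's code).

    schema is the list of physical ORC field names; name is the requested
    column. Raises LookupError where Spark would raise AnalysisException
    or report an ambiguous reference.
    """
    if case_sensitive:
        if name in schema:
            return name
        raise LookupError(f"no field named '{name}'")
    # Case-insensitive mode: prefer an exact match when one exists.
    if name in schema:
        return name
    matches = [f for f in schema if f.lower() == name.lower()]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise LookupError(f"no field named '{name}'")
    # Duplicate fields differing only by case: genuinely ambiguous.
    raise LookupError(f"ambiguous reference '{name}': matches {matches}")


schema = ["a", "B", "c", "C"]
print(resolve_field(schema, "c", case_sensitive=True))   # exact match
print(resolve_field(schema, "b", case_sensitive=False))  # unique insensitive match
```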
[jira] [Created] (SPARK-25249) Add a unit test for OpenHashMap
liuxian created SPARK-25249:
---
Summary: Add a unit test for OpenHashMap
Key: SPARK-25249
URL: https://issues.apache.org/jira/browse/SPARK-25249
Project: Spark
Issue Type: Test
Components: Tests
Affects Versions: 2.4.0
Reporter: liuxian

Add a unit test for OpenHashMap; this can help developers distinguish between the default values 0/0.0/0L and a non-existent value.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
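The distinction the proposed test targets can be sketched outside Spark. The class below is a minimal hypothetical stand-in (not Spark's OpenHashMap) showing why a value lookup alone cannot separate an explicitly stored zero from an absent key, and why a dedicated test needs a containment check:

```python
class OpenHashMapSketch:
    """Minimal stand-in for a map whose lookup returns the type's
    zero value for absent keys -- illustrative only, not Spark's code."""

    def __init__(self, zero=0):
        self._data = {}
        self._zero = zero

    def update(self, key, value):
        self._data[key] = value

    def apply(self, key):
        # Returns the zero value when the key is absent, so a stored 0
        # and a missing key are indistinguishable from the result alone.
        return self._data.get(key, self._zero)

    def contains(self, key):
        return key in self._data


m = OpenHashMapSketch()
m.update("x", 0)
# apply() cannot tell the stored 0 from the missing key "y":
assert m.apply("x") == 0 and m.apply("y") == 0
# contains() is what a unit test must use to distinguish them:
assert m.contains("x") and not m.contains("y")
```

The same pitfall applies to 0.0 and 0L specializations: any test asserting only on the looked-up value passes vacuously for absent keys.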
[jira] [Updated] (SPARK-25175) Case-insensitive field resolution when reading from ORC
[ https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chenxiao Mao updated SPARK-25175:
-
Description:
SPARK-25132 adds support for case-insensitive field resolution when reading from Parquet files. We found ORC files have similar issues. Since Spark has 2 OrcFileFormat implementations, we should add support for both.
* Since SPARK-2883, Spark supports ORC inside the sql/hive module with a Hive dependency. This hive OrcFileFormat always does case-insensitive field resolution regardless of the case sensitivity mode. When there is ambiguity, the hive OrcFileFormat always returns the lower-case field rather than failing the read operation.
* SPARK-20682 adds a new ORC data source inside sql/core. This native OrcFileFormat supports case-insensitive field resolution; however, it cannot handle duplicate fields.

Besides data source tables, hive serde tables also have issues. When there is ambiguity, we just can't read hive serde tables.

was:
SPARK-25132 adds support for case-insensitive field resolution when reading from Parquet files. We found ORC files have similar issues. Since Spark has 2 OrcFileFormat implementations, we should add support for both.
* Since SPARK-2883, Spark supports ORC inside the sql/hive module with a Hive dependency. This hive OrcFileFormat always does case-insensitive field resolution regardless of the case sensitivity mode. When there is ambiguity, the hive OrcFileFormat always returns the lower-case field rather than failing the read operation.
* SPARK-20682 adds a new ORC data source inside sql/core. This native OrcFileFormat supports case-insensitive field resolution; however, it cannot handle duplicate fields.

Besides data source tables, hive serde tables also have issues. When there i

> Case-insensitive field resolution when reading from ORC
> ---
>
> Key: SPARK-25175
> URL: https://issues.apache.org/jira/browse/SPARK-25175
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.1
> Reporter: Chenxiao Mao
> Priority: Major
>
> SPARK-25132 adds support for case-insensitive field resolution when reading
> from Parquet files. We found ORC files have similar issues. Since Spark has 2
> OrcFileFormat implementations, we should add support for both.
> * Since SPARK-2883, Spark supports ORC inside the sql/hive module with a Hive
> dependency. This hive OrcFileFormat always does case-insensitive field
> resolution regardless of the case sensitivity mode. When there is ambiguity,
> the hive OrcFileFormat always returns the lower-case field rather than
> failing the read operation.
> * SPARK-20682 adds a new ORC data source inside sql/core. This native
> OrcFileFormat supports case-insensitive field resolution; however, it cannot
> handle duplicate fields.
> Besides data source tables, hive serde tables also have issues. When there is
> ambiguity, we just can't read hive serde tables.
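The hive OrcFileFormat policy described here, resolution that is always case-insensitive and that returns the lower-case field instead of failing when several physical fields differ only by case, can be sketched as follows. This is a hypothetical illustration of the described behavior, not the actual Hive reader code:

```python
def hive_style_resolve(schema, name):
    """Sketch of the described policy (illustrative only, not Hive's code):
    match case-insensitively, and when several physical fields differ only
    by case, prefer the lower-case field instead of raising an error."""
    matches = [f for f in schema if f.lower() == name.lower()]
    if not matches:
        return None
    # (f != f.lower()) is False for all-lower-case names, so they sort first.
    return min(matches, key=lambda f: f != f.lower())


# Duplicate fields c and C: the lower-case field wins, even for SELECT C.
print(hive_style_resolve(["a", "B", "c", "C"], "C"))  # -> c
```

This matches the investigation's table, where the hive implementation returns `c` for both `c` and `C`, rather than surfacing the ambiguity as the native implementation's duplicate-field handling would need to.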
[jira] [Updated] (SPARK-25175) Case-insensitive field resolution when reading from ORC
[ https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chenxiao Mao updated SPARK-25175:
-
Description:
SPARK-25132 adds support for case-insensitive field resolution when reading from Parquet files. We found ORC files have similar issues. Since Spark has 2 OrcFileFormat implementations, we should add support for both.
* Since SPARK-2883, Spark supports ORC inside the sql/hive module with a Hive dependency. This hive OrcFileFormat always does case-insensitive field resolution regardless of the case sensitivity mode. When there is ambiguity, the hive OrcFileFormat always returns the lower-case field rather than failing the read operation.
* SPARK-20682 adds a new ORC data source inside sql/core. This native OrcFileFormat supports case-insensitive field resolution; however, it cannot handle duplicate fields.

Besides data source tables, hive serde tables also have issues. When there i

was:
SPARK-25132 adds support for case-insensitive field resolution when reading from Parquet files. We found ORC files have similar issues. Since Spark has 2 OrcFileFormat implementations, we should add support for both.
* Since SPARK-2883, Spark supports ORC inside the sql/hive module with a Hive dependency. This hive OrcFileFormat doesn't support case-insensitive field resolution at all.
* SPARK-20682 adds a new ORC data source inside sql/core. This native OrcFileFormat supports case-insensitive field resolution; however, it cannot handle duplicate fields.

> Case-insensitive field resolution when reading from ORC
> ---
>
> Key: SPARK-25175
> URL: https://issues.apache.org/jira/browse/SPARK-25175
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.1
> Reporter: Chenxiao Mao
> Priority: Major
>
> SPARK-25132 adds support for case-insensitive field resolution when reading
> from Parquet files. We found ORC files have similar issues. Since Spark has 2
> OrcFileFormat implementations, we should add support for both.
> * Since SPARK-2883, Spark supports ORC inside the sql/hive module with a Hive
> dependency. This hive OrcFileFormat always does case-insensitive field
> resolution regardless of the case sensitivity mode. When there is ambiguity,
> the hive OrcFileFormat always returns the lower-case field rather than
> failing the read operation.
> * SPARK-20682 adds a new ORC data source inside sql/core. This native
> OrcFileFormat supports case-insensitive field resolution; however, it cannot
> handle duplicate fields.
> Besides data source tables, hive serde tables also have issues. When there i