[jira] [Assigned] (SPARK-25259) Left/Right join support push down during-join predicates

2018-08-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25259:


Assignee: (was: Apache Spark)

> Left/Right join support push down during-join predicates
> 
>
> Key: SPARK-25259
> URL: https://issues.apache.org/jira/browse/SPARK-25259
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> For example:
> {code:sql}
> create temporary view EMPLOYEE as select * from values
>   ("10", "HAAS", "A00"),
>   ("10", "THOMPSON", "B01"),
>   ("30", "KWAN", "C01"),
>   ("000110", "LUCCHESSI", "A00"),
>   ("000120", "O'CONNELL", "A))"),
>   ("000130", "QUINTANA", "C01")
>   as EMPLOYEE(EMPNO, LASTNAME, WORKDEPT);
> create temporary view DEPARTMENT as select * from values
>   ("A00", "SPIFFY COMPUTER SERVICE DIV.", "10"),
>   ("B01", "PLANNING", "20"),
>   ("C01", "INFORMATION CENTER", "30"),
>   ("D01", "DEVELOPMENT CENTER", null)
>   as DEPARTMENT(DEPTNO, DEPTNAME, MGRNO);
> create temporary view PROJECT as select * from values
>   ("AD3100", "ADMIN SERVICES", "D01"),
>   ("IF1000", "QUERY SERVICES", "C01"),
>   ("IF2000", "USER EDUCATION", "E01"),
>   ("MA2100", "WELD LINE AUDOMATION", "D01"),
>   ("PL2100", "WELD LINE PLANNING", "01")
>   as PROJECT(PROJNO, PROJNAME, DEPTNO);
> {code}
> The SQL below:
> {code:sql}
> SELECT PROJNO, PROJNAME, P.DEPTNO, DEPTNAME
> FROM PROJECT P LEFT OUTER JOIN DEPARTMENT D
> ON P.DEPTNO = D.DEPTNO
> AND P.DEPTNO='E01';
> {code}
> can be optimized to:
> {code:sql}
> SELECT PROJNO, PROJNAME, P.DEPTNO, DEPTNAME
> FROM PROJECT P LEFT OUTER JOIN (SELECT * FROM DEPARTMENT WHERE DEPTNO='E01') D
> ON P.DEPTNO = D.DEPTNO
> AND P.DEPTNO='E01';
> {code}






[jira] [Assigned] (SPARK-25259) Left/Right join support push down during-join predicates

2018-08-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25259:


Assignee: Apache Spark




[jira] [Commented] (SPARK-25259) Left/Right join support push down during-join predicates

2018-08-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16594606#comment-16594606
 ] 

Apache Spark commented on SPARK-25259:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/22250




[jira] [Commented] (SPARK-25259) Left/Right join support push down during-join predicates

2018-08-27 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16594563#comment-16594563
 ] 

Yuming Wang commented on SPARK-25259:
-

I'm working on it.




[jira] [Created] (SPARK-25259) Left/Right join support push down during-join predicates

2018-08-27 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-25259:
---

 Summary: Left/Right join support push down during-join predicates
 Key: SPARK-25259
 URL: https://issues.apache.org/jira/browse/SPARK-25259
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Yuming Wang


For example:
{code:sql}
create temporary view EMPLOYEE as select * from values
  ("10", "HAAS", "A00"),
  ("10", "THOMPSON", "B01"),
  ("30", "KWAN", "C01"),
  ("000110", "LUCCHESSI", "A00"),
  ("000120", "O'CONNELL", "A))"),
  ("000130", "QUINTANA", "C01")
  as EMPLOYEE(EMPNO, LASTNAME, WORKDEPT);

create temporary view DEPARTMENT as select * from values
  ("A00", "SPIFFY COMPUTER SERVICE DIV.", "10"),
  ("B01", "PLANNING", "20"),
  ("C01", "INFORMATION CENTER", "30"),
  ("D01", "DEVELOPMENT CENTER", null)
  as DEPARTMENT(DEPTNO, DEPTNAME, MGRNO);

create temporary view PROJECT as select * from values
  ("AD3100", "ADMIN SERVICES", "D01"),
  ("IF1000", "QUERY SERVICES", "C01"),
  ("IF2000", "USER EDUCATION", "E01"),
  ("MA2100", "WELD LINE AUDOMATION", "D01"),
  ("PL2100", "WELD LINE PLANNING", "01")
  as PROJECT(PROJNO, PROJNAME, DEPTNO);
{code}

The SQL below:
{code:sql}
SELECT PROJNO, PROJNAME, P.DEPTNO, DEPTNAME
FROM PROJECT P LEFT OUTER JOIN DEPARTMENT D
ON P.DEPTNO = D.DEPTNO
AND P.DEPTNO='E01';
{code}

can be optimized to:
{code:sql}
SELECT PROJNO, PROJNAME, P.DEPTNO, DEPTNAME
FROM PROJECT P LEFT OUTER JOIN (SELECT * FROM DEPARTMENT WHERE DEPTNO='E01') D
ON P.DEPTNO = D.DEPTNO
AND P.DEPTNO='E01';
{code}
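
A minimal spark-shell sketch (using the views created above) for checking 
whether the during-join predicate gets pushed down; with the proposed rewrite, 
the optimized plan should show the {{DEPTNO = 'E01'}} filter applied to 
DEPARTMENT rather than only evaluated as part of the join condition:
{code:scala}
// explain(true) prints the parsed, analyzed, optimized, and physical plans, so
// the effect of pushing P.DEPTNO = 'E01' onto DEPARTMENT (via P.DEPTNO = D.DEPTNO)
// is visible in the optimized plan.
spark.sql("""
  SELECT PROJNO, PROJNAME, P.DEPTNO, DEPTNAME
  FROM PROJECT P LEFT OUTER JOIN DEPARTMENT D
  ON P.DEPTNO = D.DEPTNO AND P.DEPTNO = 'E01'
""").explain(true)
{code}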






[jira] [Commented] (SPARK-25258) Upgrade kryo package to version 4.0.2+

2018-08-27 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16594557#comment-16594557
 ] 

Yuming Wang commented on SPARK-25258:
-

I have submitted PR: https://github.com/apache/spark/pull/22179
cc [~srowen]

> Upgrade kryo package to version 4.0.2+
> --
>
> Key: SPARK-25258
> URL: https://issues.apache.org/jira/browse/SPARK-25258
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.3.1
>Reporter: liupengcheng
>Priority: Major
>
> Recently, we encountered a kryo performance issue in spark 2.1.0. The issue 
> affects all kryo versions below 4.0.2, so it seems that all spark versions 
> might encounter it.
> Issue description:
> In the shuffle write phase or some spilling operations, spark will use the 
> kryo serializer to serialize data if `spark.serializer` is set to 
> `KryoSerializer`. However, when the data contains some extremely large 
> records, the KryoSerializer's MapReferenceResolver can grow very large, and 
> its `reset` method will take a long time to reset all items in the 
> writtenObjects table to null.
> com.esotericsoftware.kryo.util.MapReferenceResolver
> {code:java}
> public void reset () {
>  readObjects.clear();
>  writtenObjects.clear();
> }
> public void clear () {
>  K[] keyTable = this.keyTable;
>  for (int i = capacity + stashSize; i-- > 0;)
>   keyTable[i] = null;
>  size = 0;
>  stashSize = 0;
> }
> {code}
> I checked the kryo project on GitHub, and this issue seems to be fixed in 4.0.2+:
> [https://github.com/EsotericSoftware/kryo/commit/77935c696ee4976963aa5c6ac53d53d9b40b8bdd#diff-215fa9846e1e4e54bbeede0500de1e28]
>  
> I was wondering if we could upgrade spark's kryo dependency to 4.0.2+ to fix 
> this problem.






[jira] [Commented] (SPARK-4502) Spark SQL reads unnecessary nested fields from Parquet

2018-08-27 Thread Damian Momot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16594546#comment-16594546
 ] 

Damian Momot commented on SPARK-4502:
-

I can see that this ticket was closed, but looking at 
[https://github.com/apache/spark/pull/21320], only a very basic scenario is 
supported and the feature itself is disabled by default.

Are there any follow-up tickets to track the full feature implementation?

> Spark SQL reads unnecessary nested fields from Parquet
> --
>
> Key: SPARK-4502
> URL: https://issues.apache.org/jira/browse/SPARK-4502
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Liwen Sun
>Assignee: Michael Allman
>Priority: Critical
> Fix For: 2.4.0
>
>
> When reading a field of a nested column from Parquet, SparkSQL reads and 
> assembles all the fields of that nested column. This is unnecessary, as 
> Parquet supports fine-grained field reads out of a nested column. This may 
> degrade performance significantly when a nested column has many fields. 
> For example, I loaded json tweets data into SparkSQL and ran the following 
> query:
> {{SELECT User.contributors_enabled from Tweets;}}
> User is a nested structure that has 38 primitive fields (for Tweets schema, 
> see: https://dev.twitter.com/overview/api/tweets), here is the log message:
> {{14/11/19 16:36:49 INFO InternalParquetRecordReader: Assembled and processed 
> 385779 records from 38 columns in 3976 ms: 97.02691 rec/ms, 3687.0227 
> cell/ms}}
> For comparison, I also ran:
> {{SELECT User FROM Tweets;}}
> And here is the log message:
> {{14/11/19 16:45:40 INFO InternalParquetRecordReader: Assembled and processed 
> 385779 records from 38 columns in 9461 ms: 40.77571 rec/ms, 1549.477 cell/ms}}
> So both queries load 38 columns from Parquet, while the first query only 
> needs 1 column. I also measured the bytes read within Parquet. In these two 
> cases, the same number of bytes (99365194 bytes) were read. 
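
A hedged sketch for following up on the comment above about the feature being 
off by default: the config name below is taken from the linked PR 21320 and 
should be treated as an assumption here, and the Parquet path is illustrative.
{code:scala}
// Assumption: the nested-schema-pruning flag from PR 21320 is named
// spark.sql.optimizer.nestedSchemaPruning.enabled and is disabled by default.
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")

// Illustrative path; any Parquet dataset with a nested "User" struct works.
val tweets = spark.read.parquet("/path/to/tweets.parquet")

// With pruning enabled, ReadSchema in the physical plan should list only the
// selected nested field rather than the whole User struct.
tweets.select("User.contributors_enabled").explain()
{code}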






[jira] [Updated] (SPARK-25256) Plan mismatch errors in Hive tests in 2.12

2018-08-27 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-25256:
-
Shepherd:   (was: Sean Owen)

> Plan mismatch errors in Hive tests in 2.12
> --
>
> Key: SPARK-25256
> URL: https://issues.apache.org/jira/browse/SPARK-25256
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Sean Owen
>Priority: Major
>
> In Hive tests in the Scala 2.12 build, we're still seeing a few failures that 
> seem to show mismatched schema inference. It's not clear whether this is the 
> same as SPARK-25044. Examples:
> {code:java}
> - SPARK-5775 read array from partitioned_parquet_with_key_and_complextypes 
> *** FAILED ***
> Results do not match for query:
> Timezone: 
> sun.util.calendar.ZoneInfo[id="America/Los_Angeles",offset=-2880,dstSavings=360,useDaylight=true,transitions=185,lastRule=java.util.SimpleTimeZone[id=America/Los_Angeles,offset=-2880,dstSavings=360,useDaylight=true,startYear=0,startMode=3,startMonth=2,startDay=8,startDayOfWeek=1,startTime=720,startTimeMode=0,endMode=3,endMonth=10,endDay=1,endDayOfWeek=1,endTime=720,endTimeMode=0]]
> Timezone Env: 
> == Parsed Logical Plan ==
> 'Project ['arrayField, 'p]
> +- 'Filter ('p = 1)
> +- 'UnresolvedRelation `partitioned_parquet_with_key_and_complextypes`
> == Analyzed Logical Plan ==
> arrayField: array<int>, p: int
> Project [arrayField#82569, p#82570]
> +- Filter (p#82570 = 1)
> +- SubqueryAlias `default`.`partitioned_parquet_with_key_and_complextypes`
> +- 
> Relation[intField#82566,stringField#82567,structField#82568,arrayField#82569,p#82570]
>  parquet
> == Optimized Logical Plan ==
> Project [arrayField#82569, p#82570]
> +- Filter (isnotnull(p#82570) && (p#82570 = 1))
> +- 
> Relation[intField#82566,stringField#82567,structField#82568,arrayField#82569,p#82570]
>  parquet
> == Physical Plan ==
> *(1) Project [arrayField#82569, p#82570]
> +- *(1) FileScan parquet 
> default.partitioned_parquet_with_key_and_complextypes[arrayField#82569,p#82570]
>  Batched: false, Format: Parquet, Location: 
> PrunedInMemoryFileIndex[file:/home/srowen/spark-2.12/sql/hive/target/tmp/spark-d8d87d74-33e7-4f22...,
>  PartitionCount: 1, PartitionFilters: [isnotnull(p#82570), (p#82570 = 1)], 
> PushedFilters: [], ReadSchema: struct<arrayField:array<int>>
> == Results ==
> == Results ==
> !== Correct Answer - 10 == == Spark Answer - 10 ==
> !struct<>   struct<arrayField:array<int>,p:int>
> ![Range 1 to 1,1] [WrappedArray(1),1]
> ![Range 1 to 10,1] [WrappedArray(1, 2),1]
> ![Range 1 to 2,1] [WrappedArray(1, 2, 3),1]
> ![Range 1 to 3,1] [WrappedArray(1, 2, 3, 4),1]
> ![Range 1 to 4,1] [WrappedArray(1, 2, 3, 4, 5),1]
> ![Range 1 to 5,1] [WrappedArray(1, 2, 3, 4, 5, 6),1]
> ![Range 1 to 6,1] [WrappedArray(1, 2, 3, 4, 5, 6, 7),1]
> ![Range 1 to 7,1] [WrappedArray(1, 2, 3, 4, 5, 6, 7, 8),1]
> ![Range 1 to 8,1] [WrappedArray(1, 2, 3, 4, 5, 6, 7, 8, 9),1]
> ![Range 1 to 9,1] [WrappedArray(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),1] 
> (QueryTest.scala:163){code}
> {code:java}
> - SPARK-2693 udaf aggregates test *** FAILED ***
> Results do not match for query:
> Timezone: 
> sun.util.calendar.ZoneInfo[id="America/Los_Angeles",offset=-2880,dstSavings=360,useDaylight=true,transitions=185,lastRule=java.util.SimpleTimeZone[id=America/Los_Angeles,offset=-2880,dstSavings=360,useDaylight=true,startYear=0,startMode=3,startMonth=2,startDay=8,startDayOfWeek=1,startTime=720,startTimeMode=0,endMode=3,endMonth=10,endDay=1,endDayOfWeek=1,endTime=720,endTimeMode=0]]
> Timezone Env: 
> == Parsed Logical Plan ==
> 'GlobalLimit 1
> +- 'LocalLimit 1
> +- 'Project [unresolvedalias('percentile('key, 'array(1, 1)), None)]
> +- 'UnresolvedRelation `src`
> == Analyzed Logical Plan ==
> percentile(key, array(1, 1), 1): array<double>
> GlobalLimit 1
> +- LocalLimit 1
> +- Aggregate [percentile(key#205098, cast(array(1, 1) as array<double>), 1, 
> 0, 0) AS percentile(key, array(1, 1), 1)#205101]
> +- SubqueryAlias `default`.`src`
> +- HiveTableRelation `default`.`src`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key#205098, value#205099]
> == Optimized Logical Plan ==
> GlobalLimit 1
> +- LocalLimit 1
> +- Aggregate [percentile(key#205098, [1.0,1.0], 1, 0, 0) AS percentile(key, 
> array(1, 1), 1)#205101]
> +- Project [key#205098]
> +- HiveTableRelation `default`.`src`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key#205098, value#205099]
> == Physical Plan ==
> CollectLimit 1
> +- ObjectHashAggregate(keys=[], functions=[percentile(key#205098, [1.0,1.0], 
> 1, 0, 0)], output=[percentile(key, array(1, 1), 1)#205101])
> +- Exchange SinglePartition
> +- ObjectHashAggregate(keys=[], functions=[partial_percentile(key#205098, 
> [1.0,1.0], 1, 0, 0)], output=[buf#205104])
> +- Scan hive default.src [key#205098], HiveTableRelation `default`.`src`, 

[jira] [Commented] (SPARK-25251) Make spark-csv's `quote` and `escape` options conform to RFC 4180

2018-08-27 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16594522#comment-16594522
 ] 

Hyukjin Kwon commented on SPARK-25251:
--

This is a duplicate of SPARK-22236

> Make spark-csv's `quote` and `escape` options conform to RFC 4180
> -
>
> Key: SPARK-25251
> URL: https://issues.apache.org/jira/browse/SPARK-25251
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.3.0, 2.3.1, 2.4.0, 3.0.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> As described in [RFC-4180|https://tools.ietf.org/html/rfc4180], page 2 -
> {noformat}
>7. If double-quotes are used to enclose fields, then a double-quote 
> appearing inside a field must be escaped by preceding it with another double 
> quote
> {noformat}
> That's what Excel does, for example, by default.
> In Spark (as of Spark 2.1), however, escaping is done by default in a non-RFC 
> way, using a backslash (\). To fix this you have to explicitly tell Spark to 
> use a double quote as the escape character:
> {code}
> .option('quote', '"') 
> .option('escape', '"')
> {code}
> This may explain cases where a comma character inside a quoted column wasn't 
> interpreted as being inside the quotes.
> So this is a request to make the spark-csv reader RFC 4180 compatible with 
> regard to the default option values for `quote` and `escape` (make both equal 
> to " ).
> Since this is a backward-incompatible change, Spark 3.0 might be a good 
> release for this change.
> Some more background - https://stackoverflow.com/a/45138591/470583 
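
For reference, a minimal Scala sketch of the workaround described above; the 
header option and the file path are illustrative:
{code:scala}
// Use " as both the quote and the escape character so that "" inside a quoted
// field is read the RFC 4180 way (as a literal double quote).
val df = spark.read
  .option("header", "true")        // assumption: the file has a header row
  .option("quote", "\"")
  .option("escape", "\"")
  .csv("/path/to/rfc4180.csv")     // illustrative path
df.show(truncate = false)
{code}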






[jira] [Resolved] (SPARK-25251) Make spark-csv's `quote` and `escape` options conform to RFC 4180

2018-08-27 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25251.
--
Resolution: Duplicate




[jira] [Commented] (SPARK-23255) Add user guide and examples for DataFrame image reading functions

2018-08-27 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16594514#comment-16594514
 ] 

Hyukjin Kwon commented on SPARK-23255:
--

BTW, please be concise and precise in the documentation and contents, since this 
is also one of the main features in Spark 2.3.

> Add user guide and examples for DataFrame image reading functions
> -
>
> Key: SPARK-23255
> URL: https://issues.apache.org/jira/browse/SPARK-23255
> Project: Spark
>  Issue Type: Documentation
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Priority: Minor
>
> SPARK-21866 added built-in support for reading image data into a DataFrame. 
> This new functionality should be documented in the user guide, with example 
> usage.
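
A rough sketch of the kind of snippet such a guide could include, assuming the 
Spark 2.3 {{ImageSchema}} API added by SPARK-21866; the directory path is 
illustrative:
{code:scala}
import org.apache.spark.ml.image.ImageSchema

// Read a directory of images into a DataFrame with a single "image" struct column.
val images = ImageSchema.readImages("data/mllib/images")  // illustrative path
images.printSchema()
images.select("image.origin", "image.width", "image.height").show(truncate = false)
{code}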






[jira] [Commented] (SPARK-23255) Add user guide and examples for DataFrame image reading functions

2018-08-27 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16594513#comment-16594513
 ] 

Hyukjin Kwon commented on SPARK-23255:
--

Yea, I think so. For instance, https://github.com/apache/spark/pull/22121




[jira] [Resolved] (SPARK-21232) New built-in SQL function - Data_Type

2018-08-27 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-21232.
--
Resolution: Won't Fix

Resolving this per the discussion in the JIRA

> New built-in SQL function - Data_Type
> -
>
> Key: SPARK-21232
> URL: https://issues.apache.org/jira/browse/SPARK-21232
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SparkR, SQL
>Affects Versions: 2.1.1
>Reporter: Mario Molina
>Priority: Minor
>
> This function returns the data type of a given column.
> {code:java}
> data_type("a")
> // returns string
> {code}
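
For anyone looking for this functionality after the Won't Fix: a small 
spark-shell sketch showing how a column's data type can already be read from 
the DataFrame schema (the column names here are just examples):
{code:scala}
import spark.implicits._

// Look up the type of column "a" directly from the schema.
val df = Seq(("x", 1)).toDF("a", "b")
val aType = df.schema("a").dataType.simpleString  // "string"
println(aType)
{code}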






[jira] [Commented] (SPARK-23255) Add user guide and examples for DataFrame image reading functions

2018-08-27 Thread Divay Jindal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16594488#comment-16594488
 ] 

Divay Jindal commented on SPARK-23255:
--

Hey, I have a very naive doubt: for this task, do I need to add a guide to the 
docs/ directory, or is it code-level documentation and examples?




[jira] [Commented] (SPARK-24391) from_json should support arrays of primitives, and more generally all JSON

2018-08-27 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16594479#comment-16594479
 ] 

Hyukjin Kwon commented on SPARK-24391:
--

For to_json, a separate JIRA was filed: SPARK-25252.

> from_json should support arrays of primitives, and more generally all JSON 
> ---
>
> Key: SPARK-24391
> URL: https://issues.apache.org/jira/browse/SPARK-24391
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Sam Kitajima-Kimbrel
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 2.4.0
>
>
> https://issues.apache.org/jira/browse/SPARK-19849 and 
> https://issues.apache.org/jira/browse/SPARK-21513 brought support for more 
> column types to functions.to_json/from_json, but I also have cases where I'd 
> like to simply (de)serialize an array of primitives to/from JSON when 
> outputting to certain destinations, which does not work:
> {code:java}
> scala> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.functions._
> scala> import spark.implicits._
> import spark.implicits._
> scala> val df = Seq("[1, 2, 3]").toDF("a")
> df: org.apache.spark.sql.DataFrame = [a: string]
> scala> val schema = new ArrayType(IntegerType, false)
> schema: org.apache.spark.sql.types.ArrayType = ArrayType(IntegerType,false)
> scala> df.select(from_json($"a", schema))
> org.apache.spark.sql.AnalysisException: cannot resolve 'jsontostructs(`a`)' 
> due to data type mismatch: Input schema array<int> must be a struct or an 
> array of structs.;;
> 'Project [jsontostructs(ArrayType(IntegerType,false), a#3, 
> Some(America/Los_Angeles)) AS jsontostructs(a)#10]
> scala> val arrayDf = Seq(Array(1, 2, 3)).toDF("arr")
> arrayDf: org.apache.spark.sql.DataFrame = [arr: array<int>]
> scala> arrayDf.select(to_json($"arr"))
> org.apache.spark.sql.AnalysisException: cannot resolve 'structstojson(`arr`)' 
> due to data type mismatch: Input type array<int> must be a struct, array of 
> structs or a map or array of map.;;
> 'Project [structstojson(arr#19, Some(America/Los_Angeles)) AS 
> structstojson(arr)#26]
> {code}
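
Given the Fix Version above, a short sketch of the behaviour this issue asks 
for, as it is expected to work once the from_json change lands in 2.4.0 (the 
exact output is not shown here):
{code:scala}
// Expected with the 2.4.0 fix: parse a JSON array of primitives directly.
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{ArrayType, IntegerType}
import spark.implicits._

val df = Seq("[1, 2, 3]").toDF("a")
df.select(from_json($"a", ArrayType(IntegerType, containsNull = false))).show()
{code}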






[jira] [Updated] (SPARK-24391) from_json should support arrays of primitives, and more generally all JSON

2018-08-27 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-24391:
-
Summary: from_json should support arrays of primitives, and more generally 
all JSON   (was: to_json/from_json should support arrays of primitives, and 
more generally all JSON )




[jira] [Resolved] (SPARK-25213) DataSourceV2 doesn't seem to produce unsafe rows

2018-08-27 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-25213.
-
   Resolution: Fixed
 Assignee: Li Jin
Fix Version/s: 2.4.0

> DataSourceV2 doesn't seem to produce unsafe rows 
> -
>
> Key: SPARK-25213
> URL: https://issues.apache.org/jira/browse/SPARK-25213
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Li Jin
>Assignee: Li Jin
>Priority: Major
> Fix For: 2.4.0
>
>
> Reproduce (Need to compile test-classes):
> bin/pyspark --driver-class-path sql/core/target/scala-2.11/test-classes
> {code:java}
> datasource_v2_df = spark.read \
> .format("org.apache.spark.sql.sources.v2.SimpleDataSourceV2") 
> \
> .load()
> result = datasource_v2_df.withColumn('x', udf(lambda x: x, 
> 'int')(datasource_v2_df['i']))
> result.show()
> {code}
> The above code fails with:
> {code:java}
> Caused by: java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast 
> to org.apache.spark.sql.catalyst.expressions.UnsafeRow
>   at 
> org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1$$anonfun$5.apply(EvalPythonExec.scala:127)
>   at 
> org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1$$anonfun$5.apply(EvalPythonExec.scala:126)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
> {code}
> It seems like Data Source V2 doesn't produce UnsafeRows here.






[jira] [Assigned] (SPARK-24721) Failed to use PythonUDF with literal inputs in filter with data sources

2018-08-27 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-24721:
---

Assignee: Li Jin

> Failed to use PythonUDF with literal inputs in filter with data sources
> ---
>
> Key: SPARK-24721
> URL: https://issues.apache.org/jira/browse/SPARK-24721
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.3.1
>Reporter: Xiao Li
>Assignee: Li Jin
>Priority: Major
> Fix For: 2.4.0
>
>
> {code}
> import random
> from pyspark.sql.functions import *
> from pyspark.sql.types import *
> def random_probability(label):
> if label == 1.0:
>   return random.uniform(0.5, 1.0)
> else:
>   return random.uniform(0.0, 0.4999)
> def randomize_label(ratio):
> 
> if random.random() >= ratio:
>   return 1.0
> else:
>   return 0.0
> random_probability = udf(random_probability, DoubleType())
> randomize_label = udf(randomize_label, DoubleType())
> spark.range(10).write.mode("overwrite").format('csv').save("/tmp/tab3")
> babydf = spark.read.csv("/tmp/tab3")
> data_modified_label = babydf.withColumn(
>   'random_label', randomize_label(lit(1 - 0.1))
> )
> data_modified_random = data_modified_label.withColumn(
>   'random_probability', 
>   random_probability(col('random_label'))
> )
> data_modified_label.filter(col('random_label') == 0).show()
> {code}
> The above code will generate the following exception:
> {code}
> Py4JJavaError: An error occurred while calling o446.showString.
> : java.lang.RuntimeException: Invalid PythonUDF randomize_label(0.9), 
> requires attributes from more than one child.
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:166)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:165)
>   at scala.collection.immutable.Stream.foreach(Stream.scala:594)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:165)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:116)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:112)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:77)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 

[jira] [Resolved] (SPARK-24721) Failed to use PythonUDF with literal inputs in filter with data sources

2018-08-27 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-24721.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22104
[https://github.com/apache/spark/pull/22104]


[jira] [Created] (SPARK-25258) Upgrade kryo package to version 4.0.2+

2018-08-27 Thread liupengcheng (JIRA)
liupengcheng created SPARK-25258:


 Summary: Upgrade kryo package to version 4.0.2+
 Key: SPARK-25258
 URL: https://issues.apache.org/jira/browse/SPARK-25258
 Project: Spark
  Issue Type: Wish
  Components: Spark Core
Affects Versions: 2.3.1, 2.1.0
Reporter: liupengcheng


Recently, we encountered a kryo performance issue in spark 2.1.0. The issue 
affects all kryo versions below 4.0.2, so it seems that all spark versions might 
encounter it.

Issue description:

In the shuffle write phase or some spilling operations, spark will use the kryo 
serializer to serialize data if `spark.serializer` is set to `KryoSerializer`. 
However, when the data contains some extremely large records, the 
KryoSerializer's MapReferenceResolver can grow very large, and its `reset` 
method will take a long time to reset all items in the writtenObjects table to 
null.

com.esotericsoftware.kryo.util.MapReferenceResolver
{code:java}
public void reset () {
 readObjects.clear();
 writtenObjects.clear();
}

public void clear () {
 K[] keyTable = this.keyTable;
 for (int i = capacity + stashSize; i-- > 0;)
  keyTable[i] = null;
 size = 0;
 stashSize = 0;
}
{code}
I checked the kryo project on GitHub, and this issue seems to be fixed in 4.0.2+:

[https://github.com/EsotericSoftware/kryo/commit/77935c696ee4976963aa5c6ac53d53d9b40b8bdd#diff-215fa9846e1e4e54bbeede0500de1e28]

I was wondering if we could upgrade spark's kryo dependency to 4.0.2+ to fix 
this problem.
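
For context, the slow {{reset()}} path described above is only exercised when 
Kryo is the configured serializer; a minimal sketch of that configuration (the 
class name is the standard one from the Spark docs, the app/master settings are 
illustrative):
{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// The MapReferenceResolver.reset() cost above only applies when spark.serializer
// points at KryoSerializer (it is not the default serializer).
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

val spark = SparkSession.builder()
  .config(conf)
  .appName("kryo-serializer-example")   // illustrative
  .master("local[*]")                   // illustrative
  .getOrCreate()
{code}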






[jira] [Updated] (SPARK-23207) Shuffle+Repartition on an DataFrame could lead to incorrect answers

2018-08-27 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-23207:

Fix Version/s: 2.1.4

> Shuffle+Repartition on an DataFrame could lead to incorrect answers
> ---
>
> Key: SPARK-23207
> URL: https://issues.apache.org/jira/browse/SPARK-23207
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0, 2.3.0
>Reporter: Jiang Xingbo
>Assignee: Jiang Xingbo
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.1.4, 2.2.3, 2.3.0
>
>
> Currently, shuffle repartition uses RoundRobinPartitioning; the generated 
> result is nondeterministic since the order of the input rows is not 
> determined.
> The bug can be triggered when there is a repartition call following a shuffle 
> (which would lead to non-deterministic row ordering), as the pattern shows 
> below:
> upstream stage -> repartition stage -> result stage
> (-> indicates a shuffle)
> When one of the executor processes goes down, some tasks in the repartition 
> stage will be retried and generate inconsistent ordering, and some tasks of 
> the result stage will be retried, generating different data.
> The following code returns 931532, instead of 1000000:
> {code}
> import scala.sys.process._
> import org.apache.spark.TaskContext
> val res = spark.range(0, 1000 * 1000, 1).repartition(200).map { x =>
>   x
> }.repartition(200).map { x =>
>   if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) {
> throw new Exception("pkill -f java".!!)
>   }
>   x
> }
> res.distinct().count()
> {code}






[jira] [Updated] (SPARK-25164) Parquet reader builds entire list of columns once for each column

2018-08-27 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25164:

Fix Version/s: 2.3.2
   2.2.3

> Parquet reader builds entire list of columns once for each column
> -
>
> Key: SPARK-25164
> URL: https://issues.apache.org/jira/browse/SPARK-25164
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Bruce Robbins
>Assignee: Bruce Robbins
>Priority: Minor
> Fix For: 2.2.3, 2.3.2, 2.4.0
>
>
> {{VectorizedParquetRecordReader.initializeInternal}} loops through each 
> column, and for each column it calls
> {noformat}
> requestedSchema.getColumns().get(i)
> {noformat}
> However, {{MessageType.getColumns}} will build the entire column list from 
> getPaths(0).
> {noformat}
>   public List<ColumnDescriptor> getColumns() {
>     List<String[]> paths = this.getPaths(0);
>     List<ColumnDescriptor> columns = new 
> ArrayList<ColumnDescriptor>(paths.size());
> for (String[] path : paths) {
>   // TODO: optimize this  
>   
>   PrimitiveType primitiveType = getType(path).asPrimitiveType();
>   columns.add(new ColumnDescriptor(
>   path,
>   primitiveType,
>   getMaxRepetitionLevel(path),
>   getMaxDefinitionLevel(path)));
> }
> return columns;
>   }
> {noformat}
> This means that for each parquet file, this routine indirectly iterates 
> colCount*colCount times.
> This is actually not particularly noticeable unless you have:
>  - many parquet files
>  - many columns
> To verify that this is an issue, I created a 1 million record parquet table 
> with 6000 columns of type double and 67 files (so initializeInternal is 
> called 67 times). I ran the following query:
> {noformat}
> sql("select * from 6000_1m_double where id1 = 1").collect
> {noformat}
> I used Spark from the master branch. I had 8 executor threads. The filter 
> returns only a few thousand records. The query ran (on average) for 6.4 
> minutes.
> Then I cached the column list at the top of {{initializeInternal}} as follows:
> {noformat}
> List<ColumnDescriptor> columnCache = requestedSchema.getColumns();
> {noformat}
> Then I changed {{initializeInternal}} to use {{columnCache}} rather than 
> {{requestedSchema.getColumns()}}.
> With the column cache variable, the same query runs in 5 minutes. So with my 
> simple query, you save 22% of the time by not rebuilding the column list for 
> each column.
> You get additional savings with a paths cache variable, now saving 34% in 
> total on the above query.






[jira] [Commented] (SPARK-25257) v2 MicroBatchReaders can't resume from checkpoints

2018-08-27 Thread Seth Fitzsimmons (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16594298#comment-16594298
 ] 

Seth Fitzsimmons commented on SPARK-25257:
--

I traced this a bit further; {{OffsetSeq.toStreamProgress}} looks like it might 
be the right place to deserialize. {{deserialize.patch}} deserializes if the 
source is a {{MicroBatchReader}} (i.e. has a {{deserializeOffset}} method).

{{runBatch}} seems partially right (it's specific to {{MicroBatchReader}}s), 
but {{available}} (a variable in scope) may be either a {{SerializedOffset}} or 
an {{Offset}}, depending on whether a checkpoint is being resumed or it's 
starting from a clean slate.

> v2 MicroBatchReaders can't resume from checkpoints
> --
>
> Key: SPARK-25257
> URL: https://issues.apache.org/jira/browse/SPARK-25257
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Seth Fitzsimmons
>Priority: Major
> Attachments: deserialize.patch
>
>
> When resuming from a checkpoint:
> {code:java}
> writeStream.option("checkpointLocation", 
> "/tmp/checkpoint").format("console").start
> {code}
> The stream reader fails with:
> {noformat}
> osmesa.common.streaming.AugmentedDiffMicroBatchReader@59e19287
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
> Caused by: java.lang.ClassCastException: 
> org.apache.spark.sql.execution.streaming.SerializedOffset cannot be cast to 
> org.apache.spark.sql.sources.v2.reader.streaming.Offset
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1$$anonfun$apply$9.apply(MicroBatchExecution.scala:405)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1$$anonfun$apply$9.apply(MicroBatchExecution.scala:390)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:891)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at 
> org.apache.spark.sql.execution.streaming.StreamProgress.foreach(StreamProgress.scala:25)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at 
> org.apache.spark.sql.execution.streaming.StreamProgress.flatMap(StreamProgress.scala:25)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:390)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:390)
>   at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:389)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:133)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
>   at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:121)
>   at 
> org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:117)
>

[jira] [Updated] (SPARK-25257) v2 MicroBatchReaders can't resume from checkpoints

2018-08-27 Thread Seth Fitzsimmons (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seth Fitzsimmons updated SPARK-25257:
-
Description: 
When resuming from a checkpoint:
{code:java}
writeStream.option("checkpointLocation", 
"/tmp/checkpoint").format("console").start
{code}
The stream reader fails with:
{noformat}
osmesa.common.streaming.AugmentedDiffMicroBatchReader@59e19287
at 
org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295)
at 
org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
Caused by: java.lang.ClassCastException: 
org.apache.spark.sql.execution.streaming.SerializedOffset cannot be cast to 
org.apache.spark.sql.sources.v2.reader.streaming.Offset
at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1$$anonfun$apply$9.apply(MicroBatchExecution.scala:405)
at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1$$anonfun$apply$9.apply(MicroBatchExecution.scala:390)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at 
org.apache.spark.sql.execution.streaming.StreamProgress.foreach(StreamProgress.scala:25)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at 
org.apache.spark.sql.execution.streaming.StreamProgress.flatMap(StreamProgress.scala:25)
at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:390)
at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:390)
at 
org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
at 
org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:389)
at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:133)
at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
at 
org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
at 
org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:121)
at 
org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:117)
at 
org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279)
... 1 more
{noformat}
The root cause appears to be that the {{SerializedOffset}} (JSON, from disk) is 
never deserialized; I would expect to see something along the lines of 
{{reader.deserializeOffset(off.json)}} here (unless {{available}} is intended 
to be deserialized elsewhere):

[https://github.com/apache/spark/blob/4dc82259d81102e0cb48f4cb2e8075f80d899ac4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L405]

  was:
When resuming from a checkpoint:

{code:java}
writeStream.option("checkpointLocation", 
"/tmp/checkpoint").format("console").start
{code}

The stream reader fails with:


{noformat}
osmesa.common.streaming.AugmentedDiffMicroBatchReader@59e19287
at 
org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295)
at 
o

[jira] [Updated] (SPARK-25257) v2 MicroBatchReaders can't resume from checkpoints

2018-08-27 Thread Seth Fitzsimmons (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seth Fitzsimmons updated SPARK-25257:
-
Summary: v2 MicroBatchReaders can't resume from checkpoints  (was: 
java.lang.ClassCastException: 
org.apache.spark.sql.execution.streaming.SerializedOffset cannot be cast to 
org.apache.spark.sql.sources.v2.reader.streaming.Offset)

> v2 MicroBatchReaders can't resume from checkpoints
> --
>
> Key: SPARK-25257
> URL: https://issues.apache.org/jira/browse/SPARK-25257
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Seth Fitzsimmons
>Priority: Major
> Attachments: deserialize.patch
>
>
> When resuming from a checkpoint:
> {code:java}
> writeStream.option("checkpointLocation", 
> "/tmp/checkpoint").format("console").start
> {code}
> The stream reader fails with:
> {noformat}
> osmesa.common.streaming.AugmentedDiffMicroBatchReader@59e19287
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
> Caused by: java.lang.ClassCastException: 
> org.apache.spark.sql.execution.streaming.SerializedOffset cannot be cast to 
> org.apache.spark.sql.sources.v2.reader.streaming.Offset
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1$$anonfun$apply$9.apply(MicroBatchExecution.scala:405)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1$$anonfun$apply$9.apply(MicroBatchExecution.scala:390)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:891)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at 
> org.apache.spark.sql.execution.streaming.StreamProgress.foreach(StreamProgress.scala:25)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at 
> org.apache.spark.sql.execution.streaming.StreamProgress.flatMap(StreamProgress.scala:25)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:390)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:390)
>   at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:389)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:133)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
>   at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:121)
>   at 
> org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:117)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279)
>   ... 1 more
> {noformat}
> The root cause appears to be that the `SerializedOffset` (JSON, from disk) is 
> never deserialized; I would expect to 

[jira] [Updated] (SPARK-25257) java.lang.ClassCastException: org.apache.spark.sql.execution.streaming.SerializedOffset cannot be cast to org.apache.spark.sql.sources.v2.reader.streaming.Offset

2018-08-27 Thread Seth Fitzsimmons (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seth Fitzsimmons updated SPARK-25257:
-
Attachment: deserialize.patch

> java.lang.ClassCastException: 
> org.apache.spark.sql.execution.streaming.SerializedOffset cannot be cast to 
> org.apache.spark.sql.sources.v2.reader.streaming.Offset
> -
>
> Key: SPARK-25257
> URL: https://issues.apache.org/jira/browse/SPARK-25257
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Seth Fitzsimmons
>Priority: Major
> Attachments: deserialize.patch
>
>
> When resuming from a checkpoint:
> {code:java}
> writeStream.option("checkpointLocation", 
> "/tmp/checkpoint").format("console").start
> {code}
> The stream reader fails with:
> {noformat}
> osmesa.common.streaming.AugmentedDiffMicroBatchReader@59e19287
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
> Caused by: java.lang.ClassCastException: 
> org.apache.spark.sql.execution.streaming.SerializedOffset cannot be cast to 
> org.apache.spark.sql.sources.v2.reader.streaming.Offset
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1$$anonfun$apply$9.apply(MicroBatchExecution.scala:405)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1$$anonfun$apply$9.apply(MicroBatchExecution.scala:390)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:891)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at 
> org.apache.spark.sql.execution.streaming.StreamProgress.foreach(StreamProgress.scala:25)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at 
> org.apache.spark.sql.execution.streaming.StreamProgress.flatMap(StreamProgress.scala:25)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:390)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:390)
>   at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:389)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:133)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
>   at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:121)
>   at 
> org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:117)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279)
>   ... 1 more
> {noformat}
> The root cause appears to be that the `SerializedOffset` (JSON, from disk) is 
> never deseria

[jira] [Commented] (SPARK-25250) Race condition with tasks running when new attempt for same stage is created leads to other task in the next attempt running on the same partition id retry multiple ti

2018-08-27 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16594281#comment-16594281
 ] 

Thomas Graves commented on SPARK-25250:
---

We are hitting a race condition here between the TaskSetManager and the 
DAGScheduler.

There is code in the TaskSetManager that is supposed to mark all task attempts 
for a partition as completed when any of them succeeds, but in this case the 
second stage attempt had not been created yet when that happened. There is 
also code that only starts tasks in the second attempt for partitions that 
have not yet finished, but again there is a race between when the 
TaskSetManager sends the message that the task has ended and when the 
DAGScheduler starts a new stage attempt.

In the example given above, stage 4.0 hit a fetch failure, so the map stage 
was rerun. Task 9000 for partition 9000 in stage 4.0 finishes and sends a 
taskEnded message to the DAGScheduler. Before the DAGScheduler has processed 
that the task finished, it calculates the tasks needed for stage 4.1, which 
still include the task for partition 9000, so it runs a task for partition 
9000 that always fails with a commit denied and keeps being rerun.

When task 9000 for stage 4.0 finished, the TaskSetManager called into 
sched.markPartitionCompletedInAllTaskSets, but since the 4.1 stage attempt 
hadn't been created yet it didn't mark task 2000 for that partition as 
completed; and since the DAGScheduler hadn't processed the taskEnd event 
either, when it started stage 4.1 it launched a task it didn't need to. We 
need to figure out how to handle this race.
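
To make the ordering concrete, here is a minimal, hypothetical model of the 
interleaving (illustrative Scala only, not Spark's actual scheduler code; the 
class and field names are made up):

{code:scala}
import scala.collection.mutable

// Toy model of the race: partition completion is recorded per stage attempt,
// but the new attempt is computed from a stale view of finished partitions.
case class Attempt(id: String, completed: mutable.Set[Int] = mutable.Set.empty)

val attempt40 = Attempt("4.0")

// 1. The task for partition 9000 in 4.0 succeeds. The TaskSetManager marks the
//    partition complete in every attempt that exists *right now* (only 4.0).
attempt40.completed += 9000

// 2. The DAGScheduler has not yet processed that taskEnded event, so when it
//    builds attempt 4.1 it still considers partition 9000 missing.
val attempt41 = Attempt("4.1")
val missingIn41 = Set(9000, 9001) -- attempt41.completed

// 3. Attempt 4.1 therefore launches a task for partition 9000, whose commit is
//    then denied because 4.0 already committed output for that partition.
println(missingIn41) // Set(9000, 9001): partition 9000 is still scheduled
{code}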

> Race condition with tasks running when new attempt for same stage is created 
> leads to other task in the next attempt running on the same partition id 
> retry multiple times
> --
>
> Key: SPARK-25250
> URL: https://issues.apache.org/jira/browse/SPARK-25250
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.3.1
>Reporter: Parth Gandhi
>Priority: Major
>
> We recently had a scenario where a race condition occurred when a task from 
> previous stage attempt just finished before new attempt for the same stage 
> was created due to fetch failure, so the new task created in the second 
> attempt on the same partition id was retrying multiple times due to 
> TaskCommitDenied Exception without realizing that the task in earlier attempt 
> was already successful.  
> For example, consider a task with partition id 9000 and index 9000 running in 
> stage 4.0. We see a fetch failure so thus, we spawn a new stage attempt 4.1. 
> Just within this timespan, the above task completes successfully, thus, 
> marking the partition id 9000 as complete for 4.0. However, as stage 4.1 has 
> not yet been created, the taskset info for that stage is not available to the 
> TaskScheduler so, naturally, the partition id 9000 has not been marked 
> completed for 4.1. Stage 4.1 now spawns task with index 2000 on the same 
> partition id 9000. This task fails due to CommitDeniedException and since, it 
> does not see the corresponding partition id as been marked successful, it 
> keeps retrying multiple times until the job finally succeeds. It doesn't 
> cause any job failures because the DAG scheduler is tracking the partitions 
> separate from the task set managers.
>  
> Steps to Reproduce:
>  # Run any large job involving shuffle operation.
>  # When the ShuffleMap stage finishes and the ResultStage begins running, 
> cause this stage to throw a fetch failure exception(Try deleting certain 
> shuffle files on any host).
>  # Observe the task attempt numbers for the next stage attempt. Please note 
> that this issue is an intermittent one, so it might not happen all the time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21097) Dynamic allocation will preserve cached data

2018-08-27 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16594271#comment-16594271
 ] 

Erik Erlandson commented on SPARK-21097:


I'm wondering if this is going to be subsumed by the Shuffle Service redesign 
proposal.

cc [~mcheah]

> Dynamic allocation will preserve cached data
> 
>
> Key: SPARK-21097
> URL: https://issues.apache.org/jira/browse/SPARK-21097
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Scheduler, Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Brad
>Priority: Major
> Attachments: Preserving Cached Data with Dynamic Allocation.pdf
>
>
> We want to use dynamic allocation to distribute resources among many notebook 
> users on our spark clusters. One difficulty is that if a user has cached data 
> then we are either prevented from de-allocating any of their executors, or we 
> are forced to drop their cached data, which can lead to a bad user experience.
> We propose adding a feature to preserve cached data by copying it to other 
> executors before de-allocation. This behavior would be enabled by a simple 
> spark config. Now when an executor reaches its configured idle timeout, 
> instead of just killing it on the spot, we will stop sending it new tasks, 
> replicate all of its rdd blocks onto other executors, and then kill it. If 
> there is an issue while we replicate the data (an error, it taking too long, 
> or not enough space), then we will fall back to the original behavior: drop 
> the data and kill the executor.
> This feature should allow anyone with notebook users to use their cluster 
> resources more efficiently. Also, since it will be completely opt-in, it is 
> unlikely to cause problems for other use cases.
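
As a rough usage sketch of how this could be switched on (note: the 
replication flag name below is purely hypothetical, since the proposal only 
says the behavior would sit behind a simple Spark config; the two 
dynamic-allocation settings are existing options):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("notebook-session")
  // Existing settings: dynamic allocation plus the idle timeout after which
  // executors become candidates for removal.
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.executorIdleTimeout", "120s")
  // Hypothetical opt-in switch for the proposed behavior: replicate cached RDD
  // blocks to surviving executors before an idle executor is removed.
  .config("spark.dynamicAllocation.replicateCachedDataBeforeRemoval", "true")
  .getOrCreate()
{code}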



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25257) java.lang.ClassCastException: org.apache.spark.sql.execution.streaming.SerializedOffset cannot be cast to org.apache.spark.sql.sources.v2.reader.streaming.Offset

2018-08-27 Thread Seth Fitzsimmons (JIRA)
Seth Fitzsimmons created SPARK-25257:


 Summary: java.lang.ClassCastException: 
org.apache.spark.sql.execution.streaming.SerializedOffset cannot be cast to 
org.apache.spark.sql.sources.v2.reader.streaming.Offset
 Key: SPARK-25257
 URL: https://issues.apache.org/jira/browse/SPARK-25257
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.3.1
Reporter: Seth Fitzsimmons


When resuming from a checkpoint:

{code:java}
writeStream.option("checkpointLocation", 
"/tmp/checkpoint").format("console").start
{code}

The stream reader fails with:


{noformat}
osmesa.common.streaming.AugmentedDiffMicroBatchReader@59e19287
at 
org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295)
at 
org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
Caused by: java.lang.ClassCastException: 
org.apache.spark.sql.execution.streaming.SerializedOffset cannot be cast to 
org.apache.spark.sql.sources.v2.reader.streaming.Offset
at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1$$anonfun$apply$9.apply(MicroBatchExecution.scala:405)
at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1$$anonfun$apply$9.apply(MicroBatchExecution.scala:390)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at 
org.apache.spark.sql.execution.streaming.StreamProgress.foreach(StreamProgress.scala:25)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at 
org.apache.spark.sql.execution.streaming.StreamProgress.flatMap(StreamProgress.scala:25)
at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:390)
at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:390)
at 
org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
at 
org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:389)
at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:133)
at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
at 
org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
at 
org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:121)
at 
org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:117)
at 
org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279)
... 1 more
{noformat}

The root cause appears to be that the `SerializedOffset` (JSON, from disk) is 
never deserialized; I would expect to see something along the lines of 
`reader.deserializeOffset(off.json)` here (unless `available` is intended to be 
deserialized elsewhere):

https://github.com/apache/spark/blob/4dc82259d81102e0cb48f4cb2e8075f80d899ac4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L405
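
For illustration, a minimal sketch of the kind of handling being suggested 
(this is an assumption about the fix, not a quote of Spark's code or of any 
attached patch):

{code:scala}
import org.apache.spark.sql.execution.streaming.SerializedOffset
import org.apache.spark.sql.sources.v2.reader.streaming.{MicroBatchReader, Offset => OffsetV2}

// Offsets restored from the checkpoint arrive as SerializedOffset (raw JSON)
// and need to be turned back into the reader's own v2 Offset type before any
// cast, e.g. via the reader's deserializeOffset.
def toV2Offset(reader: MicroBatchReader, available: Any): OffsetV2 = available match {
  case s: SerializedOffset => reader.deserializeOffset(s.json) // restored from disk
  case o: OffsetV2         => o                                // already a v2 offset
}
{code}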



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-

[jira] [Resolved] (SPARK-24090) Kubernetes Backend Hotlist for Spark 2.4

2018-08-27 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-24090.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 4
[https://github.com/apache/spark/pull/4]

> Kubernetes Backend Hotlist for Spark 2.4
> 
>
> Key: SPARK-24090
> URL: https://issues.apache.org/jira/browse/SPARK-24090
> Project: Spark
>  Issue Type: Umbrella
>  Components: Kubernetes, Scheduler
>Affects Versions: 2.4.0
>Reporter: Anirudh Ramanathan
>Assignee: Anirudh Ramanathan
>Priority: Major
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25175) Case-insensitive field resolution when reading from ORC

2018-08-27 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16594213#comment-16594213
 ] 

Dongjoon Hyun commented on SPARK-25175:
---

Okay, thank you for the details, [~seancxmao].

BTW, to me the higher purpose of this JIRA is the reverse of its title: what 
you are aiming for is to support `Case Sensitivity` in Spark, right?

In general, `spark.sql.caseSensitive` has been `false` by default since Spark 
2.0.0, and with that default Spark prevents you from generating such data in 
the first place (not only for ORC):
{code}
scala> data.write.format("orc").mode("overwrite").save("/tmp/orc")
org.apache.spark.sql.AnalysisException: Found duplicate column(s) when 
inserting into file:/tmp/orc: `c`;
{code}

Thus, your use case is specifically *generating* data with 
`spark.sql.caseSensitive=true` and then reading it back with 
`spark.sql.caseSensitive=false` (without problems).
Do you really have *case-sensitive* data (schema)? Could you elaborate more on 
why you hit this situation?
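
For reference, a small reproduction sketch of the scenario described above (an 
assumption about the reporter's setup, not taken from the ticket): write 
columns that differ only by case with `spark.sql.caseSensitive=true`, then 
read them back under the default setting.

{code:scala}
// Generate case-sensitive column names while case sensitivity is enabled.
spark.conf.set("spark.sql.caseSensitive", "true")
val df = spark.sql("SELECT 1 AS c, 2 AS C")
df.write.format("orc").mode("overwrite").save("/tmp/orc_case")

// Read the same files back with the default (case-insensitive) resolution;
// this is where the two OrcFileFormat implementations behave differently.
spark.conf.set("spark.sql.caseSensitive", "false")
spark.read.format("orc").load("/tmp/orc_case").select("c").show()
{code}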

> Case-insensitive field resolution when reading from ORC
> ---
>
> Key: SPARK-25175
> URL: https://issues.apache.org/jira/browse/SPARK-25175
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chenxiao Mao
>Priority: Major
>
> SPARK-25132 adds support for case-insensitive field resolution when reading 
> from Parquet files. We found ORC files have similar issues. Since Spark has 2 
> OrcFileFormat, we should add support for both.
>  * Since SPARK-2883, Spark supports ORC inside sql/hive module with Hive 
> dependency. This hive OrcFileFormat always do case-insensitive field 
> resolution regardless of case sensitivity mode. When there is ambiguity, hive 
> OrcFileFormat always returns the lower case field, rather than failing the 
> reading operation.
>  * SPARK-20682 adds a new ORC data source inside sql/core. This native 
> OrcFileFormat supports case-insensitive field resolution, however it cannot 
> handle duplicate fields.
> Besides data source tables, hive serde tables also have issues. When there 
> are duplicate fields (e.g. c, C), we just can't read hive serde tables. If 
> there are no duplicate fields, hive serde tables always do case-insensitive 
> field resolution regardless of case sensitivity mode.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25256) Plan mismatch errors in Hive tests in 2.12

2018-08-27 Thread Sean Owen (JIRA)
Sean Owen created SPARK-25256:
-

 Summary: Plan mismatch errors in Hive tests in 2.12
 Key: SPARK-25256
 URL: https://issues.apache.org/jira/browse/SPARK-25256
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.4.0
Reporter: Sean Owen


In Hive tests in the Scala 2.12 build, we are still seeing a few failures that 
seem to show mismatched schema inference. It is not clear whether this is the 
same issue as SPARK-25044. Examples:
{code:java}
- SPARK-5775 read array from partitioned_parquet_with_key_and_complextypes *** 
FAILED ***
Results do not match for query:
Timezone: 
sun.util.calendar.ZoneInfo[id="America/Los_Angeles",offset=-2880,dstSavings=360,useDaylight=true,transitions=185,lastRule=java.util.SimpleTimeZone[id=America/Los_Angeles,offset=-2880,dstSavings=360,useDaylight=true,startYear=0,startMode=3,startMonth=2,startDay=8,startDayOfWeek=1,startTime=720,startTimeMode=0,endMode=3,endMonth=10,endDay=1,endDayOfWeek=1,endTime=720,endTimeMode=0]]
Timezone Env: 

== Parsed Logical Plan ==
'Project ['arrayField, 'p]
+- 'Filter ('p = 1)
+- 'UnresolvedRelation `partitioned_parquet_with_key_and_complextypes`

== Analyzed Logical Plan ==
arrayField: array, p: int
Project [arrayField#82569, p#82570]
+- Filter (p#82570 = 1)
+- SubqueryAlias `default`.`partitioned_parquet_with_key_and_complextypes`
+- 
Relation[intField#82566,stringField#82567,structField#82568,arrayField#82569,p#82570]
 parquet

== Optimized Logical Plan ==
Project [arrayField#82569, p#82570]
+- Filter (isnotnull(p#82570) && (p#82570 = 1))
+- 
Relation[intField#82566,stringField#82567,structField#82568,arrayField#82569,p#82570]
 parquet

== Physical Plan ==
*(1) Project [arrayField#82569, p#82570]
+- *(1) FileScan parquet 
default.partitioned_parquet_with_key_and_complextypes[arrayField#82569,p#82570] 
Batched: false, Format: Parquet, Location: 
PrunedInMemoryFileIndex[file:/home/srowen/spark-2.12/sql/hive/target/tmp/spark-d8d87d74-33e7-4f22...,
 PartitionCount: 1, PartitionFilters: [isnotnull(p#82570), (p#82570 = 1)], 
PushedFilters: [], ReadSchema: struct>
== Results ==

== Results ==
!== Correct Answer - 10 == == Spark Answer - 10 ==
!struct<> struct,p:int>
![Range 1 to 1,1] [WrappedArray(1),1]
![Range 1 to 10,1] [WrappedArray(1, 2),1]
![Range 1 to 2,1] [WrappedArray(1, 2, 3),1]
![Range 1 to 3,1] [WrappedArray(1, 2, 3, 4),1]
![Range 1 to 4,1] [WrappedArray(1, 2, 3, 4, 5),1]
![Range 1 to 5,1] [WrappedArray(1, 2, 3, 4, 5, 6),1]
![Range 1 to 6,1] [WrappedArray(1, 2, 3, 4, 5, 6, 7),1]
![Range 1 to 7,1] [WrappedArray(1, 2, 3, 4, 5, 6, 7, 8),1]
![Range 1 to 8,1] [WrappedArray(1, 2, 3, 4, 5, 6, 7, 8, 9),1]
![Range 1 to 9,1] [WrappedArray(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),1] 
(QueryTest.scala:163){code}
{code:java}
- SPARK-2693 udaf aggregates test *** FAILED ***
Results do not match for query:
Timezone: 
sun.util.calendar.ZoneInfo[id="America/Los_Angeles",offset=-2880,dstSavings=360,useDaylight=true,transitions=185,lastRule=java.util.SimpleTimeZone[id=America/Los_Angeles,offset=-2880,dstSavings=360,useDaylight=true,startYear=0,startMode=3,startMonth=2,startDay=8,startDayOfWeek=1,startTime=720,startTimeMode=0,endMode=3,endMonth=10,endDay=1,endDayOfWeek=1,endTime=720,endTimeMode=0]]
Timezone Env: 

== Parsed Logical Plan ==
'GlobalLimit 1
+- 'LocalLimit 1
+- 'Project [unresolvedalias('percentile('key, 'array(1, 1)), None)]
+- 'UnresolvedRelation `src`

== Analyzed Logical Plan ==
percentile(key, array(1, 1), 1): array
GlobalLimit 1
+- LocalLimit 1
+- Aggregate [percentile(key#205098, cast(array(1, 1) as array), 1, 0, 
0) AS percentile(key, array(1, 1), 1)#205101]
+- SubqueryAlias `default`.`src`
+- HiveTableRelation `default`.`src`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key#205098, value#205099]

== Optimized Logical Plan ==
GlobalLimit 1
+- LocalLimit 1
+- Aggregate [percentile(key#205098, [1.0,1.0], 1, 0, 0) AS percentile(key, 
array(1, 1), 1)#205101]
+- Project [key#205098]
+- HiveTableRelation `default`.`src`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key#205098, value#205099]

== Physical Plan ==
CollectLimit 1
+- ObjectHashAggregate(keys=[], functions=[percentile(key#205098, [1.0,1.0], 1, 
0, 0)], output=[percentile(key, array(1, 1), 1)#205101])
+- Exchange SinglePartition
+- ObjectHashAggregate(keys=[], functions=[partial_percentile(key#205098, 
[1.0,1.0], 1, 0, 0)], output=[buf#205104])
+- Scan hive default.src [key#205098], HiveTableRelation `default`.`src`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key#205098, value#205099]
== Results ==

== Results ==
!== Correct Answer - 1 == == Spark Answer - 1 ==
!struct> struct>
![WrappedArray(498, 498)] [WrappedArray(498.0, 498.0)] 
(QueryTest.scala:163){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To u

[jira] [Assigned] (SPARK-25235) Merge the REPL code in Scala 2.11 and 2.12 branches

2018-08-27 Thread DB Tsai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai reassigned SPARK-25235:
---

Assignee: DB Tsai

> Merge the REPL code in Scala 2.11 and 2.12 branches
> ---
>
> Key: SPARK-25235
> URL: https://issues.apache.org/jira/browse/SPARK-25235
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell
>Affects Versions: 2.3.1
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
>
> Using some reflection tricks to merge Scala 2.11 and 2.12 codebase



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16281) Implement parse_url SQL function

2018-08-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16594169#comment-16594169
 ] 

Apache Spark commented on SPARK-16281:
--

User 'TomaszGaweda' has created a pull request for this issue:
https://github.com/apache/spark/pull/22249
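
For context, parse_url(url, partToExtract[, key]) extracts a component of a 
URL, following Hive's function of the same name; a quick usage sketch from a 
Scala shell:

{code:scala}
// Valid parts include HOST, PATH, QUERY, REF, PROTOCOL, FILE, AUTHORITY, USERINFO.
spark.sql(
  "SELECT parse_url('https://spark.apache.org/docs/latest/index.html?lang=scala', 'HOST')"
).show(false) // spark.apache.org

// With the optional key argument, QUERY returns a single query parameter.
spark.sql(
  "SELECT parse_url('https://spark.apache.org/docs/latest/index.html?lang=scala', 'QUERY', 'lang')"
).show(false) // scala
{code}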

> Implement parse_url SQL function
> 
>
> Key: SPARK-16281
> URL: https://issues.apache.org/jira/browse/SPARK-16281
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Jian Wu
>Priority: Major
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25255) Add getActiveSession to SparkSession in PySpark

2018-08-27 Thread holdenk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-25255:

Labels: starter  (was: )

> Add getActiveSession to SparkSession in PySpark
> ---
>
> Key: SPARK-25255
> URL: https://issues.apache.org/jira/browse/SPARK-25255
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: holdenk
>Priority: Trivial
>  Labels: starter
>
> Add getActiveSession to PySpark session API.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25255) Add getActiveSession to SparkSession in PySpark

2018-08-27 Thread holdenk (JIRA)
holdenk created SPARK-25255:
---

 Summary: Add getActiveSession to SparkSession in PySpark
 Key: SPARK-25255
 URL: https://issues.apache.org/jira/browse/SPARK-25255
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.4.0
Reporter: holdenk


Add getActiveSession to PySpark session API.
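
For context, the Scala API already exposes this; a rough sketch of the 
existing Scala counterpart that the PySpark addition would presumably mirror 
(the eventual Python signature is not specified in this ticket):

{code:scala}
import org.apache.spark.sql.SparkSession

// Existing Scala API: returns the session active on the current thread, if any.
val active: Option[SparkSession] = SparkSession.getActiveSession
active.foreach(s => println(s.sparkContext.appName))
{code}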



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25254) docker-image-tool should allow not building images for R and Python

2018-08-27 Thread Chaoran Yu (JIRA)
Chaoran Yu created SPARK-25254:
--

 Summary: docker-image-tool should allow not building images for R 
and Python
 Key: SPARK-25254
 URL: https://issues.apache.org/jira/browse/SPARK-25254
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 2.4.0
Reporter: Chaoran Yu


Suppose a user does not need R, so he builds Spark using make-distribution.sh 
without R to generate a runnable distribution.

Then he runs docker-image-tool.sh to generate Docker images based on the 
distribution just produced in the previous step, and docker-image-tool.sh 
fails because the R image fails to build. This is counter-intuitive.

docker-image-tool.sh builds three images: regular, R, and Python. 
make-distribution.sh allows users to specify whether to build Spark with R or 
Python, and docker-image-tool.sh should allow the same.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25253) Refactor pyspark connection & authentication

2018-08-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25253:


Assignee: (was: Apache Spark)

> Refactor pyspark connection & authentication
> 
>
> Key: SPARK-25253
> URL: https://issues.apache.org/jira/browse/SPARK-25253
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Minor
>
> We've got a few places in pyspark that connect to local sockets, with varying 
> levels of ipv6 handling, graceful error handling, and lots of copy-and-paste. 
> It should be pretty easy to clean this up.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25253) Refactor pyspark connection & authentication

2018-08-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25253:


Assignee: Apache Spark

> Refactor pyspark connection & authentication
> 
>
> Key: SPARK-25253
> URL: https://issues.apache.org/jira/browse/SPARK-25253
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Assignee: Apache Spark
>Priority: Minor
>
> We've got a few places in pyspark that connect to local sockets, with varying 
> levels of ipv6 handling, graceful error handling, and lots of copy-and-paste. 
> It should be pretty easy to clean this up.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25253) Refactor pyspark connection & authentication

2018-08-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16594055#comment-16594055
 ] 

Apache Spark commented on SPARK-25253:
--

User 'squito' has created a pull request for this issue:
https://github.com/apache/spark/pull/22247

> Refactor pyspark connection & authentication
> 
>
> Key: SPARK-25253
> URL: https://issues.apache.org/jira/browse/SPARK-25253
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Minor
>
> We've got a few places in pyspark that connect to local sockets, with varying 
> levels of ipv6 handling, graceful error handling, and lots of copy-and-paste. 
> It should be pretty easy to clean this up.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25250) Race condition with tasks running when new attempt for same stage is created leads to other task in the next attempt running on the same partition id retry multiple time

2018-08-27 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-25250:
--
Description: 
We recently had a scenario where a race condition occurred when a task from a 
previous stage attempt finished just before a new attempt for the same stage 
was created due to a fetch failure. The new task created in the second attempt 
for the same partition id kept retrying due to a TaskCommitDenied exception, 
without realizing that the task in the earlier attempt had already succeeded.

For example, consider a task with partition id 9000 and index 9000 running in 
stage 4.0. We see a fetch failure, so we spawn a new stage attempt 4.1. Just 
within this timespan, the above task completes successfully, marking partition 
id 9000 as complete for 4.0. However, as stage 4.1 has not yet been created, 
the taskset info for that stage is not available to the TaskScheduler, so, 
naturally, partition id 9000 has not been marked completed for 4.1. Stage 4.1 
now spawns a task with index 2000 on the same partition id 9000. This task 
fails due to CommitDeniedException and, since it does not see the 
corresponding partition id as having been marked successful, it keeps retrying 
until the job finally succeeds. It doesn't cause any job failures because the 
DAG scheduler tracks the partitions separately from the task set managers.

Steps to Reproduce:
 # Run any large job involving a shuffle operation.
 # When the ShuffleMap stage finishes and the ResultStage begins running, cause 
this stage to throw a fetch failure exception (try deleting certain shuffle 
files on any host).
 # Observe the task attempt numbers for the next stage attempt. Please note 
that this issue is an intermittent one, so it might not happen all the time.

  was:
We recently had a scenario where a race condition occurred when a task from 
previous stage attempt just finished before new attempt for the same stage was 
created due to fetch failure, so the new task created in the second attempt on 
the same partition id was retrying multiple times due to TaskCommitDenied 
Exception without realizing that the task in earlier attempt was already 
successful.  

For example, consider a task with partition id 9000 and index 9000 running in 
stage 4.0. We see a fetch failure so thus, we spawn a new stage attempt 4.1. 
Just within this timespan, the above task completes successfully, thus, marking 
the partition id 9000 as complete for 4.0. However, as stage 4.1 has not yet 
been created, the taskset info for that stage is not available to the 
TaskScheduler so, naturally, the partition id 9000 has not been marked 
completed for 4.1. Stage 4.1 now spawns task with index 2000 on the same 
partition id 9000. This task fails due to CommitDeniedException and since, it 
does not see the corresponding partition id as been marked successful, it keeps 
retrying multiple times. 

 

Steps to Reproduce:
 # Run any large job involving shuffle operation.
 # When the ShuffleMap stage finishes and the ResultStage begins running, cause 
this stage to throw a fetch failure exception(Try deleting certain shuffle 
files on any host).
 # Observe the task attempt numbers for the next stage attempt. Please note 
that this issue is an intermittent one, so it might not happen all the time.


> Race condition with tasks running when new attempt for same stage is created 
> leads to other task in the next attempt running on the same partition id 
> retry multiple times
> --
>
> Key: SPARK-25250
> URL: https://issues.apache.org/jira/browse/SPARK-25250
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.3.1
>Reporter: Parth Gandhi
>Priority: Major
>
> We recently had a scenario where a race condition occurred when a task from 
> previous stage attempt just finished before new attempt for the same stage 
> was created due to fetch failure, so the new task created in the second 
> attempt on the same partition id was retrying multiple times due to 
> TaskCommitDenied Exception without realizing that the task in earlier attempt 
> was already successful.  
> For example, consider a task with partition id 9000 and index 9000 running in 
> stage 4.0. We see a fetch failure so thus, we spawn a new stage attempt 4.1. 
> Just within this timespan, the above task completes successfully, thus, 
> marking the partition id 9000 as complete for 4.0. However, as stage 4.1 has 
> not yet been created, the taskset info for that stage is not available to the 
> TaskScheduler so, naturally, the partition id 9000 has not been marked 
> completed for 4.1. St

[jira] [Commented] (SPARK-24434) Support user-specified driver and executor pod templates

2018-08-27 Thread Henry Robinson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16594046#comment-16594046
 ] 

Henry Robinson commented on SPARK-24434:


Yeah, assignees are set after the PR is merged. I think the idea is to prevent 
someone from assigning an issue to themselves, effectively taking a lock, and 
never actually making progress. 

> Support user-specified driver and executor pod templates
> 
>
> Key: SPARK-24434
> URL: https://issues.apache.org/jira/browse/SPARK-24434
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> With more requests for customizing the driver and executor pods coming, the 
> current approach of adding new Spark configuration options has some serious 
> drawbacks: 1) it means more Kubernetes specific configuration options to 
> maintain, and 2) it widens the gap between the declarative model used by 
> Kubernetes and the configuration model used by Spark. We should start 
> designing a solution that allows users to specify pod templates as central 
> places for all customization needs for the driver and executor pods. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24434) Support user-specified driver and executor pod templates

2018-08-27 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16594045#comment-16594045
 ] 

Stavros Kontopoulos commented on SPARK-24434:
-

Maybe it is added when the issue is resolved or something.

> Support user-specified driver and executor pod templates
> 
>
> Key: SPARK-24434
> URL: https://issues.apache.org/jira/browse/SPARK-24434
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> With more requests for customizing the driver and executor pods coming, the 
> current approach of adding new Spark configuration options has some serious 
> drawbacks: 1) it means more Kubernetes specific configuration options to 
> maintain, and 2) it widens the gap between the declarative model used by 
> Kubernetes and the configuration model used by Spark. We should start 
> designing a solution that allows users to specify pod templates as central 
> places for all customization needs for the driver and executor pods. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25250) Race condition with tasks running when new attempt for same stage is created leads to other task in the next attempt running on the same partition id retry multiple time

2018-08-27 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-25250:
--
Priority: Major  (was: Minor)

> Race condition with tasks running when new attempt for same stage is created 
> leads to other task in the next attempt running on the same partition id 
> retry multiple times
> --
>
> Key: SPARK-25250
> URL: https://issues.apache.org/jira/browse/SPARK-25250
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.3.1
>Reporter: Parth Gandhi
>Priority: Major
>
> We recently had a scenario where a race condition occurred when a task from 
> previous stage attempt just finished before new attempt for the same stage 
> was created due to fetch failure, so the new task created in the second 
> attempt on the same partition id was retrying multiple times due to 
> TaskCommitDenied Exception without realizing that the task in earlier attempt 
> was already successful.  
> For example, consider a task with partition id 9000 and index 9000 running in 
> stage 4.0. We see a fetch failure so thus, we spawn a new stage attempt 4.1. 
> Just within this timespan, the above task completes successfully, thus, 
> marking the partition id 9000 as complete for 4.0. However, as stage 4.1 has 
> not yet been created, the taskset info for that stage is not available to the 
> TaskScheduler so, naturally, the partition id 9000 has not been marked 
> completed for 4.1. Stage 4.1 now spawns task with index 2000 on the same 
> partition id 9000. This task fails due to CommitDeniedException and since, it 
> does not see the corresponding partition id as been marked successful, it 
> keeps retrying multiple times. 
>  
> Steps to Reproduce:
>  # Run any large job involving shuffle operation.
>  # When the ShuffleMap stage finishes and the ResultStage begins running, 
> cause this stage to throw a fetch failure exception(Try deleting certain 
> shuffle files on any host).
>  # Observe the task attempt numbers for the next stage attempt. Please note 
> that this issue is an intermittent one, so it might not happen all the time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24434) Support user-specified driver and executor pod templates

2018-08-27 Thread Yinan Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16594038#comment-16594038
 ] 

Yinan Li commented on SPARK-24434:
--

It seemed I couldn't change the assignee.

> Support user-specified driver and executor pod templates
> 
>
> Key: SPARK-24434
> URL: https://issues.apache.org/jira/browse/SPARK-24434
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> With more requests for customizing the driver and executor pods coming, the 
> current approach of adding new Spark configuration options has some serious 
> drawbacks: 1) it means more Kubernetes specific configuration options to 
> maintain, and 2) it widens the gap between the declarative model used by 
> Kubernetes and the configuration model used by Spark. We should start 
> designing a solution that allows users to specify pod templates as central 
> places for all customization needs for the driver and executor pods. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25235) Merge the REPL code in Scala 2.11 and 2.12 branches

2018-08-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16594036#comment-16594036
 ] 

Apache Spark commented on SPARK-25235:
--

User 'dbtsai' has created a pull request for this issue:
https://github.com/apache/spark/pull/22246

> Merge the REPL code in Scala 2.11 and 2.12 branches
> ---
>
> Key: SPARK-25235
> URL: https://issues.apache.org/jira/browse/SPARK-25235
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell
>Affects Versions: 2.3.1
>Reporter: DB Tsai
>Priority: Major
>
> Using some reflection tricks to merge Scala 2.11 and 2.12 codebase



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25235) Merge the REPL code in Scala 2.11 and 2.12 branches

2018-08-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25235:


Assignee: (was: Apache Spark)

> Merge the REPL code in Scala 2.11 and 2.12 branches
> ---
>
> Key: SPARK-25235
> URL: https://issues.apache.org/jira/browse/SPARK-25235
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell
>Affects Versions: 2.3.1
>Reporter: DB Tsai
>Priority: Major
>
> Using some reflection tricks to merge Scala 2.11 and 2.12 codebase



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25235) Merge the REPL code in Scala 2.11 and 2.12 branches

2018-08-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25235:


Assignee: Apache Spark

> Merge the REPL code in Scala 2.11 and 2.12 branches
> ---
>
> Key: SPARK-25235
> URL: https://issues.apache.org/jira/browse/SPARK-25235
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell
>Affects Versions: 2.3.1
>Reporter: DB Tsai
>Assignee: Apache Spark
>Priority: Major
>
> Using some reflection tricks to merge Scala 2.11 and 2.12 codebase



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24434) Support user-specified driver and executor pod templates

2018-08-27 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16594024#comment-16594024
 ] 

Stavros Kontopoulos commented on SPARK-24434:
-

Thanks [~liyinan926], I am reviewing the PR. Still, this is not assigned; 
please assign it to [~yifeih]. Anyway, this is not about who does what but 
about being fair. I never understood the intentions here.

> Support user-specified driver and executor pod templates
> 
>
> Key: SPARK-24434
> URL: https://issues.apache.org/jira/browse/SPARK-24434
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> With more requests for customizing the driver and executor pods coming, the 
> current approach of adding new Spark configuration options has some serious 
> drawbacks: 1) it means more Kubernetes specific configuration options to 
> maintain, and 2) it widens the gap between the declarative model used by 
> Kubernetes and the configuration model used by Spark. We should start 
> designing a solution that allows users to specify pod templates as central 
> places for all customization needs for the driver and executor pods. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25253) Refactor pyspark connection & authentication

2018-08-27 Thread Imran Rashid (JIRA)
Imran Rashid created SPARK-25253:


 Summary: Refactor pyspark connection & authentication
 Key: SPARK-25253
 URL: https://issues.apache.org/jira/browse/SPARK-25253
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.4.0
Reporter: Imran Rashid


We've got a few places in pyspark that connect to local sockets, with varying 
levels of ipv6 handling, graceful error handling, and lots of copy-and-paste.  
It should be pretty easy to clean this up.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25252) Support arrays of any types in to_json

2018-08-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25252:


Assignee: Apache Spark

> Support arrays of any types in to_json
> --
>
> Key: SPARK-25252
> URL: https://issues.apache.org/jira/browse/SPARK-25252
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Minor
>
> We need to improve the to_json function and make it more consistent with 
> from_json by supporting arrays of any type (as root types). For now, it 
> supports only arrays of structs and arrays of maps. After the changes, the 
> following code should work:
> {code:scala}
> select to_json(array('1','2','3'))
> > ["1","2","3"]
> select to_json(array(array(1,2,3),array(4)))
> > [[1,2,3],[4]]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25252) Support arrays of any types in to_json

2018-08-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25252:


Assignee: (was: Apache Spark)

> Support arrays of any types in to_json
> --
>
> Key: SPARK-25252
> URL: https://issues.apache.org/jira/browse/SPARK-25252
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Priority: Minor
>
> We need to improve the to_json function and make it more consistent with 
> from_json by supporting arrays of any type (as root types). For now, it 
> supports only arrays of structs and arrays of maps. After the changes, the 
> following code should work:
> {code:scala}
> select to_json(array('1','2','3'))
> > ["1","2","3"]
> select to_json(array(array(1,2,3),array(4)))
> > [[1,2,3],[4]]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25252) Support arrays of any types in to_json

2018-08-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16594005#comment-16594005
 ] 

Apache Spark commented on SPARK-25252:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/6

> Support arrays of any types in to_json
> --
>
> Key: SPARK-25252
> URL: https://issues.apache.org/jira/browse/SPARK-25252
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Priority: Minor
>
> We need to improve the to_json function and make it more consistent with 
> from_json by supporting arrays of any type (as root types). For now, it 
> supports only arrays of structs and arrays of maps. After the changes, the 
> following code should work:
> {code:scala}
> select to_json(array('1','2','3'))
> > ["1","2","3"]
> select to_json(array(array(1,2,3),array(4)))
> > [[1,2,3],[4]]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25252) Support arrays of any types in to_json

2018-08-27 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-25252:
--

 Summary: Support arrays of any types in to_json
 Key: SPARK-25252
 URL: https://issues.apache.org/jira/browse/SPARK-25252
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.1
Reporter: Maxim Gekk


We need to improve the to_json function and make it more consistent with from_json 
by supporting arrays of any type (as root types). For now, it supports only 
arrays of structs and arrays of maps. After the changes, the following code 
should work:
{code:scala}
select to_json(array('1','2','3'))
> ["1","2","3"]
select to_json(array(array(1,2,3),array(4)))
> [[1,2,3],[4]]
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24882) data source v2 API improvement

2018-08-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593951#comment-16593951
 ] 

Apache Spark commented on SPARK-24882:
--

User 'jose-torres' has created a pull request for this issue:
https://github.com/apache/spark/pull/22245

> data source v2 API improvement
> --
>
> Key: SPARK-24882
> URL: https://issues.apache.org/jira/browse/SPARK-24882
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 2.4.0
>
>
> Data source V2 has been out for a while; see the SPIP 
> [here|https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit?usp=sharing].
>  We have already migrated most of the built-in streaming data sources to the 
> V2 API, and the file source migration is in progress. During the migration, 
> we found several problems and want to address them before we stabilize the V2 
> API.
> To solve these problems, we need to separate responsibilities in the data 
> source V2 API, isolate the stateful part of the API, and think of better naming 
> for some interfaces. For details, please see the attached Google doc: 
> https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25249) Add a unit test for OpenHashMap

2018-08-27 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-25249:
-

Assignee: liuxian

> Add a unit test for OpenHashMap
> ---
>
> Key: SPARK-25249
> URL: https://issues.apache.org/jira/browse/SPARK-25249
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: liuxian
>Assignee: liuxian
>Priority: Trivial
> Fix For: 2.4.0
>
>
> Add a unit test for OpenHashMap; this can help developers distinguish 
> between a stored 0/0.0/0L value and a non-existent value.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25249) Add a unit test for OpenHashMap

2018-08-27 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-25249.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22241
[https://github.com/apache/spark/pull/22241]

> Add a unit test for OpenHashMap
> ---
>
> Key: SPARK-25249
> URL: https://issues.apache.org/jira/browse/SPARK-25249
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: liuxian
>Assignee: liuxian
>Priority: Trivial
> Fix For: 2.4.0
>
>
> Add a unit test for OpenHashMap; this can help developers distinguish 
> between a stored 0/0.0/0L value and a non-existent value.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25249) Add a unit test for OpenHashMap

2018-08-27 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-25249:
--
Priority: Trivial  (was: Minor)

I don't think this kind of thing needs a JIRA

> Add a unit test for OpenHashMap
> ---
>
> Key: SPARK-25249
> URL: https://issues.apache.org/jira/browse/SPARK-25249
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: liuxian
>Priority: Trivial
>
> Add a unit test for OpenHashMap; this can help developers distinguish 
> between a stored 0/0.0/0L value and a non-existent value.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25240) A deadlock in ALTER TABLE RECOVER PARTITIONS

2018-08-27 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25240:

Target Version/s: 2.4.0

> A deadlock in ALTER TABLE RECOVER PARTITIONS
> 
>
> Key: SPARK-25240
> URL: https://issues.apache.org/jira/browse/SPARK-25240
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Blocker
>
> Recovering partitions in ALTER TABLE is performed recursively by calling 
> the scanPartitions() method. scanPartitions() lists files sequentially, or in 
> parallel if the 
> [condition|https://github.com/apache/spark/blob/131ca146ed390cd0109cd6e8c95b61e418507080/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L685]
>  is true:
> {code:scala}
> partitionNames.length > 1 && statuses.length > threshold || 
> partitionNames.length > 2
> {code}
> Parallel listing is executed on [the fixed thread 
> pool|https://github.com/apache/spark/blob/131ca146ed390cd0109cd6e8c95b61e418507080/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L622],
>  which can have 8 threads in total. A deadlock occurs when all 8 threads are 
> already occupied and a recursive call of scanPartitions() submits new 
> parallel file listings.
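
The general pattern (an illustration only, not Spark's actual code) is a bounded 
thread pool whose tasks recursively submit more tasks to the same pool and block 
on their results:

{code:python}
# Illustration of the deadlock pattern only -- this script intentionally hangs.
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=8)  # mirrors the fixed-size listing pool

def scan(depth):
    if depth == 0:
        return 1
    # Recursive "listing": submit children to the same bounded pool and block on them.
    children = [pool.submit(scan, depth - 1) for _ in range(4)]
    return sum(c.result() for c in children)

# Once all 8 workers are blocked in c.result(), the queued children can never be
# scheduled, so nothing ever completes.
print(pool.submit(scan, 3).result())
{code}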



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25240) A deadlock in ALTER TABLE RECOVER PARTITIONS

2018-08-27 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25240:

Priority: Blocker  (was: Major)

> A deadlock in ALTER TABLE RECOVER PARTITIONS
> 
>
> Key: SPARK-25240
> URL: https://issues.apache.org/jira/browse/SPARK-25240
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Blocker
>
> Recovering partitions in ALTER TABLE is performed recursively by calling 
> the scanPartitions() method. scanPartitions() lists files sequentially, or in 
> parallel if the 
> [condition|https://github.com/apache/spark/blob/131ca146ed390cd0109cd6e8c95b61e418507080/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L685]
>  is true:
> {code:scala}
> partitionNames.length > 1 && statuses.length > threshold || 
> partitionNames.length > 2
> {code}
> Parallel listing is executed on [the fixed thread 
> pool|https://github.com/apache/spark/blob/131ca146ed390cd0109cd6e8c95b61e418507080/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L622],
>  which can have 8 threads in total. A deadlock occurs when all 8 threads are 
> already occupied and a recursive call of scanPartitions() submits new 
> parallel file listings.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25251) Make spark-csv's `quote` and `escape` options conform to RFC 4180

2018-08-27 Thread Ruslan Dautkhanov (JIRA)
Ruslan Dautkhanov created SPARK-25251:
-

 Summary: Make spark-csv's `quote` and `escape` options conform to 
RFC 4180
 Key: SPARK-25251
 URL: https://issues.apache.org/jira/browse/SPARK-25251
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 2.3.1, 2.3.0, 2.4.0, 3.0.0
Reporter: Ruslan Dautkhanov


As described in [RFC-4180|https://tools.ietf.org/html/rfc4180], page 2 -

{noformat}
   7. If double-quotes are used to enclose fields, then a double-quote 
appearing inside a field must be escaped by preceding it with another double 
quote
{noformat}

That's what Excel does, for example, by default.

In Spark (as of Spark 2.1), however, escaping is done by default in a 
non-RFC way, using a backslash (\). To fix this you have to explicitly tell Spark 
to use the double quote as the escape character:

{code}
.option('quote', '"') 
.option('escape', '"')
{code}

This may explain cases where a comma character isn't interpreted correctly because it 
is inside a quoted column.

So this is a request to make the spark-csv reader RFC-4180 compatible with regard to the 
default option values for `quote` and `escape` (make both equal to ").

Since this is a backward-incompatible change, Spark 3.0 might be a good release 
for this change.

Some more background - https://stackoverflow.com/a/45138591/470583 
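
A minimal PySpark sketch of reading an RFC-4180-style file with the workaround above 
(the file path and sample data are made up for illustration):

{code:python}
# Sketch only: the sample file and its path are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# RFC 4180: an embedded double quote is escaped by doubling it.
with open("/tmp/rfc4180.csv", "w") as f:
    f.write('id,comment\n1,"He said ""hi"", then left"\n')

df = (spark.read
      .option("header", "true")
      .option("quote", '"')
      .option("escape", '"')   # workaround: RFC-4180 behavior instead of the backslash default
      .csv("file:///tmp/rfc4180.csv"))

df.show(truncate=False)        # comment column: He said "hi", then left
{code}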



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25091) UNCACHE TABLE, CLEAR CACHE, rdd.unpersist() does not clean up executor memory

2018-08-27 Thread Yunling Cai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593919#comment-16593919
 ] 

Yunling Cai commented on SPARK-25091:
-

Thanks [~Chao Fang] for working on this! I have changed the ticket title. 

Quick question: does this mean this is just a UI issue where executor 
information was shown incorrectly? We saw the cached tables start spilling 
onto disk even though we had uncached the previous copy, and we also started 
seeing duplicate entries on the Storage tab for the same table, which is why we 
think the memory cleanup may have actual problems.

Steps to reproduce:

CACHE TABLE A

UNCACHE TABLE A

CACHE TABLE A

REFRESH TABLE has a similar behavior. 
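
For illustration, the same steps in PySpark (run in a pyspark shell with an active 
spark session; the table name is just a placeholder):

{code:python}
# Sketch of the repro steps above; "test.test_cache" is a placeholder table name.
spark.sql("CACHE TABLE test.test_cache")
print(spark.catalog.isCached("test.test_cache"))   # True

spark.sql("UNCACHE TABLE test.test_cache")
print(spark.catalog.isCached("test.test_cache"))   # False, yet the executor storage
                                                   # memory reportedly stays allocated

spark.sql("CACHE TABLE test.test_cache")           # second copy; watch the Storage tab
{code}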

 

Thanks!

> UNCACHE TABLE, CLEAR CACHE, rdd.unpersist() does not clean up executor memory
> -
>
> Key: SPARK-25091
> URL: https://issues.apache.org/jira/browse/SPARK-25091
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Yunling Cai
>Priority: Critical
>
> UNCACHE TABLE and CLEAR CACHE do not clean up executor memory.
> In the Spark UI, although the Storage tab shows the cached table removed, the 
> Executors tab shows the executors continuing to hold the RDD, and the memory is 
> not cleared. This results in a huge waste of executor memory. As we call 
> CACHE TABLE again, we run into issues where the cached tables are spilled to disk 
> instead of reclaiming the memory storage.
> Steps to reproduce:
> CACHE TABLE test.test_cache;
> UNCACHE TABLE test.test_cache;
> == Storage shows table is not cached; Executor shows the executor storage 
> memory does not change == 
> CACHE TABLE test.test_cache;
> CLEAR CACHE;
> == Storage shows table is not cached; Executor shows the executor storage 
> memory does not change == 
> Similar behavior when using pyspark df.unpersist().



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23874) Upgrade apache/arrow to 0.10.0

2018-08-27 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593913#comment-16593913
 ] 

Xiao Li commented on SPARK-23874:
-

ping [~bryanc] again. 

> Upgrade apache/arrow to 0.10.0
> --
>
> Key: SPARK-23874
> URL: https://issues.apache.org/jira/browse/SPARK-23874
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Bryan Cutler
>Priority: Major
> Fix For: 2.4.0
>
>
> Version 0.10.0 will allow for the following improvements and bug fixes:
>  * Allow for adding BinaryType support ARROW-2141
>  * Bug fix related to array serialization ARROW-1973
>  * Python2 str will be made into an Arrow string instead of bytes ARROW-2101
>  * Python bytearrays are supported in as input to pyarrow ARROW-2141
>  * Java has common interface for reset to cleanup complex vectors in Spark 
> ArrowWriter ARROW-1962
>  * Cleanup pyarrow type equality checks ARROW-2423
>  * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, 
> ARROW-2645
>  * Improved low level handling of messages for RecordBatch ARROW-2704
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25091) UNCACHE TABLE, CLEAR CACHE, rdd.unpersist() does not clean up executor memory

2018-08-27 Thread Yunling Cai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yunling Cai updated SPARK-25091:

Summary: UNCACHE TABLE, CLEAR CACHE, rdd.unpersist() does not clean up 
executor memory  (was: Spark Thrift Server: UNCACHE TABLE and CLEAR CACHE does 
not clean up executor memory)

> UNCACHE TABLE, CLEAR CACHE, rdd.unpersist() does not clean up executor memory
> -
>
> Key: SPARK-25091
> URL: https://issues.apache.org/jira/browse/SPARK-25091
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Yunling Cai
>Priority: Critical
>
> UNCACHE TABLE and CLEAR CACHE do not clean up executor memory.
> In the Spark UI, although the Storage tab shows the cached table removed, the 
> Executors tab shows the executors continuing to hold the RDD, and the memory is 
> not cleared. This results in a huge waste of executor memory. As we call 
> CACHE TABLE again, we run into issues where the cached tables are spilled to disk 
> instead of reclaiming the memory storage.
> Steps to reproduce:
> CACHE TABLE test.test_cache;
> UNCACHE TABLE test.test_cache;
> == Storage shows table is not cached; Executor shows the executor storage 
> memory does not change == 
> CACHE TABLE test.test_cache;
> CLEAR CACHE;
> == Storage shows table is not cached; Executor shows the executor storage 
> memory does not change == 
> Similar behavior when using pyspark df.unpersist().



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25091) Spark Thrift Server: UNCACHE TABLE and CLEAR CACHE does not clean up executor memory

2018-08-27 Thread Chao Fang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593908#comment-16593908
 ] 

Chao Fang commented on SPARK-25091:
---

Hi [~dongjoon],

The AppStatusListener class uses LiveRDD and LiveExecutor to keep track of RDD 
and Executor info, respectively. Though LiveRDD tracks RDD info correctly 
whether we cache or unpersist the RDD, LiveExecutor fails to track the Executor 
info when we unpersist an RDD; that is why the Executors tab in the Spark UI does 
not change when we unpersist an RDD or UNCACHE a table. I will soon create a PR 
for this issue.

But I don't know how to change the title.

> Spark Thrift Server: UNCACHE TABLE and CLEAR CACHE does not clean up executor 
> memory
> 
>
> Key: SPARK-25091
> URL: https://issues.apache.org/jira/browse/SPARK-25091
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Yunling Cai
>Priority: Critical
>
> UNCACHE TABLE and CLEAR CACHE do not clean up executor memory.
> In the Spark UI, although the Storage tab shows the cached table removed, the 
> Executors tab shows the executors continuing to hold the RDD, and the memory is 
> not cleared. This results in a huge waste of executor memory. As we call 
> CACHE TABLE again, we run into issues where the cached tables are spilled to disk 
> instead of reclaiming the memory storage.
> Steps to reproduce:
> CACHE TABLE test.test_cache;
> UNCACHE TABLE test.test_cache;
> == Storage shows table is not cached; Executor shows the executor storage 
> memory does not change == 
> CACHE TABLE test.test_cache;
> CLEAR CACHE;
> == Storage shows table is not cached; Executor shows the executor storage 
> memory does not change == 
> Similar behavior when using pyspark df.unpersist().



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25175) Case-insensitive field resolution when reading from ORC

2018-08-27 Thread Chenxiao Mao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593200#comment-16593200
 ] 

Chenxiao Mao edited comment on SPARK-25175 at 8/27/18 3:53 PM:
---

Also, here is a similar investigation I did for Parquet tables, just for your 
information: [https://github.com/apache/spark/pull/22184/files#r212405373]


was (Author: seancxmao):
Also here is the similar investigation I did for parquet tables. Just for your 
information: https://github.com/apache/spark/pull/22184/files#r212405373

> Case-insensitive field resolution when reading from ORC
> ---
>
> Key: SPARK-25175
> URL: https://issues.apache.org/jira/browse/SPARK-25175
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chenxiao Mao
>Priority: Major
>
> SPARK-25132 adds support for case-insensitive field resolution when reading 
> from Parquet files. We found that ORC files have similar issues. Since Spark has 
> two OrcFileFormat implementations, we should add support for both.
>  * Since SPARK-2883, Spark supports ORC inside the sql/hive module with a Hive 
> dependency. This Hive OrcFileFormat always does case-insensitive field 
> resolution regardless of the case sensitivity mode. When there is ambiguity, the 
> Hive OrcFileFormat always returns the lower-case field rather than failing the 
> read operation.
>  * SPARK-20682 adds a new ORC data source inside sql/core. This native 
> OrcFileFormat supports case-insensitive field resolution; however, it cannot 
> handle duplicate fields.
> Besides data source tables, Hive SerDe tables also have issues. When there 
> are duplicate fields (e.g. c, C), we just can't read Hive SerDe tables. If 
> there are no duplicate fields, Hive SerDe tables always do case-insensitive 
> field resolution regardless of the case sensitivity mode.
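
For reference, a minimal repro sketch along the lines of the Parquet example in 
SPARK-25206 (run in a pyspark shell with an active spark session; the path is 
illustrative, and the observed behavior differs between the two OrcFileFormat 
implementations):

{code:python}
# Sketch: /tmp/orc_data is an illustrative path.
spark.range(10).write.mode("overwrite").orc("/tmp/orc_data")   # physical column name: "id"

spark.sql("DROP TABLE IF EXISTS t_orc")
spark.sql("CREATE TABLE t_orc (ID LONG) USING orc LOCATION '/tmp/orc_data'")

# With spark.sql.caseSensitive=false one would expect 9 rows here; whether the
# ID vs. id lookup succeeds depends on which OrcFileFormat is in use.
spark.sql("SELECT * FROM t_orc WHERE ID > 0").show()
{code}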



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25213) DataSourceV2 doesn't seem to produce unsafe rows

2018-08-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593858#comment-16593858
 ] 

Apache Spark commented on SPARK-25213:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/22244

> DataSourceV2 doesn't seem to produce unsafe rows 
> -
>
> Key: SPARK-25213
> URL: https://issues.apache.org/jira/browse/SPARK-25213
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Li Jin
>Priority: Major
>
> Reproduce (Need to compile test-classes):
> bin/pyspark --driver-class-path sql/core/target/scala-2.11/test-classes
> {code:java}
> datasource_v2_df = spark.read \
> .format("org.apache.spark.sql.sources.v2.SimpleDataSourceV2") 
> \
> .load()
> result = datasource_v2_df.withColumn('x', udf(lambda x: x, 
> 'int')(datasource_v2_df['i']))
> result.show()
> {code}
> The above code fails with:
> {code:java}
> Caused by: java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast 
> to org.apache.spark.sql.catalyst.expressions.UnsafeRow
>   at 
> org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1$$anonfun$5.apply(EvalPythonExec.scala:127)
>   at 
> org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1$$anonfun$5.apply(EvalPythonExec.scala:126)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
> {code}
> It seems that Data Source V2 doesn't produce UnsafeRows here.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24721) Failed to use PythonUDF with literal inputs in filter with data sources

2018-08-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593857#comment-16593857
 ] 

Apache Spark commented on SPARK-24721:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/22244

> Failed to use PythonUDF with literal inputs in filter with data sources
> ---
>
> Key: SPARK-24721
> URL: https://issues.apache.org/jira/browse/SPARK-24721
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.3.1
>Reporter: Xiao Li
>Priority: Major
>
> {code}
> import random
> from pyspark.sql.functions import *
> from pyspark.sql.types import *
> def random_probability(label):
> if label == 1.0:
>   return random.uniform(0.5, 1.0)
> else:
>   return random.uniform(0.0, 0.4999)
> def randomize_label(ratio):
> 
> if random.random() >= ratio:
>   return 1.0
> else:
>   return 0.0
> random_probability = udf(random_probability, DoubleType())
> randomize_label = udf(randomize_label, DoubleType())
> spark.range(10).write.mode("overwrite").format('csv').save("/tmp/tab3")
> babydf = spark.read.csv("/tmp/tab3")
> data_modified_label = babydf.withColumn(
>   'random_label', randomize_label(lit(1 - 0.1))
> )
> data_modified_random = data_modified_label.withColumn(
>   'random_probability', 
>   random_probability(col('random_label'))
> )
> data_modified_label.filter(col('random_label') == 0).show()
> {code}
> The above code will generate the following exception:
> {code}
> Py4JJavaError: An error occurred while calling o446.showString.
> : java.lang.RuntimeException: Invalid PythonUDF randomize_label(0.9), 
> requires attributes from more than one child.
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:166)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:165)
>   at scala.collection.immutable.Stream.foreach(Stream.scala:594)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:165)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:116)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:112)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:77)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sq

[jira] [Updated] (SPARK-25206) wrong records are returned when Hive metastore schema and parquet schema are in different letter cases

2018-08-27 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25206:

Target Version/s: 2.3.2, 2.4.0  (was: 2.3.2)

> wrong records are returned when Hive metastore schema and parquet schema are 
> in different letter cases
> --
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: Parquet, correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In the current Spark 2.3.1, the query below silently returns wrong data.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> After a deep dive, we found two issues, both related to different letter 
> cases between the Hive metastore schema and the Parquet schema.
> 1. The wrong column is pushed down.
> Spark pushes down FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: 
> Integer) into Parquet, but {color:#ff}ID{color} does not exist in 
> /tmp/data (Parquet is case sensitive; it actually has {color:#ff}id{color}).
> So no records are returned.
> Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore schema 
> to do the pushdown, which fixes this issue.
> 2. Spark SQL returns NULL for a column whose Hive metastore schema and 
> Parquet schema are in different letter cases, even when spark.sql.caseSensitive is 
> set to false.
> SPARK-25132 already addressed this issue.
>  
> The biggest difference is that, in Spark 2.1, the user gets an exception for the same 
> query:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> So they will know about the issue and fix the query.
> But in Spark 2.3, the user silently gets wrong results.
>  
> To make the above query work, we need both SPARK-25132 and -SPARK-24716.-
>  
> [~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25206) wrong records are returned when Hive metastore schema and parquet schema are in different letter cases

2018-08-27 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593840#comment-16593840
 ] 

Xiao Li commented on SPARK-25206:
-

Silently ignoring it is bad. Should we issue an exception like we did in 
the previous version? What do you think?

> wrong records are returned when Hive metastore schema and parquet schema are 
> in different letter cases
> --
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: Parquet, correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In the current Spark 2.3.1, the query below silently returns wrong data.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> After a deep dive, we found two issues, both related to different letter 
> cases between the Hive metastore schema and the Parquet schema.
> 1. The wrong column is pushed down.
> Spark pushes down FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: 
> Integer) into Parquet, but {color:#ff}ID{color} does not exist in 
> /tmp/data (Parquet is case sensitive; it actually has {color:#ff}id{color}).
> So no records are returned.
> Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore schema 
> to do the pushdown, which fixes this issue.
> 2. Spark SQL returns NULL for a column whose Hive metastore schema and 
> Parquet schema are in different letter cases, even when spark.sql.caseSensitive is 
> set to false.
> SPARK-25132 already addressed this issue.
>  
> The biggest difference is that, in Spark 2.1, the user gets an exception for the same 
> query:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> So they will know about the issue and fix the query.
> But in Spark 2.3, the user silently gets wrong results.
>  
> To make the above query work, we need both SPARK-25132 and -SPARK-24716.-
>  
> [~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25227) Extend functionality of to_json to support arrays of differently-typed elements

2018-08-27 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25227.
--
Resolution: Duplicate

> Extend functionality of to_json to support arrays of differently-typed 
> elements
> ---
>
> Key: SPARK-25227
> URL: https://issues.apache.org/jira/browse/SPARK-25227
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 2.3.1
>Reporter: Yuriy Davygora
>Priority: Minor
>
> At the moment, the 'to_json' function only supports a STRUCT or an ARRAY of 
> STRUCTS as input. Support for ARRAY of primitives is, apparently, coming with 
> Spark 2.4, but it will only support arrays of elements of same data type. It 
> will not, for example, support JSON-arrays like
> {noformat}
> ["string_value", 0, true, null]
> {noformat}
> which is JSON-valid with schema
> {noformat}
> {"containsNull":true,"elementType":["string","integer","boolean"],"type":"array"}
> {noformat}
> We would like to kindly ask you to add support for different-typed element 
> arrays in the 'to_json' function. This will necessitate extending the 
> functionality of ArrayType or maybe adding a new type (refer to 
> [[SPARK-25225]])



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25226) Extend functionality of from_json to support arrays of differently-typed elements

2018-08-27 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593819#comment-16593819
 ] 

Hyukjin Kwon commented on SPARK-25226:
--

This is fixed in the current master:

{code}
>>> df = df.withColumn("parsed_data", F.from_json(F.col('data'),
...                    ArrayType(StringType())))  # Does not work, because not a struct of array of structs
>>> df.show()
+--------------------+---+--------------------+
|                data| id|         parsed_data|
+--------------------+---+--------------------+
|["string1", true,...|  1|    [string1, true,]|
|["string2", false...|  2|   [string2, false,]|
|["string3", true,...|  3|[string3, true, a...|
+--------------------+---+--------------------+
{code}

Catalog string is preferred over JSON string support, which should rather be 
deprecated.

> Extend functionality of from_json to support arrays of differently-typed 
> elements
> -
>
> Key: SPARK-25226
> URL: https://issues.apache.org/jira/browse/SPARK-25226
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 2.3.1
>Reporter: Yuriy Davygora
>Priority: Minor
>
> At the moment, the 'from_json' function only supports a STRUCT or an ARRAY of 
> STRUCTS as input. Support for ARRAY of primitives is, apparently, coming with 
> Spark 2.4, but it will only support arrays of elements of same data type. It 
> will not, for example, support JSON-arrays like
> {noformat}
> ["string_value", 0, true, null]
> {noformat}
> which is JSON-valid with schema
> {noformat}
> {"containsNull":true,"elementType":["string","integer","boolean"],"type":"array"}
> {noformat}
> We would like to kindly ask you to add support for different-typed element 
> arrays in the 'from_json' function. This will necessitate extending the 
> functionality of ArrayType or maybe adding a new type (refer to 
> [[SPARK-25225]])



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25226) Extend functionality of from_json to support arrays of differently-typed elements

2018-08-27 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25226.
--
Resolution: Won't Fix

> Extend functionality of from_json to support arrays of differently-typed 
> elements
> -
>
> Key: SPARK-25226
> URL: https://issues.apache.org/jira/browse/SPARK-25226
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 2.3.1
>Reporter: Yuriy Davygora
>Priority: Minor
>
> At the moment, the 'from_json' function only supports a STRUCT or an ARRAY of 
> STRUCTS as input. Support for ARRAY of primitives is, apparently, coming with 
> Spark 2.4, but it will only support arrays of elements of same data type. It 
> will not, for example, support JSON-arrays like
> {noformat}
> ["string_value", 0, true, null]
> {noformat}
> which is JSON-valid with schema
> {noformat}
> {"containsNull":true,"elementType":["string","integer","boolean"],"type":"array"}
> {noformat}
> We would like to kindly ask you to add support for different-typed element 
> arrays in the 'from_json' function. This will necessitate extending the 
> functionality of ArrayType or maybe adding a new type (refer to 
> [[SPARK-25225]])



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25226) Extend functionality of from_json to support arrays of differently-typed elements

2018-08-27 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593816#comment-16593816
 ] 

Hyukjin Kwon commented on SPARK-25226:
--

can you use:

{code}
>>> df = df.withColumn("parsed_data", F.from_json(F.col('data'), "array<string>"))
>>> df.show()
+--------------------+---+--------------------+
|                data| id|         parsed_data|
+--------------------+---+--------------------+
|["string1", true,...|  1|    [string1, true,]|
|["string2", false...|  2|   [string2, false,]|
|["string3", true,...|  3|[string3, true, a...|
+--------------------+---+--------------------+
{code}

instead?

> Extend functionality of from_json to support arrays of differently-typed 
> elements
> -
>
> Key: SPARK-25226
> URL: https://issues.apache.org/jira/browse/SPARK-25226
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 2.3.1
>Reporter: Yuriy Davygora
>Priority: Minor
>
> At the moment, the 'from_json' function only supports a STRUCT or an ARRAY of 
> STRUCTS as input. Support for ARRAY of primitives is, apparently, coming with 
> Spark 2.4, but it will only support arrays of elements of same data type. It 
> will not, for example, support JSON-arrays like
> {noformat}
> ["string_value", 0, true, null]
> {noformat}
> which is JSON-valid with schema
> {noformat}
> {"containsNull":true,"elementType":["string","integer","boolean"],"type":"array"}
> {noformat}
> We would like to kindly ask you to add support for different-typed element 
> arrays in the 'from_json' function. This will necessitate extending the 
> functionality of ArrayType or maybe adding a new type (refer to 
> [[SPARK-25225]])



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25225) Add support for "List"-Type columns

2018-08-27 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593810#comment-16593810
 ] 

Hyukjin Kwon commented on SPARK-25225:
--

{quote}
 I could cast it to a string, but then later, when I convert it to JSON, it will 
receive quotation marks around it, and I don't want that; I want whatever 
client software reads this JSON to read an integer and not a string.
{quote}

Would you mind if I ask to clarify this?

> Add support for "List"-Type columns
> ---
>
> Key: SPARK-25225
> URL: https://issues.apache.org/jira/browse/SPARK-25225
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 2.3.1
>Reporter: Yuriy Davygora
>Priority: Minor
>
> At the moment, Spark Dataframe ArrayType-columns only support all elements of 
> the array being of same data type.
> At our company, we are currently rewriting old MapReduce code with Spark. One 
> of the frequent use-cases is aggregating data into timeseries:
> Example input:
> {noformat}
> ID    date          data
> 1     2017-01-01    data_1_1
> 1     2018-02-02    data_1_2
> 2     2017-03-03    data_2_1
> 2     2018-04-04    data_2_2
> ...
> {noformat}
> Expected output:
> {noformat}
> ID    timeseries
> 1     [[2017-01-01, data_1_1],[2018-02-02, data_1_2]]
> 2     [[2017-03-03, data_2_1],[2018-04-04, data_2_2]]
> ...
> {noformat}
> Here, the values in the data column of the input are, in most cases, not 
> primitive, but, for example, lists, dicts, nested lists, etc. Spark, however, 
> does not support creating an array column of a string column and a non-string 
> column.
> We would like to kindly ask you to implement one of the following:
> 1. Extend ArrayType to support elements of different data type
> 2. Introduce a new container type (ListType?) which would support elements of 
> different type
> UPDATE: The background here is that I want to be able to parse JSON arrays 
> of differently-typed elements into Spark Dataframe columns, as well as create 
> JSON arrays from such columns. See also [[SPARK-25226]] and [[SPARK-25227]]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25250) Race condition with tasks running when new attempt for same stage is created leads to other task in the next attempt running on the same partition id retry multiple time

2018-08-27 Thread Parth Gandhi (JIRA)
Parth Gandhi created SPARK-25250:


 Summary: Race condition with tasks running when new attempt for 
same stage is created leads to other task in the next attempt running on the 
same partition id retry multiple times
 Key: SPARK-25250
 URL: https://issues.apache.org/jira/browse/SPARK-25250
 Project: Spark
  Issue Type: Bug
  Components: Scheduler, Spark Core
Affects Versions: 2.3.1
Reporter: Parth Gandhi


We recently had a scenario where a race condition occurred: a task from the 
previous stage attempt finished just before a new attempt for the same stage was 
created due to a fetch failure, so the new task created in the second attempt on 
the same partition id kept retrying multiple times due to a TaskCommitDenied 
exception, without realizing that the task in the earlier attempt was already 
successful.

For example, consider a task with partition id 9000 and index 9000 running in 
stage 4.0. We see a fetch failure, so we spawn a new stage attempt 4.1. 
Within this timespan, the above task completes successfully, thus marking 
partition id 9000 as complete for 4.0. However, as stage 4.1 has not yet 
been created, the task set info for that stage is not available to the 
TaskScheduler, so, naturally, partition id 9000 has not been marked 
completed for 4.1. Stage 4.1 now spawns a task with index 2000 on the same 
partition id 9000. This task fails due to CommitDeniedException and, since it 
does not see the corresponding partition id marked successful, it keeps 
retrying multiple times.

 

Steps to Reproduce:
 # Run any large job involving a shuffle operation.
 # When the ShuffleMap stage finishes and the ResultStage begins running, cause 
this stage to throw a fetch failure exception (try deleting certain shuffle 
files on any host).
 # Observe the task attempt numbers for the next stage attempt. Please note 
that this issue is an intermittent one, so it might not happen all the time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24009) spark2.3.0 INSERT OVERWRITE LOCAL DIRECTORY '/home/spark/aaaaab'

2018-08-27 Thread Bang Xiao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593445#comment-16593445
 ] 

Bang Xiao commented on SPARK-24009:
---

Any progress here? I ran into the same error.

sql: INSERT OVERWRITE LOCAL DIRECTORY '/search/odin/test' row format delimited 
FIELDS TERMINATED BY '\t' select vrid, query, url, loc_city from 
custom.common_wap_vr where logdate >= '2018073000' and logdate <= '2018073023' 
and vrid = '11000801' group by vrid,query, loc_city,url; 

spark command is : spark-sql --master yarn --deploy-mode client -f test.sql

*hive.exec.scratchdir* : /user/hive/datadir-tmp

*hive.exec.stagingdir* : /user/hive/datadir-tmp

18/08/27 17:16:21 ERROR util.Utils: Aborting task 
org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: Mkdirs 
failed to create 
file:/user/hive/datadir-tmp_hive_2018-08-27_17-14-45_908_2829491226961893146-1/-ext-1/_temporary/0/_temporary/attempt_20180827171619_0002_m_00_0
 (exists=false, 
cwd=file:/search/hadoop09/yarn_local/usercache/ultraman/appcache/application_1535079600137_133521/container_e09_1535079600137_133521_01_51)
 at 
org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:249)
 at 
org.apache.spark.sql.hive.execution.HiveOutputWriter.(HiveFileFormat.scala:123)
 at 
org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:103)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:367)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:378)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:269)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:267)
 at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at 
org.apache.spark.scheduler.Task.run(Task.scala:109) at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
at java.lang.Thread.run(Thread.java:748) Caused by: java.io.IOException: Mkdirs 
failed to create 
file:/user/hive/datadir-tmp_hive_2018-08-27_17-14-45_908_2829491226961893146-1/-ext-1/_temporary/0/_temporary/attempt_20180827171619_0002_m_00_0
 (exists=false, 
cwd=file:/search/hadoop09/yarn_local/usercache/ultraman/appcache/application_1535079600137_133521/container_e09_1535079600137_133521_01_51)
 at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:447) 
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:433) 
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:925) at 
org.apache.hadoop.fs.FileSystem.create(FileSystem.java:818) at 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat.getHiveRecordWriter(HiveIgnoreKeyTextOutputFormat.java:80)
 at 
org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getRecordWriter(HiveFileFormatUtils.java:261)
 at 
org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:246)

> spark2.3.0 INSERT OVERWRITE LOCAL DIRECTORY '/home/spark/ab' 
> -
>
> Key: SPARK-24009
> URL: https://issues.apache.org/jira/browse/SPARK-24009
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: chris_j
>Priority: Major
>
> In local mode, Spark executes "INSERT OVERWRITE LOCAL DIRECTORY" successfully.
> On YARN, Spark fails to execute "INSERT OVERWRITE LOCAL DIRECTORY"; it is not a 
> permission problem either.
>  
> 1.spark-sql -e "INSERT OVERWRITE LOCAL DIRECTORY '/home/spark/ab'row 
> format delimited FIELDS TERMINATED BY '\t' STORED AS TEXTFILE select * from 
> default.dim_date"  write local directory successful
> 2.spark-sql  --master yarn -e "INSERT OVERWRITE DIRECTORY 'ab'row format 
> delimited FIELDS TERMINATED BY '\t' STORED AS TEXTFILE select * from 
> default.dim_da

[jira] [Commented] (SPARK-25227) Extend functionality of to_json to support arrays of differently-typed elements

2018-08-27 Thread Maxim Gekk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593368#comment-16593368
 ] 

Maxim Gekk commented on SPARK-25227:


> I don't know about to_json. Maybe Maxim Gekk can comment more on that.
Here is the PR for that: https://github.com/apache/spark/pull/6 . Please, 
review it.

> Extend functionality of to_json to support arrays of differently-typed 
> elements
> ---
>
> Key: SPARK-25227
> URL: https://issues.apache.org/jira/browse/SPARK-25227
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 2.3.1
>Reporter: Yuriy Davygora
>Priority: Minor
>
> At the moment, the 'to_json' function only supports a STRUCT or an ARRAY of 
> STRUCTS as input. Support for ARRAY of primitives is, apparently, coming with 
> Spark 2.4, but it will only support arrays of elements of same data type. It 
> will not, for example, support JSON-arrays like
> {noformat}
> ["string_value", 0, true, null]
> {noformat}
> which is JSON-valid with schema
> {noformat}
> {"containsNull":true,"elementType":["string","integer","boolean"],"type":"array"}
> {noformat}
> We would like to kindly ask you to add support for different-typed element 
> arrays in the 'to_json' function. This will necessitate extending the 
> functionality of ArrayType or maybe adding a new type (refer to 
> [[SPARK-25225]])



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25227) Extend functionality of to_json to support arrays of differently-typed elements

2018-08-27 Thread Yuriy Davygora (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593286#comment-16593286
 ] 

Yuriy Davygora commented on SPARK-25227:


[~hyukjin.kwon] I only know that in the upcoming release, the from_json 
function will support arrays of primitive types. I don't know about to_json. 
Maybe [~maxgekk] can comment more on that.

As for the code, I cannot provide you with a snippet, because I cannot even 
reach the stage where I would use the to_json function. Let's say I have a 
dataframe with two columns: "string" which is of type string and "int" which is 
of integer type. I cannot even do:

{noformat}
df = df.withColumn("new_column", F.array("string", "int"))
{noformat}

because Spark ArrayType does not support elements of different types. If this 
were otherwise, the next step for me would have been something like

{noformat}
df.select(F.to_json("new_column"))
{noformat}

See also [[SPARK-25226]]

> Extend functionality of to_json to support arrays of differently-typed 
> elements
> ---
>
> Key: SPARK-25227
> URL: https://issues.apache.org/jira/browse/SPARK-25227
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 2.3.1
>Reporter: Yuriy Davygora
>Priority: Minor
>
> At the moment, the 'to_json' function only supports a STRUCT or an ARRAY of 
> STRUCTS as input. Support for ARRAY of primitives is, apparently, coming with 
> Spark 2.4, but it will only support arrays of elements of same data type. It 
> will not, for example, support JSON-arrays like
> {noformat}
> ["string_value", 0, true, null]
> {noformat}
> which is JSON-valid with schema
> {noformat}
> {"containsNull":true,"elementType":["string","integer","boolean"],"type":"array"}
> {noformat}
> We would like to kindly ask you to add support for different-typed element 
> arrays in the 'to_json' function. This will necessitate extending the 
> functionality of ArrayType or maybe adding a new type (refer to 
> [[SPARK-25225]])



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25226) Extend functionality of from_json to support arrays of differently-typed elements

2018-08-27 Thread Yuriy Davygora (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593283#comment-16593283
 ] 

Yuriy Davygora commented on SPARK-25226:


[~hyukjin.kwon]

Yes, sure. Here is some simple Python code to reproduce the problems (using 
pyspark 2.3.1 and pandas 0.23.4):

{noformat}
import pandas as pd

from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import StringType, ArrayType

spark = SparkSession.builder.appName('test').getOrCreate()

data = {'id': [1, 2, 3],
        'data': ['["string1", true, null]', '["string2", false, null]',
                 '["string3", true, "another_string3"]']}
pdf = pd.DataFrame.from_dict(data)
df = spark.createDataFrame(pdf)
df.show()

df = df.withColumn("parsed_data", F.from_json(F.col('data'),
ArrayType(StringType( # Does not work, because not a struct of array of 
structs

df = df.withColumn("parsed_data", F.from_json(F.col('data'),

'{"containsNull":true,"elementType":["string","integer","boolean"],"type":"array"}'))
 # Does not work at all
{noformat}
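
Just as a stop-gap until from_json can handle such arrays, one workaround sketch is to pull the elements out individually as strings and cast the ones whose type is known. This is untested against the exact data above, and the root-level '$[n]' subscripts are an assumption about get_json_object's JSON-path support:

{code:python}
# Continuing from the snippet above: extract each array element as a string via
# a JSON path, then cast where the target type is known.
from pyspark.sql import functions as F

df = (df.withColumn("elem0", F.get_json_object("data", "$[0]"))
        .withColumn("elem1", F.get_json_object("data", "$[1]").cast("boolean"))
        .withColumn("elem2", F.get_json_object("data", "$[2]")))
df.show(truncate=False)
{code}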

> Extend functionality of from_json to support arrays of differently-typed 
> elements
> -
>
> Key: SPARK-25226
> URL: https://issues.apache.org/jira/browse/SPARK-25226
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 2.3.1
>Reporter: Yuriy Davygora
>Priority: Minor
>
> At the moment, the 'from_json' function only supports a STRUCT or an ARRAY of 
> STRUCTS as input. Support for ARRAY of primitives is, apparently, coming with 
> Spark 2.4, but it will only support arrays of elements of the same data type. It 
> will not, for example, support JSON-arrays like
> {noformat}
> ["string_value", 0, true, null]
> {noformat}
> which is JSON-valid with schema
> {noformat}
> {"containsNull":true,"elementType":["string","integer","boolean"],"type":"array"}
> {noformat}
> We would like to kindly ask you to add support for differently-typed element 
> arrays in the 'from_json' function. This will necessitate extending the 
> functionality of ArrayType or maybe adding a new type (refer to 
> [[SPARK-25225]])



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25225) Add support for "List"-Type columns

2018-08-27 Thread Yuriy Davygora (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593282#comment-16593282
 ] 

Yuriy Davygora commented on SPARK-25225:


[~maropu] Sorry, I was not quite clear on the ultimate purpose, which is to 
convert such columns to JSON arrays and also to convert JSON arrays to Spark 
Dataframe columns.

[~hyukjin.kwon] Let's say that, in the above example, 'data' is just an 
integer. I could cast it to a string, but then later, when I convert it to 
JSON, it will receive quotation marks around it, and I don't want that: I want 
whatever client software reads this JSON to read an integer and not a string.

Besides, in most cases, 'data' is not a primitive-typed column. For example, in 
the code that I am working on right now, it is an array of arrays of integers, 
which cannot be cast to a string that easily.
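
To make the quotation-mark point concrete, here is a minimal sketch (hypothetical column names, assuming PySpark 2.3.x) using to_json on a struct, which is already supported:

{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("json_quoting_demo").getOrCreate()
df = spark.createDataFrame([("2017-01-01", 42)], ["date", "data"])

# Keeping 'data' as an integer serializes it without quotes:
#   {"date":"2017-01-01","data":42}
df.select(F.to_json(F.struct("date", "data"))).show(truncate=False)

# Casting it to string first wraps it in quotation marks, which is what I want to avoid:
#   {"date":"2017-01-01","data":"42"}
df.select(F.to_json(F.struct("date", df["data"].cast("string").alias("data")))).show(truncate=False)
{code}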

> Add support for "List"-Type columns
> ---
>
> Key: SPARK-25225
> URL: https://issues.apache.org/jira/browse/SPARK-25225
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 2.3.1
>Reporter: Yuriy Davygora
>Priority: Minor
>
> At the moment, Spark Dataframe ArrayType-columns only support all elements of 
> the array being of same data type.
> At our company, we are currently rewriting old MapReduce code with Spark. One 
> of the frequent use-cases is aggregating data into timeseries:
> Example input:
> {noformat}
> IDdatedata
> 1 2017-01-01  data_1_1
> 1 2018-02-02  data_1_2
> 2 2017-03-03  data_2_1
> 3 2018-04-04  data 2_2
> ...
> {noformat}
> Expected output:
> {noformat}
> IDtimeseries
> 1 [[2017-01-01, data_1_1],[2018-02-02, data1_2]]
> 2 [[2017-03-03, data_2_1],[2018-04-04, data2_2]]
> ...
> {noformat}
> Here, the values in the data column of the input are, in most cases, not 
> primitive, but, for example, lists, dicts, nested lists, etc. Spark, however, 
> does not support creating an array column of a string column and a non-string 
> column.
> We would like to kindly ask you to implement one of the following:
> 1. Extend ArrayType to support elements of different data type
> 2. Introduce a new container type (ListType?) which would support elements of 
> different type
> UPDATE: The background here is that I want to be able to parse JSON-arrays 
> of differently-typed elements into SPARK Dataframe columns, as well as create 
> JSON arrays from such columns. See also [[SPARK-25226]] and [[SPARK-25227]]
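
For completeness, the timeseries aggregation described above can already be sketched today as long as 'data' has a single type; the limitation in this ticket appears once the per-row entries would need a mixed-type ArrayType. A rough, hypothetical sketch assuming PySpark 2.3.x:

{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("timeseries_agg_demo").getOrCreate()
df = spark.createDataFrame(
    [(1, "2017-01-01", "data_1_1"),
     (1, "2018-02-02", "data_1_2"),
     (2, "2017-03-03", "data_2_1"),
     (2, "2018-04-04", "data_2_2")],
    ["ID", "date", "data"])

# Collect (date, data) structs per ID and sort them chronologically.
timeseries = (df.groupBy("ID")
                .agg(F.sort_array(F.collect_list(F.struct("date", "data")))
                      .alias("timeseries")))
timeseries.show(truncate=False)
{code}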



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25175) Case-insensitive field resolution when reading from ORC

2018-08-27 Thread Chenxiao Mao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chenxiao Mao updated SPARK-25175:
-
Description: 
SPARK-25132 adds support for case-insensitive field resolution when reading 
from Parquet files. We found ORC files have similar issues. Since Spark has two 
OrcFileFormat implementations, we should add support for both.
 * Since SPARK-2883, Spark supports ORC inside the sql/hive module with a Hive 
dependency. This hive OrcFileFormat always does case-insensitive field resolution 
regardless of case sensitivity mode. When there is ambiguity, hive 
OrcFileFormat always returns the lower case field, rather than failing the 
reading operation.
 * SPARK-20682 adds a new ORC data source inside sql/core. This native 
OrcFileFormat supports case-insensitive field resolution, however it cannot 
handle duplicate fields.

Besides data source tables, hive serde tables also have issues. When there are 
duplicate fields (e.g. c, C), we just can't read hive serde tables. If there 
are no duplicate fields, hive serde tables always do case-insensitive field 
resolution regardless of case sensitivity mode.

  was:
SPARK-25132 adds support for case-insensitive field resolution when reading 
from Parquet files. We found ORC files have similar issues. Since Spark has two 
OrcFileFormat implementations, we should add support for both.
 * Since SPARK-2883, Spark supports ORC inside the sql/hive module with a Hive 
dependency. This hive OrcFileFormat always does case-insensitive field resolution 
regardless of case sensitivity mode. When there is ambiguity, hive 
OrcFileFormat always returns the lower case field, rather than failing the 
reading operation.
 * SPARK-20682 adds a new ORC data source inside sql/core. This native 
OrcFileFormat supports case-insensitive field resolution, however it cannot 
handle duplicate fields.

Besides data source tables, hive serde tables also have issues. When there is 
ambiguity, we just can't read hive serde tables.


> Case-insensitive field resolution when reading from ORC
> ---
>
> Key: SPARK-25175
> URL: https://issues.apache.org/jira/browse/SPARK-25175
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chenxiao Mao
>Priority: Major
>
> SPARK-25132 adds support for case-insensitive field resolution when reading 
> from Parquet files. We found ORC files have similar issues. Since Spark has two 
> OrcFileFormat implementations, we should add support for both.
>  * Since SPARK-2883, Spark supports ORC inside the sql/hive module with a Hive 
> dependency. This hive OrcFileFormat always does case-insensitive field 
> resolution regardless of case sensitivity mode. When there is ambiguity, hive 
> OrcFileFormat always returns the lower case field, rather than failing the 
> reading operation.
>  * SPARK-20682 adds a new ORC data source inside sql/core. This native 
> OrcFileFormat supports case-insensitive field resolution, however it cannot 
> handle duplicate fields.
> Besides data source tables, hive serde tables also have issues. When there 
> are duplicate fields (e.g. c, C), we just can't read hive serde tables. If 
> there are no duplicate fields, hive serde tables always do case-insensitive 
> field resolution regardless of case sensitivity mode.
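
As a quick way to poke at the behaviour described above, here is a minimal PySpark sketch (paths and column names are made up; the reader in use is toggled via spark.sql.orc.impl, and the exact results on 2.3.1 are precisely what this issue is about, so they are not asserted here):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orc_case_resolution_demo").getOrCreate()

# Write an ORC file whose physical schema uses upper-case field names.
spark.range(5).selectExpr("id AS A", "id * 2 AS B") \
    .write.mode("overwrite").orc("/tmp/orc_case_demo")

# Read it back through a lower-case schema with both readers and compare.
spark.conf.set("spark.sql.caseSensitive", "false")
for impl in ["native", "hive"]:
    spark.conf.set("spark.sql.orc.impl", impl)
    df = spark.read.schema("a LONG, b LONG").orc("/tmp/orc_case_demo")
    print(impl)
    df.show()  # check whether a/b resolve to A/B or come back null
{code}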



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25225) Add support for "List"-Type columns

2018-08-27 Thread Yuriy Davygora (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuriy Davygora updated SPARK-25225:
---
Description: 
At the moment, Spark Dataframe ArrayType-columns only support all elements of 
the array being of same data type.

At our company, we are currently rewriting old MapReduce code with Spark. One 
of the frequent use-cases is aggregating data into timeseries:

Example input:
{noformat}
ID  datedata
1   2017-01-01  data_1_1
1   2018-02-02  data_1_2
2   2017-03-03  data_2_1
3   2018-04-04  data 2_2
...
{noformat}

Expected output:
{noformat}
ID  timeseries
1   [[2017-01-01, data_1_1],[2018-02-02, data1_2]]
2   [[2017-03-03, data_2_1],[2018-04-04, data2_2]]
...
{noformat}

Here, the values in the data column of the input are, in most cases, not 
primitive, but, for example, lists, dicts, nested lists, etc. Spark, however, 
does not support creating an array column of a string column and a non-string 
column.

We would like to kindly ask you to implement one of the following:

1. Extend ArrayType to support elements of different data type

2. Introduce a new container type (ListType?) which would support elements of 
different type

UPDATE: The background here is that I want to be able to parse JSON-arrays of 
differently-typed elements into SPARK Dataframe columns, as well as create JSON 
arrays from such columns.

  was:
At the moment, Spark Dataframe ArrayType-columns only support all elements of 
the array being of same data type.

At our company, we are currently rewriting old MapReduce code with Spark. One 
of the frequent use-cases is aggregating data into timeseries:

Example input:
{noformat}
ID  datedata
1   2017-01-01  data_1_1
1   2018-02-02  data_1_2
2   2017-03-03  data_2_1
3   2018-04-04  data 2_2
...
{noformat}

Expected output:
{noformat}
ID  timeseries
1   [[2017-01-01, data_1_1],[2018-02-02, data1_2]]
2   [[2017-03-03, data_2_1],[2018-04-04, data2_2]]
...
{noformat}

Here, the values in the data column of the input are, in most cases, not 
primitive, but, for example, lists, dicts, nested lists, etc. Spark, however, 
does not support creating an array column of a string column and a non-string 
column.

We would like to kindly ask you to implement one of the following:

1. Extend ArrayType to support elements of different data type

2. Introduce a new container type (ListType?) which would support elements of 
different type




> Add support for "List"-Type columns
> ---
>
> Key: SPARK-25225
> URL: https://issues.apache.org/jira/browse/SPARK-25225
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 2.3.1
>Reporter: Yuriy Davygora
>Priority: Minor
>
> At the moment, Spark Dataframe ArrayType-columns only support all elements of 
> the array being of same data type.
> At our company, we are currently rewriting old MapReduce code with Spark. One 
> of the frequent use-cases is aggregating data into timeseries:
> Example input:
> {noformat}
> IDdatedata
> 1 2017-01-01  data_1_1
> 1 2018-02-02  data_1_2
> 2 2017-03-03  data_2_1
> 3 2018-04-04  data 2_2
> ...
> {noformat}
> Expected output:
> {noformat}
> IDtimeseries
> 1 [[2017-01-01, data_1_1],[2018-02-02, data1_2]]
> 2 [[2017-03-03, data_2_1],[2018-04-04, data2_2]]
> ...
> {noformat}
> Here, the values in the data column of the input are, in most cases, not 
> primitive, but, for example, lists, dicts, nested lists, etc. Spark, however, 
> does not support creating an array column of a string column and a non-string 
> column.
> We would like to kindly ask you to implement one of the following:
> 1. Extend ArrayType to support elements of different data type
> 2. Introduce a new container type (ListType?) which would support elements of 
> different type
> UPDATE: The background here is that I want to be able to parse JSON-arrays 
> of differently-typed elements into SPARK Dataframe columns, as well as create 
> JSON arrays from such columns.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25225) Add support for "List"-Type columns

2018-08-27 Thread Yuriy Davygora (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuriy Davygora updated SPARK-25225:
---
Description: 
At the moment, Spark Dataframe ArrayType-columns only support all elements of 
the array being of same data type.

At our company, we are currently rewriting old MapReduce code with Spark. One 
of the frequent use-cases is aggregating data into timeseries:

Example input:
{noformat}
ID  datedata
1   2017-01-01  data_1_1
1   2018-02-02  data_1_2
2   2017-03-03  data_2_1
3   2018-04-04  data 2_2
...
{noformat}

Expected output:
{noformat}
ID  timeseries
1   [[2017-01-01, data_1_1],[2018-02-02, data1_2]]
2   [[2017-03-03, data_2_1],[2018-04-04, data2_2]]
...
{noformat}

Here, the values in the data column of the input are, in most cases, not 
primitive, but, for example, lists, dicts, nested lists, etc. Spark, however, 
does not support creating an array column of a string column and a non-string 
column.

We would like to kindly ask you to implement one of the following:

1. Extend ArrayType to support elements of different data type

2. Introduce a new container type (ListType?) which would support elements of 
different type

UPDATE: The background here is that I want to be able to parse JSON-arrays of 
differently-typed elements into SPARK Dataframe columns, as well as create JSON 
arrays from such columns. See also [[SPARK-25226]] and [[SPARK-25227]]

  was:
At the moment, Spark Dataframe ArrayType-columns only support all elements of 
the array being of same data type.

At our company, we are currently rewriting old MapReduce code with Spark. One 
of the frequent use-cases is aggregating data into timeseries:

Example input:
{noformat}
ID  datedata
1   2017-01-01  data_1_1
1   2018-02-02  data_1_2
2   2017-03-03  data_2_1
3   2018-04-04  data 2_2
...
{noformat}

Expected output:
{noformat}
ID  timeseries
1   [[2017-01-01, data_1_1],[2018-02-02, data1_2]]
2   [[2017-03-03, data_2_1],[2018-04-04, data2_2]]
...
{noformat}

Here, the values in the data column of the input are, in most cases, not 
primitive, but, for example, lists, dicts, nested lists, etc. Spark, however, 
does not support creating an array column of a string column and a non-string 
column.

We would like to kindly ask you to implement one of the following:

1. Extend ArrayType to support elements of different data type

2. Introduce a new container type (ListType?) which would support elements of 
different type

UPDATE: The background here is that I want to be able to parse JSON-arrays of 
differently-typed elements into SPARK Dataframe columns, as well as create JSON 
arrays from such columns.


> Add support for "List"-Type columns
> ---
>
> Key: SPARK-25225
> URL: https://issues.apache.org/jira/browse/SPARK-25225
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 2.3.1
>Reporter: Yuriy Davygora
>Priority: Minor
>
> At the moment, Spark Dataframe ArrayType-columns only support all elements of 
> the array being of same data type.
> At our company, we are currently rewriting old MapReduce code with Spark. One 
> of the frequent use-cases is aggregating data into timeseries:
> Example input:
> {noformat}
> IDdatedata
> 1 2017-01-01  data_1_1
> 1 2018-02-02  data_1_2
> 2 2017-03-03  data_2_1
> 3 2018-04-04  data 2_2
> ...
> {noformat}
> Expected output:
> {noformat}
> IDtimeseries
> 1 [[2017-01-01, data_1_1],[2018-02-02, data1_2]]
> 2 [[2017-03-03, data_2_1],[2018-04-04, data2_2]]
> ...
> {noformat}
> Here, the values in the data column of the input are, in most cases, not 
> primitive, but, for example, lists, dicts, nested lists, etc. Spark, however, 
> does not support creating an array column of a string column and a non-string 
> column.
> We would like to kindly ask you to implement one of the following:
> 1. Extend ArrayType to support elements of different data type
> 2. Introduce a new container type (ListType?) which would support elements of 
> different type
> UPDATE: The background here is that I want to be able to parse JSON-arrays 
> of differently-typed elements into SPARK Dataframe columns, as well as create 
> JSON arrays from such columns. See also [[SPARK-25226]] and [[SPARK-25227]]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25227) Extend functionality of to_json to support arrays of differently-typed elements

2018-08-27 Thread Yuriy Davygora (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuriy Davygora updated SPARK-25227:
---
Summary: Extend functionality of to_json to support arrays of 
differently-typed elements  (was: Extend functionality of to_json)

> Extend functionality of to_json to support arrays of differently-typed 
> elements
> ---
>
> Key: SPARK-25227
> URL: https://issues.apache.org/jira/browse/SPARK-25227
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 2.3.1
>Reporter: Yuriy Davygora
>Priority: Minor
>
> At the moment, the 'to_json' function only supports a STRUCT or an ARRAY of 
> STRUCTS as input. Support for ARRAY of primitives is, apparently, coming with 
> Spark 2.4, but it will only support arrays of elements of the same data type. It 
> will not, for example, support JSON-arrays like
> {noformat}
> ["string_value", 0, true, null]
> {noformat}
> which is JSON-valid with schema
> {noformat}
> {"containsNull":true,"elementType":["string","integer","boolean"],"type":"array"}
> {noformat}
> We would like to kindly ask you to add support for differently-typed element 
> arrays in the 'to_json' function. This will necessitate extending the 
> functionality of ArrayType or maybe adding a new type (refer to 
> [[SPARK-25225]])



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25226) Extend functionality of from_json to support arrays of differently-typed elements

2018-08-27 Thread Yuriy Davygora (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuriy Davygora updated SPARK-25226:
---
Summary: Extend functionality of from_json to support arrays of 
differently-typed elements  (was: Extend functionality of from_json)

> Extend functionality of from_json to support arrays of differently-typed 
> elements
> -
>
> Key: SPARK-25226
> URL: https://issues.apache.org/jira/browse/SPARK-25226
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 2.3.1
>Reporter: Yuriy Davygora
>Priority: Minor
>
> At the moment, the 'from_json' function only supports a STRUCT or an ARRAY of 
> STRUCTS as input. Support for ARRAY of primitives is, apparently, coming with 
> Spark 2.4, but it will only support arrays of elements of the same data type. It 
> will not, for example, support JSON-arrays like
> {noformat}
> ["string_value", 0, true, null]
> {noformat}
> which is JSON-valid with schema
> {noformat}
> {"containsNull":true,"elementType":["string","integer","boolean"],"type":"array"}
> {noformat}
> We would like to kindly ask you to add support for differently-typed element 
> arrays in the 'from_json' function. This will necessitate extending the 
> functionality of ArrayType or maybe adding a new type (refer to 
> [[SPARK-25225]])



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25175) Case-insensitive field resolution when reading from ORC

2018-08-27 Thread Chenxiao Mao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593194#comment-16593194
 ] 

Chenxiao Mao edited comment on SPARK-25175 at 8/27/18 7:50 AM:
---

[~dongjoon] [~yucai] Here is a brief summary. We can see that:
 * The data source tables with the hive impl always return a,B,c, no matter whether 
spark.sql.caseSensitive is set to true or false and no matter whether the metastore 
table schema is in lower case or upper case. It seems they always do case-insensitive 
field resolution, and if there is ambiguity they return the lower-case ones. Given 
that the ORC file schema is (a,B,c,C):
 ** Is it better to return null in scenarios 2 and 10?
 ** Is it better to return C in scenario 12?
 ** Is it better to fail due to ambiguity in scenarios 15, 18, 21, 24, rather 
than always returning the lower-case one?

 * The data source tables with the native impl, compared to the hive impl, handle 
scenarios 2, 10, 12 in a more reasonable way. However, they handle ambiguity in 
the same way as the hive impl.
 * The hive serde tables always throw IndexOutOfBoundsException at runtime when 
there are duplicate fields (e.g. c, C). If there are no duplicate fields, it 
seems hive serde tables always do case-insensitive field resolution, just like 
the hive implementation of OrcFileFormat.
 * Since, in case-sensitive mode, analysis should fail if a column name in the 
query and the metastore schema are in different cases, all AnalysisException(s) 
meet our expectation.

Stacktrace of IndexOutOfBoundsException:
{code:java}
java.lang.IndexOutOfBoundsException: toIndex = 4
at java.util.ArrayList.subListRangeCheck(ArrayList.java:1004)
at java.util.ArrayList.subList(ArrayList.java:996)
at 
org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.getSchemaOnRead(RecordReaderFactory.java:161)
at 
org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.createTreeReader(RecordReaderFactory.java:66)
at 
org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.<init>(RecordReaderImpl.java:202)
at 
org.apache.hadoop.hive.ql.io.orc.ReaderImpl.rowsOptions(ReaderImpl.java:539)
at 
org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$ReaderPair.<init>(OrcRawRecordMerger.java:183)
at 
org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$OriginalReaderPair.<init>(OrcRawRecordMerger.java:226)
at 
org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger.<init>(OrcRawRecordMerger.java:437)
at 
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getReader(OrcInputFormat.java:1215)
at 
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1113)
at 
org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:257)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:256)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:214)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:94)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
{code}
 


was (Author: seancxmao):
[~dongjoon] [~yucai] Here is a brief summary

[jira] [Resolved] (SPARK-24978) Add spark.sql.fast.hash.aggregate.row.max.capacity to configure the capacity of fast aggregation.

2018-08-27 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-24978.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21931
[https://github.com/apache/spark/pull/21931]

> Add spark.sql.fast.hash.aggregate.row.max.capacity to configure the capacity 
> of fast aggregation.
> -
>
> Key: SPARK-24978
> URL: https://issues.apache.org/jira/browse/SPARK-24978
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: caoxuewen
>Assignee: caoxuewen
>Priority: Major
> Fix For: 2.4.0
>
>
> This PR adds a configuration parameter to configure the capacity of fast 
> aggregation.
> Performance comparison:
>     /*
>     Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27 on Windows 7 6.1
>     Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
>     Aggregate w multiple keys:  Best/Avg Time(ms)  Rate(M/s)  Per Row(ns)  Relative
>     ---------------------------------------------------------------------------------
>     fasthash = default               5612 / 5882        3.7        267.6      1.0X
>     fasthash = config                3586 / 3595        5.8        171.0      1.6X
>     */
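
For reference, a minimal usage sketch, assuming the configuration key keeps the name from the issue title (the capacity value below is made up):

{code:python}
# Hypothetical sketch: the key name is taken from the issue title and the
# capacity value is arbitrary; measure before tuning.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("fast_hash_agg_capacity_demo")
         .config("spark.sql.fast.hash.aggregate.row.max.capacity", 1 << 20)
         .getOrCreate())

# Any subsequent hash aggregation runs with the configured fast-hash-map capacity.
spark.range(0, 10000000).selectExpr("id % 1000 AS k", "id AS v") \
    .groupBy("k").sum("v").collect()
{code}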



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24978) Add spark.sql.fast.hash.aggregate.row.max.capacity to configure the capacity of fast aggregation.

2018-08-27 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-24978:
---

Assignee: caoxuewen

> Add spark.sql.fast.hash.aggregate.row.max.capacity to configure the capacity 
> of fast aggregation.
> -
>
> Key: SPARK-24978
> URL: https://issues.apache.org/jira/browse/SPARK-24978
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: caoxuewen
>Assignee: caoxuewen
>Priority: Major
>
> This PR adds a configuration parameter to configure the capacity of fast 
> aggregation.
> Performance comparison:
>     /*
>     Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27 on Windows 7 6.1
>     Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
>     Aggregate w multiple keys:  Best/Avg Time(ms)  Rate(M/s)  Per Row(ns)  Relative
>     ---------------------------------------------------------------------------------
>     fasthash = default               5612 / 5882        3.7        267.6      1.0X
>     fasthash = config                3586 / 3595        5.8        171.0      1.6X
>     */



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25249) Add a unit test for OpenHashMap

2018-08-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593268#comment-16593268
 ] 

Apache Spark commented on SPARK-25249:
--

User '10110346' has created a pull request for this issue:
https://github.com/apache/spark/pull/22241

> Add a unit test for OpenHashMap
> ---
>
> Key: SPARK-25249
> URL: https://issues.apache.org/jira/browse/SPARK-25249
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: liuxian
>Priority: Minor
>
> Adding a unit test for OpenHashMap; this can help developers distinguish 
> between the values 0/0.0/0L and a non-existent value.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25249) Add a unit test for OpenHashMap

2018-08-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25249:


Assignee: (was: Apache Spark)

> Add a unit test for OpenHashMap
> ---
>
> Key: SPARK-25249
> URL: https://issues.apache.org/jira/browse/SPARK-25249
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: liuxian
>Priority: Minor
>
> Adding a unit test for OpenHashMap; this can help developers distinguish 
> between the values 0/0.0/0L and a non-existent value.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25249) Add a unit test for OpenHashMap

2018-08-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25249:


Assignee: Apache Spark

> Add a unit test for OpenHashMap
> ---
>
> Key: SPARK-25249
> URL: https://issues.apache.org/jira/browse/SPARK-25249
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: liuxian
>Assignee: Apache Spark
>Priority: Minor
>
> Adding a unit test for OpenHashMap; this can help developers distinguish 
> between the values 0/0.0/0L and a non-existent value.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25175) Case-insensitive field resolution when reading from ORC

2018-08-27 Thread Chenxiao Mao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593185#comment-16593185
 ] 

Chenxiao Mao edited comment on SPARK-25175 at 8/27/18 7:33 AM:
---

Thorough investigation about ORC tables with duplicate fields (c and C).
{code:java}
val data = spark.range(5).selectExpr("id as a", "id * 2 as B", "id * 3 as c", 
"id * 4 as C")
spark.conf.set("spark.sql.caseSensitive", true)
data.write.format("orc").mode("overwrite").save("/user/hive/warehouse/orc_data")

$> hive --orcfiledump 
/user/hive/warehouse/orc_data/part-1-9716d241-9ad9-4d56-8de3-7bc482067614-c000.snappy.orc
Structure for 
/user/hive/warehouse/orc_data/part-1-9716d241-9ad9-4d56-8de3-7bc482067614-c000.snappy.orc
Type: struct<a:bigint,B:bigint,c:bigint,C:bigint>

CREATE TABLE orc_data_source_lower (a LONG, b LONG, c LONG) USING orc LOCATION 
'/user/hive/warehouse/orc_data'
CREATE TABLE orc_data_source_upper (A LONG, B LONG, C LONG) USING orc LOCATION 
'/user/hive/warehouse/orc_data'
CREATE TABLE orc_hive_serde_lower (a LONG, b LONG, c LONG) STORED AS orc 
LOCATION '/user/hive/warehouse/orc_data'
CREATE TABLE orc_hive_serde_upper (A LONG, B LONG, C LONG) STORED AS orc 
LOCATION '/user/hive/warehouse/orc_data'

DESC EXTENDED orc_data_source_lower;
DESC EXTENDED orc_data_source_upper;
DESC EXTENDED orc_hive_serde_lower;
DESC EXTENDED orc_hive_serde_upper;

spark.conf.set("spark.sql.hive.convertMetastoreOrc", false)
{code}
 
||no.||caseSensitive||table columns||select column||orc column
 (select via data source table, hive impl)||orc column
 (select via data source table, native impl)||orc column
 (select via hive serde table)||
|1|true|a, b, c|a|a |a|IndexOutOfBoundsException |
|2| | |b|B |null|IndexOutOfBoundsException |
|3| | |c|c |c|IndexOutOfBoundsException |
|4| | |A|AnalysisException|AnalysisException|AnalysisException|
|5| | |B|AnalysisException|AnalysisException|AnalysisException|
|6| | |C|AnalysisException|AnalysisException|AnalysisException|
|7| |A, B, C|a|AnalysisException |AnalysisException|AnalysisException|
|8| | |b|AnalysisException |AnalysisException|AnalysisException |
|9| | |c|AnalysisException |AnalysisException|AnalysisException |
|10| | |A|a |null|IndexOutOfBoundsException |
|11| | |B|B |B|IndexOutOfBoundsException |
|12| | |C|c |C|IndexOutOfBoundsException |
|13|false|a, b, c|a|a |a|IndexOutOfBoundsException |
|14| | |b|B |B|IndexOutOfBoundsException |
|15| | |c|c |c|IndexOutOfBoundsException |
|16| | |A|a |a|IndexOutOfBoundsException |
|17| | |B|B |B|IndexOutOfBoundsException |
|18| | |C|c |c|IndexOutOfBoundsException |
|19| |A, B, C|a|a |a|IndexOutOfBoundsException |
|20| | |b|B |B|IndexOutOfBoundsException |
|21| | |c|c |c|IndexOutOfBoundsException |
|22| | |A|a |a|IndexOutOfBoundsException |
|23| | |B|B |B|IndexOutOfBoundsException |
|24| | |C|c |c|IndexOutOfBoundsException |

Followup tests that use ORC files with no duplicate fields (only a,B).
{code:java}
val data = spark.range(5).selectExpr("id as a", "id * 2 as B")
spark.conf.set("spark.sql.caseSensitive", true)
data.write.format("orc").mode("overwrite").save("/user/hive/warehouse/orc_data_nodup")

$> hive --orcfiledump 
/user/hive/warehouse/orc_data_nodup/part-1-4befd318-9ed5-4d77-b51b-09848d71d9cd-c000.snappy.orc
Structure for 
/user/hive/warehouse/orc_data_nodup/part-1-4befd318-9ed5-4d77-b51b-09848d71d9cd-c000.snappy.orc
Type: struct<a:bigint,B:bigint>

CREATE TABLE orc_nodup_hive_serde_lower (a LONG, b LONG) STORED AS orc LOCATION 
'/user/hive/warehouse/orc_data_nodup'
CREATE TABLE orc_nodup_hive_serde_upper (A LONG, B LONG) STORED AS orc LOCATION 
'/user/hive/warehouse/orc_data_nodup'

DESC EXTENDED orc_nodup_hive_serde_lower;
DESC EXTENDED orc_nodup_hive_serde_upper;

spark.conf.set("spark.sql.hive.convertMetastoreOrc", false)
{code}
||no.||caseSensitive||table columns||select column||orc column
 (select via hive serde table)||
|1|true|a, b|a|a|
|2| | |b|B|
|4| | |A|AnalysisException|
|5| | |B|AnalysisException|
|7| |A, B|a|AnalysisException|
|8| | |b|AnalysisException |
|10| | |A|a|
|11| | |B|B|
|13|false|a, b|a|a|
|14| | |b|B|
|16| | |A|a|
|17| | |B|B|
|19| |A, B|a|a|
|20| | |b|B|
|22| | |A|a|
|23| | |B|B|


was (Author: seancxmao):
Thorough investigation about ORC tables with duplicate fields (c and C).
{code:java}
val data = spark.range(5).selectExpr("id as a", "id * 2 as B", "id * 3 as c", 
"id * 4 as C")
spark.conf.set("spark.sql.caseSensitive", true)
data.write.format("orc").mode("overwrite").save("/user/hive/warehouse/orc_data")

$> hive --orcfiledump 
/user/hive/warehouse/orc_data/part-1-9716d241-9ad9-4d56-8de3-7bc482067614-c000.snappy.orc
Structure for 
/user/hive/warehouse/orc_data/part-1-9716d241-9ad9-4d56-8de3-7bc482067614-c000.snappy.orc
Type: struct<a:bigint,B:bigint,c:bigint,C:bigint>

CREATE TABLE orc_data_source_lower (a LONG, b LONG, c LONG) USING orc LOCATION 
'/user/hive/warehouse/orc_data'
CREATE TABLE orc_data_source_upper (A

[jira] [Created] (SPARK-25249) Add a unit test for OpenHashMap

2018-08-27 Thread liuxian (JIRA)
liuxian created SPARK-25249:
---

 Summary: Add a unit test for OpenHashMap
 Key: SPARK-25249
 URL: https://issues.apache.org/jira/browse/SPARK-25249
 Project: Spark
  Issue Type: Test
  Components: Tests
Affects Versions: 2.4.0
Reporter: liuxian


Adding a unit test for OpenHashMap; this can help developers distinguish 
between the values 0/0.0/0L and a non-existent value.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25175) Case-insensitive field resolution when reading from ORC

2018-08-27 Thread Chenxiao Mao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593185#comment-16593185
 ] 

Chenxiao Mao edited comment on SPARK-25175 at 8/27/18 7:24 AM:
---

Thorough investigation about ORC tables with duplicate fields (c and C).
{code:java}
val data = spark.range(5).selectExpr("id as a", "id * 2 as B", "id * 3 as c", 
"id * 4 as C")
spark.conf.set("spark.sql.caseSensitive", true)
data.write.format("orc").mode("overwrite").save("/user/hive/warehouse/orc_data")

$> hive --orcfiledump 
/user/hive/warehouse/orc_data/part-1-9716d241-9ad9-4d56-8de3-7bc482067614-c000.snappy.orc
Structure for 
/user/hive/warehouse/orc_data/part-1-9716d241-9ad9-4d56-8de3-7bc482067614-c000.snappy.orc
Type: struct<a:bigint,B:bigint,c:bigint,C:bigint>

CREATE TABLE orc_data_source_lower (a LONG, b LONG, c LONG) USING orc LOCATION 
'/user/hive/warehouse/orc_data'
CREATE TABLE orc_data_source_upper (A LONG, B LONG, C LONG) USING orc LOCATION 
'/user/hive/warehouse/orc_data'
CREATE TABLE orc_hive_serde_lower (a LONG, b LONG, c LONG) STORED AS orc 
LOCATION '/user/hive/warehouse/orc_data'
CREATE TABLE orc_hive_serde_upper (A LONG, B LONG, C LONG) STORED AS orc 
LOCATION '/user/hive/warehouse/orc_data'

DESC EXTENDED orc_data_source_lower;
DESC EXTENDED orc_data_source_upper;
DESC EXTENDED orc_hive_serde_lower;
DESC EXTENDED orc_hive_serde_upper;

spark.conf.set("spark.sql.hive.convertMetastoreOrc", false)
{code}
 
||no.||caseSensitive||table columns||select column||orc column
 (select via data source table, hive impl)||orc column
 (select via data source table, native impl)||orc column
 (select via hive serde table)||
|1|true|a, b, c|a|a |a|IndexOutOfBoundsException |
|2| | |b|B |null|IndexOutOfBoundsException |
|3| | |c|c |c|IndexOutOfBoundsException |
|4| | |A|AnalysisException|AnalysisException|AnalysisException|
|5| | |B|AnalysisException|AnalysisException|AnalysisException|
|6| | |C|AnalysisException|AnalysisException|AnalysisException|
|7| |A, B, C|a|AnalysisException |AnalysisException|AnalysisException|
|8| | |b|AnalysisException |AnalysisException|AnalysisException |
|9| | |c|AnalysisException |AnalysisException|AnalysisException |
|10| | |A|a |null|IndexOutOfBoundsException |
|11| | |B|B |B|IndexOutOfBoundsException |
|12| | |C|c |C|IndexOutOfBoundsException |
|13|false|a, b, c|a|a |a|IndexOutOfBoundsException |
|14| | |b|B |B|IndexOutOfBoundsException |
|15| | |c|c |c|IndexOutOfBoundsException |
|16| | |A|a |a|IndexOutOfBoundsException |
|17| | |B|B |B|IndexOutOfBoundsException |
|18| | |C|c |c|IndexOutOfBoundsException |
|19| |A, B, C|a|a |a|IndexOutOfBoundsException |
|20| | |b|B |B|IndexOutOfBoundsException |
|21| | |c|c |c|IndexOutOfBoundsException |
|22| | |A|a |a|IndexOutOfBoundsException |
|23| | |B|B |B|IndexOutOfBoundsException |
|24| | |C|c |c|IndexOutOfBoundsException |


was (Author: seancxmao):
Thorough investigation about ORC tables
{code:java}
val data = spark.range(5).selectExpr("id as a", "id * 2 as B", "id * 3 as c", 
"id * 4 as C")
spark.conf.set("spark.sql.caseSensitive", true)
data.write.format("orc").mode("overwrite").save("/user/hive/warehouse/orc_data")

$> hive --orcfiledump 
/user/hive/warehouse/orc_data/part-1-9716d241-9ad9-4d56-8de3-7bc482067614-c000.snappy.orc
Structure for 
/user/hive/warehouse/orc_data/part-1-9716d241-9ad9-4d56-8de3-7bc482067614-c000.snappy.orc
Type: struct<a:bigint,B:bigint,c:bigint,C:bigint>

CREATE TABLE orc_data_source_lower (a LONG, b LONG, c LONG) USING orc LOCATION 
'/user/hive/warehouse/orc_data'
CREATE TABLE orc_data_source_upper (A LONG, B LONG, C LONG) USING orc LOCATION 
'/user/hive/warehouse/orc_data'
CREATE TABLE orc_hive_serde_lower (a LONG, b LONG, c LONG) STORED AS orc 
LOCATION '/user/hive/warehouse/orc_data'
CREATE TABLE orc_hive_serde_upper (A LONG, B LONG, C LONG) STORED AS orc 
LOCATION '/user/hive/warehouse/orc_data'

DESC EXTENDED orc_data_source_lower;
DESC EXTENDED orc_data_source_upper;
DESC EXTENDED orc_hive_serde_lower;
DESC EXTENDED orc_hive_serde_upper;

spark.conf.set("spark.sql.hive.convertMetastoreOrc", false)
{code}
 
||no.||caseSensitive||table columns||select column||orc column
 (select via data source table, hive impl)||orc column
(select via data source table, native impl)||orc column
 (select via hive serde table)||
|1|true|a, b, c|a|a |a|IndexOutOfBoundsException |
|2| | |b|B |null|IndexOutOfBoundsException |
|3| | |c|c |c|IndexOutOfBoundsException |
|4| | |A|AnalysisException|AnalysisException|AnalysisException|
|5| | |B|AnalysisException|AnalysisException|AnalysisException|
|6| | |C|AnalysisException|AnalysisException|AnalysisException|
|7| |A, B, C|a|AnalysisException |AnalysisException|AnalysisException|
|8| | |b|AnalysisException |AnalysisException|AnalysisException |
|9| | |c|AnalysisException |AnalysisException|AnalysisException |
|10| | |A|a |null|IndexOutOfBoundsException |
|11| | |B|B |B|IndexOutOfBoundsException |
|12| | |C|c |C|IndexOutOfB

[jira] [Updated] (SPARK-25175) Case-insensitive field resolution when reading from ORC

2018-08-27 Thread Chenxiao Mao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chenxiao Mao updated SPARK-25175:
-
Description: 
SPARK-25132 adds support for case-insensitive field resolution when reading 
from Parquet files. We found ORC files have similar issues. Since Spark has two 
OrcFileFormat implementations, we should add support for both.
 * Since SPARK-2883, Spark supports ORC inside the sql/hive module with a Hive 
dependency. This hive OrcFileFormat always does case-insensitive field resolution 
regardless of case sensitivity mode. When there is ambiguity, hive 
OrcFileFormat always returns the lower case field, rather than failing the 
reading operation.
 * SPARK-20682 adds a new ORC data source inside sql/core. This native 
OrcFileFormat supports case-insensitive field resolution, however it cannot 
handle duplicate fields.

Besides data source tables, hive serde tables also have issues. When there is 
ambiguity, we just can't read hive serde tables.

  was:
SPARK-25132 adds support for case-insensitive field resolution when reading 
from Parquet files. We found ORC files have similar issues. Since Spark has two 
OrcFileFormat implementations, we should add support for both.
 * Since SPARK-2883, Spark supports ORC inside the sql/hive module with a Hive 
dependency. This hive OrcFileFormat always does case-insensitive field resolution 
regardless of case sensitivity mode. When there is ambiguity, hive 
OrcFileFormat always returns the lower case field, rather than failing the 
reading operation.
 * SPARK-20682 adds a new ORC data source inside sql/core. This native 
OrcFileFormat supports case-insensitive field resolution, however it cannot 
handle duplicate fields.

Besides data source tables, hive serde tables also have issues. When there i


> Case-insensitive field resolution when reading from ORC
> ---
>
> Key: SPARK-25175
> URL: https://issues.apache.org/jira/browse/SPARK-25175
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chenxiao Mao
>Priority: Major
>
> SPARK-25132 adds support for case-insensitive field resolution when reading 
> from Parquet files. We found ORC files have similar issues. Since Spark has two 
> OrcFileFormat implementations, we should add support for both.
>  * Since SPARK-2883, Spark supports ORC inside the sql/hive module with a Hive 
> dependency. This hive OrcFileFormat always does case-insensitive field 
> resolution regardless of case sensitivity mode. When there is ambiguity, hive 
> OrcFileFormat always returns the lower case field, rather than failing the 
> reading operation.
>  * SPARK-20682 adds a new ORC data source inside sql/core. This native 
> OrcFileFormat supports case-insensitive field resolution, however it cannot 
> handle duplicate fields.
> Besides data source tables, hive serde tables also have issues. When there is 
> ambiguity, we just can't read hive serde tables.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25175) Case-insensitive field resolution when reading from ORC

2018-08-27 Thread Chenxiao Mao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chenxiao Mao updated SPARK-25175:
-
Description: 
SPARK-25132 adds support for case-insensitive field resolution when reading 
from Parquet files. We found ORC files have similar issues. Since Spark has two 
OrcFileFormat implementations, we should add support for both.
 * Since SPARK-2883, Spark supports ORC inside the sql/hive module with a Hive 
dependency. This hive OrcFileFormat always does case-insensitive field resolution 
regardless of case sensitivity mode. When there is ambiguity, hive 
OrcFileFormat always returns the lower case field, rather than failing the 
reading operation.
 * SPARK-20682 adds a new ORC data source inside sql/core. This native 
OrcFileFormat supports case-insensitive field resolution, however it cannot 
handle duplicate fields.

Besides data source tables, hive serde tables also have issues. When there i

  was:
SPARK-25132 adds support for case-insensitive field resolution when reading 
from Parquet files. We found ORC files have similar issues. Since Spark has two 
OrcFileFormat implementations, we should add support for both.
* Since SPARK-2883, Spark supports ORC inside the sql/hive module with a Hive 
dependency. This hive OrcFileFormat doesn't support case-insensitive field 
resolution at all.
* SPARK-20682 adds a new ORC data source inside sql/core. This native 
OrcFileFormat supports case-insensitive field resolution, however it cannot 
handle duplicate fields.


> Case-insensitive field resolution when reading from ORC
> ---
>
> Key: SPARK-25175
> URL: https://issues.apache.org/jira/browse/SPARK-25175
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chenxiao Mao
>Priority: Major
>
> SPARK-25132 adds support for case-insensitive field resolution when reading 
> from Parquet files. We found ORC files have similar issues. Since Spark has two 
> OrcFileFormat implementations, we should add support for both.
>  * Since SPARK-2883, Spark supports ORC inside the sql/hive module with a Hive 
> dependency. This hive OrcFileFormat always does case-insensitive field 
> resolution regardless of case sensitivity mode. When there is ambiguity, hive 
> OrcFileFormat always returns the lower case field, rather than failing the 
> reading operation.
>  * SPARK-20682 adds a new ORC data source inside sql/core. This native 
> OrcFileFormat supports case-insensitive field resolution, however it cannot 
> handle duplicate fields.
> Besides data source tables, hive serde tables also have issues. When there i



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org