[jira] [Commented] (SPARK-21055) Support grouping__id
[ https://issues.apache.org/jira/browse/SPARK-21055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16213758#comment-16213758 ] Apache Spark commented on SPARK-21055: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/19546 > Support grouping__id > > > Key: SPARK-21055 > URL: https://issues.apache.org/jira/browse/SPARK-21055 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 > Environment: spark2.1.1 >Reporter: cen yuhai >Assignee: cen yuhai > Fix For: 2.3.0 > > > Currently, Spark doesn't support grouping__id; Spark provides another function, > grouping_id(), as a workaround. > If only grouping_id() is available, many scripts need to change, and supporting > grouping__id is very easy, why not? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
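A minimal illustration of the grouping_id() workaround mentioned in the description (my own sketch, not code from the ticket, assuming a Spark 2.x spark-shell; the sales table and its columns are made up):
{code}
// Hypothetical data, only to show the supported grouping_id() built-in that
// Hive-style grouping__id scripts would have to be rewritten against.
Seq(("US", "web", 10), ("US", "app", 5), ("DE", "web", 7))
  .toDF("country", "channel", "sales")
  .createOrReplaceTempView("sales")

spark.sql("""
  SELECT country, channel, grouping_id() AS gid, sum(sales) AS total
  FROM sales
  GROUP BY country, channel GROUPING SETS ((country, channel), (country), ())
""").show()
{code}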
[jira] [Resolved] (SPARK-22326) Remove unnecessary hashCode and equals methods
[ https://issues.apache.org/jira/browse/SPARK-22326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-22326. - Resolution: Fixed Assignee: Zhenhua Wang Fix Version/s: 2.3.0 > Remove unnecessary hashCode and equals methods > -- > > Key: SPARK-22326 > URL: https://issues.apache.org/jira/browse/SPARK-22326 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Zhenhua Wang >Assignee: Zhenhua Wang > Fix For: 2.3.0 > > > Plan equality should be computed by canonicalized, so we can remove > unnecessary hashCode and equals methods. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
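As a rough illustration of the ticket's premise (plan equality going through canonicalization), here is a small sketch of my own, assuming a Spark 2.x shell; it is not code from the ticket or the pull request:
{code}
// Two structurally equivalent queries built over the same base Dataset.
val base = spark.range(0, 100)
val q1 = base.where("id % 2 = 0")
val q2 = base.where("id % 2 = 0")

// sameResult is answered from the canonicalized forms of the plans, so plan
// nodes do not need hand-written hashCode/equals overrides for this to work.
q1.queryExecution.optimizedPlan.sameResult(q2.queryExecution.optimizedPlan)  // expected: true
{code}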
[jira] [Commented] (SPARK-21750) Use arrow 0.6.0
[ https://issues.apache.org/jira/browse/SPARK-21750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16213702#comment-16213702 ] Kazuaki Ishizaki commented on SPARK-21750: -- Thank you. Good to hear that. > Use arrow 0.6.0 > --- > > Key: SPARK-21750 > URL: https://issues.apache.org/jira/browse/SPARK-21750 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki >Priority: Minor > > Since [Arrow 0.6.0|http://arrow.apache.org/release/0.6.0.html] has been > released, use the latest one -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22326) Remove unnecessary hashCode and equals methods
[ https://issues.apache.org/jira/browse/SPARK-22326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22326: Assignee: Apache Spark > Remove unnecessary hashCode and equals methods > -- > > Key: SPARK-22326 > URL: https://issues.apache.org/jira/browse/SPARK-22326 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Zhenhua Wang >Assignee: Apache Spark > > Plan equality should be computed by canonicalized, so we can remove > unnecessary hashCode and equals methods. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22326) Remove unnecessary hashCode and equals methods
[ https://issues.apache.org/jira/browse/SPARK-22326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22326: Assignee: (was: Apache Spark) > Remove unnecessary hashCode and equals methods > -- > > Key: SPARK-22326 > URL: https://issues.apache.org/jira/browse/SPARK-22326 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Zhenhua Wang > > Plan equality should be computed by canonicalized, so we can remove > unnecessary hashCode and equals methods. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22326) Remove unnecessary hashCode and equals methods
Zhenhua Wang created SPARK-22326: Summary: Remove unnecessary hashCode and equals methods Key: SPARK-22326 URL: https://issues.apache.org/jira/browse/SPARK-22326 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: Zhenhua Wang Plan equality should be computed by canonicalized, so we can remove unnecessary hashCode and equals methods. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22326) Remove unnecessary hashCode and equals methods
[ https://issues.apache.org/jira/browse/SPARK-22326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16213666#comment-16213666 ] Apache Spark commented on SPARK-22326: -- User 'wzhfy' has created a pull request for this issue: https://github.com/apache/spark/pull/19539 > Remove unnecessary hashCode and equals methods > -- > > Key: SPARK-22326 > URL: https://issues.apache.org/jira/browse/SPARK-22326 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Zhenhua Wang > > Plan equality should be computed by canonicalized, so we can remove > unnecessary hashCode and equals methods. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22307) NOT condition working incorrectly
[ https://issues.apache.org/jira/browse/SPARK-22307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16213432#comment-16213432 ] kevin yu commented on SPARK-22307: -- It is correct behavior based on SQL standards, as Marco said. Your query has 623 records: 617 records are null, 2 records are 'true', and 4 records are 'false'. So not(col1) returns 4. > NOT condition working incorrectly > - > > Key: SPARK-22307 > URL: https://issues.apache.org/jira/browse/SPARK-22307 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0, 2.1.1 >Reporter: Andrey Yakovenko > Attachments: Catalog.json.gz > > > Suggested test case: a table with x records filtered by expression expr returns y > records (< x), but not(expr) does not return x-y records. Workaround: > when(expr, false).otherwise(true) is working fine. > {code} > val ctg = spark.sqlContext.read.json("/user/Catalog.json") > scala> ctg.printSchema > root > |-- Id: string (nullable = true) > |-- Name: string (nullable = true) > |-- Parent: struct (nullable = true) > ||-- Id: string (nullable = true) > ||-- Name: string (nullable = true) > ||-- Parent: struct (nullable = true) > |||-- Id: string (nullable = true) > |||-- Name: string (nullable = true) > |||-- Parent: struct (nullable = true) > ||||-- Id: string (nullable = true) > ||||-- Name: string (nullable = true) > ||||-- Parent: string (nullable = true) > ||||-- SKU: string (nullable = true) > |||-- SKU: string (nullable = true) > ||-- SKU: string (nullable = true) > |-- SKU: string (nullable = true) > val col1 = expr("Id IN ('13MXIIAA4', '13MXIBAA4')) OR (Parent.Id IN > ('13MXIIAA4', '13MXIBAA4'))) OR (Parent.Parent.Id IN ('13MXIIAA4', > '13MXIBAA4'))) OR (Parent.Parent.Parent.Id IN ('13MXIIAA4', '13MXIBAA4')))") > col1: org.apache.spark.sql.Column = Id IN (13MXIIAA4, 13MXIBAA4)) OR > (Parent.Id IN (13MXIIAA4, 13MXIBAA4))) OR (Parent.Parent.Id IN (13MXIIAA4, > 13MXIBAA4))) OR (Parent.Parent.Parent.Id IN (13MXIIAA4, 13MXIBAA4))) > scala> ctg.count > res5: Long = 623 > scala> ctg.filter(col1).count > res2: Long = 2 > scala> ctg.filter(not(col1)).count > res3: Long = 4 > scala> ctg.filter(when(col1, false).otherwise(true)).count > res4: Long = 621 > {code} > The table is a hierarchy-like structure and has records with different numbers of > levels filled in. I have a suspicion that, due to the partly filled hierarchy, the > condition sometimes returns null/undefined/failed/NaN (neither true nor false). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
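For context, a minimal, self-contained sketch of the three-valued logic behind these counts (my own example, not the reporter's Catalog.json; it assumes a Spark 2.x spark-shell and made-up column names and values):
{code}
// NULL results from a predicate are dropped by both filter(cond) and
// filter(not(cond)), which is why the two counts don't add up to the total.
import org.apache.spark.sql.functions.{expr, not, when}

val df = Seq((Some(1), "a"), (Some(2), "b"), (None, "c"), (None, "d")).toDF("id", "name")
val cond = expr("id IN (1)")                           // NULL id => predicate evaluates to NULL

df.filter(cond).count()                                // 1 (rows where cond is TRUE)
df.filter(not(cond)).count()                           // 1 (rows where cond is FALSE; NULL rows dropped)
df.filter(when(cond, false).otherwise(true)).count()   // 3 (NULL treated as "not matched")
{code}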
[jira] [Assigned] (SPARK-21929) Support `ALTER TABLE table_name ADD COLUMNS(..)` for ORC data source
[ https://issues.apache.org/jira/browse/SPARK-21929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21929: Assignee: (was: Apache Spark) > Support `ALTER TABLE table_name ADD COLUMNS(..)` for ORC data source > > > Key: SPARK-21929 > URL: https://issues.apache.org/jira/browse/SPARK-21929 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Dongjoon Hyun > > SPARK-19261 implemented `ADD COLUMNS` at Spark 2.2, but ORC data source is > not supported due to its limit. > {code} > scala> sql("CREATE TABLE tab (c1 int, c2 int, c3 int) USING ORC PARTITIONED > BY (c3)") > scala> sql("ALTER TABLE tab ADD COLUMNS (c4 int)") > org.apache.spark.sql.AnalysisException: > ALTER ADD COLUMNS does not support datasource table with type ORC. > You must drop and re-create the table for adding the new columns. Tables: > `tab`; > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21929) Support `ALTER TABLE table_name ADD COLUMNS(..)` for ORC data source
[ https://issues.apache.org/jira/browse/SPARK-21929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16213272#comment-16213272 ] Apache Spark commented on SPARK-21929: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/19545 > Support `ALTER TABLE table_name ADD COLUMNS(..)` for ORC data source > > > Key: SPARK-21929 > URL: https://issues.apache.org/jira/browse/SPARK-21929 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Dongjoon Hyun > > SPARK-19261 implemented `ADD COLUMNS` at Spark 2.2, but ORC data source is > not supported due to its limit. > {code} > scala> sql("CREATE TABLE tab (c1 int, c2 int, c3 int) USING ORC PARTITIONED > BY (c3)") > scala> sql("ALTER TABLE tab ADD COLUMNS (c4 int)") > org.apache.spark.sql.AnalysisException: > ALTER ADD COLUMNS does not support datasource table with type ORC. > You must drop and re-create the table for adding the new columns. Tables: > `tab`; > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21929) Support `ALTER TABLE table_name ADD COLUMNS(..)` for ORC data source
[ https://issues.apache.org/jira/browse/SPARK-21929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21929: Assignee: Apache Spark > Support `ALTER TABLE table_name ADD COLUMNS(..)` for ORC data source > > > Key: SPARK-21929 > URL: https://issues.apache.org/jira/browse/SPARK-21929 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark > > SPARK-19261 implemented `ADD COLUMNS` at Spark 2.2, but ORC data source is > not supported due to its limit. > {code} > scala> sql("CREATE TABLE tab (c1 int, c2 int, c3 int) USING ORC PARTITIONED > BY (c3)") > scala> sql("ALTER TABLE tab ADD COLUMNS (c4 int)") > org.apache.spark.sql.AnalysisException: > ALTER ADD COLUMNS does not support datasource table with type ORC. > You must drop and re-create the table for adding the new columns. Tables: > `tab`; > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
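For contrast with the quoted failure, a small sketch (my own, assuming a Spark 2.2 spark-shell) of the same DDL against a data source where SPARK-19261 already supports ADD COLUMNS; the table names are made up:
{code}
// Works in 2.2 for Parquet-backed data source tables (SPARK-19261).
sql("CREATE TABLE tab_parquet (c1 INT, c2 INT, c3 INT) USING PARQUET PARTITIONED BY (c3)")
sql("ALTER TABLE tab_parquet ADD COLUMNS (c4 INT)")

// Fails in 2.2 for ORC with the AnalysisException quoted above; this ticket
// tracks lifting that restriction.
sql("CREATE TABLE tab_orc (c1 INT, c2 INT, c3 INT) USING ORC PARTITIONED BY (c3)")
sql("ALTER TABLE tab_orc ADD COLUMNS (c4 INT)")
{code}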
[jira] [Comment Edited] (SPARK-11844) can not read class org.apache.parquet.format.PageHeader: don't know what type: 13
[ https://issues.apache.org/jira/browse/SPARK-11844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16213216#comment-16213216 ] Cosmin Lehene edited comment on SPARK-11844 at 10/20/17 8:59 PM: - I'm seeing these with spark-2.2.1-SNAPSHOT {noformat} java.io.IOException: can not read class org.apache.parquet.format.PageHeader: don't know what type: 14 at org.apache.parquet.format.Util.read(Util.java:216) at org.apache.parquet.format.Util.readPageHeader(Util.java:65) at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readPageHeader(ParquetFileReader.java:835) at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:849) at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:700) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:270) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:225) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137) at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:166) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at org.apache.spark.scheduler.Task.run(Task.scala:108) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException: don't know what type: 14 at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.getTType(TCompactProtocol.java:806) at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.readFieldBegin(TCompactProtocol.java:500) at org.apache.parquet.format.InterningProtocol.readFieldBegin(InterningProtocol.java:158) at org.apache.parquet.format.PageHeader.read(PageHeader.java:828) at org.apache.parquet.format.Util.read(Util.java:213) ... 24 more {noformat} Also they seem accompanied by {noformat} java.io.IOException: can not read class org.apache.parquet.format.PageHeader: Required field 'uncompressed_page_size' was not found in serialized data! 
Struct: PageHeader(type:null, uncompressed_page_size:0, compressed_page_size:0) at org.apache.parquet.format.Util.read(Util.java:216) at org.apache.parquet.format.Util.readPageHeader(Util.java:65) at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readPageHeader(ParquetFileReader.java:835) at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:849) at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:700) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:270) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:225) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137) at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105) at org.apache.spark.sql.execution.datasources.FileSc
[jira] [Commented] (SPARK-11844) can not read class org.apache.parquet.format.PageHeader: don't know what type: 13
[ https://issues.apache.org/jira/browse/SPARK-11844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16213216#comment-16213216 ] Cosmin Lehene commented on SPARK-11844: --- I'm seeing these with spark-2.2.1-SNAPSHOT {noformat} java.io.IOException: can not read class org.apache.parquet.format.PageHeader: don't know what type: 14 at org.apache.parquet.format.Util.read(Util.java:216) at org.apache.parquet.format.Util.readPageHeader(Util.java:65) at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readPageHeader(ParquetFileReader.java:835) at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:849) at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:700) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:270) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:225) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137) at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:166) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at org.apache.spark.scheduler.Task.run(Task.scala:108) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException: don't know what type: 14 at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.getTType(TCompactProtocol.java:806) at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.readFieldBegin(TCompactProtocol.java:500) at org.apache.parquet.format.InterningProtocol.readFieldBegin(InterningProtocol.java:158) at org.apache.parquet.format.PageHeader.read(PageHeader.java:828) at org.apache.parquet.format.Util.read(Util.java:213) ... 
24 more {noformat} > can not read class org.apache.parquet.format.PageHeader: don't know what > type: 13 > - > > Key: SPARK-11844 > URL: https://issues.apache.org/jira/browse/SPARK-11844 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Priority: Minor > > I got the following error once when I was running a query > {code} > java.io.IOException: can not read class org.apache.parquet.format.PageHeader: > don't know what type: 13 > at org.apache.parquet.format.Util.read(Util.java:216) > at org.apache.parquet.format.Util.readPageHeader(Util.java:65) > at > org.apache.parquet.hadoop.ParquetFileReader$Chunk.readPageHeader(ParquetFileReader.java:534) > at > org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:546) > at > org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:496) > at > org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:127) > at > org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:208) > at > org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201) > at > org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:168) > a
[jira] [Comment Edited] (SPARK-22323) Design doc for different types of pandas_udf
[ https://issues.apache.org/jira/browse/SPARK-22323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16213182#comment-16213182 ] Bryan Cutler edited comment on SPARK-22323 at 10/20/17 8:30 PM: Is this meant to be a user doc? was (Author: bryanc): I this meant to be a user doc? > Design doc for different types of pandas_udf > > > Key: SPARK-22323 > URL: https://issues.apache.org/jira/browse/SPARK-22323 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Li Jin > Fix For: 2.3.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22323) Design doc for different types of pandas_udf
[ https://issues.apache.org/jira/browse/SPARK-22323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16213182#comment-16213182 ] Bryan Cutler commented on SPARK-22323: -- Is this meant to be a user doc? > Design doc for different types of pandas_udf > > > Key: SPARK-22323 > URL: https://issues.apache.org/jira/browse/SPARK-22323 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Li Jin > Fix For: 2.3.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22323) Design doc for different types of pandas_udf
[ https://issues.apache.org/jira/browse/SPARK-22323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16213167#comment-16213167 ] Li Jin commented on SPARK-22323: Bryan I think Spark-1 is more about types between spark/arrow, like time stamp caveats. This is only meant to cover different types of pandas udf. > Design doc for different types of pandas_udf > > > Key: SPARK-22323 > URL: https://issues.apache.org/jira/browse/SPARK-22323 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Li Jin > Fix For: 2.3.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22323) Design doc for different types of pandas_udf
[ https://issues.apache.org/jira/browse/SPARK-22323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16213151#comment-16213151 ] Bryan Cutler commented on SPARK-22323: -- Should I close SPARK-1 since it looks like the docs will be covered here? > Design doc for different types of pandas_udf > > > Key: SPARK-22323 > URL: https://issues.apache.org/jira/browse/SPARK-22323 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Li Jin > Fix For: 2.3.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Deleted] (SPARK-22325) SPARK_TESTING env variable breaking non-spark builds on amplab jenkins
[ https://issues.apache.org/jira/browse/SPARK-22325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin deleted SPARK-22325: > SPARK_TESTING env variable breaking non-spark builds on amplab jenkins > -- > > Key: SPARK-22325 > URL: https://issues.apache.org/jira/browse/SPARK-22325 > Project: Spark > Issue Type: Bug > Environment: riselab jenkins, all workers (ubuntu & centos) >Reporter: shane knapp >Priority: Critical > > in the riselab jenkins master config, the SPARK_TESTING environment variable > is set to 1 and applied to all workers. > see: > https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/testing-foo/9/console > (the 'echo 1' is actually 'echo $SPARK_TESTING') > and: > https://amplab.cs.berkeley.edu/jenkins/job/testing-foo/10/injectedEnvVars/ > this is problematic, as some of our lab builds are attempting to run pyspark > as part of the build process, and the hard-coded checks for SPARK_TESTING in > the setup scripts are causing hard failures. > see: > https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/2440/HADOOP_VERSION=2.6.2,SCALAVER=2.11,SPARK_VERSION=2.2.0,label=centos/consoleFull > i would strongly suggest that we do the following: > * remove the SPARK_TESTING environment variable declaration in the jenkins > config > * add the environment variable to each spark build config in github: > https://github.com/databricks/spark-jenkins-configurations/ > * add the environment variable to SparkPullRequstBuilder and > NewSparkPullRequestBuilder -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22325) SPARK_TESTING env variable breaking non-spark builds on amplab jenkins
[ https://issues.apache.org/jira/browse/SPARK-22325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shane knapp updated SPARK-22325: Description: in the riselab jenkins master config, the SPARK_TESTING environment variable is set to 1 and applied to all workers. see: https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/testing-foo/9/console (the 'echo 1' is actually 'echo $SPARK_TESTING') and: https://amplab.cs.berkeley.edu/jenkins/job/testing-foo/10/injectedEnvVars/ this is problematic, as some of our lab builds are attempting to run pyspark as part of the build process, and the hard-coded checks for SPARK_TESTING in the setup scripts are causing hard failures. see: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/2440/HADOOP_VERSION=2.6.2,SCALAVER=2.11,SPARK_VERSION=2.2.0,label=centos/consoleFull i would strongly suggest that we do the following: * remove the SPARK_TESTING environment variable declaration in the jenkins config * add the environment variable to each spark build config in github: https://github.com/databricks/spark-jenkins-configurations/ * add the environment variable to SparkPullRequstBuilder and NewSparkPullRequestBuilder was: in the riselab jenkins master config, the SPARK_TESTING environment variable is set to 1 and applied to all workers. see: https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/testing-foo/9/console (the 'echo 1' is actually 'echo $SPARK_TESTING') this is problematic, as some of our lab builds are attempting to run pyspark as part of the build process, and the hard-coded checks for SPARK_TESTING in the setup scripts are causing hard failures. see: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/2440/HADOOP_VERSION=2.6.2,SCALAVER=2.11,SPARK_VERSION=2.2.0,label=centos/consoleFull i would strongly suggest that we do the following: * remove the SPARK_TESTING environment variable declaration in the jenkins config * add the environment variable to each spark build config in github: https://github.com/databricks/spark-jenkins-configurations/ * add the environment variable to SparkPullRequstBuilder and NewSparkPullRequestBuilder > SPARK_TESTING env variable breaking non-spark builds on amplab jenkins > -- > > Key: SPARK-22325 > URL: https://issues.apache.org/jira/browse/SPARK-22325 > Project: Spark > Issue Type: Bug > Components: Build, Project Infra >Affects Versions: 2.2.0 > Environment: riselab jenkins, all workers (ubuntu & centos) >Reporter: shane knapp >Priority: Critical > > in the riselab jenkins master config, the SPARK_TESTING environment variable > is set to 1 and applied to all workers. > see: > https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/testing-foo/9/console > (the 'echo 1' is actually 'echo $SPARK_TESTING') > and: > https://amplab.cs.berkeley.edu/jenkins/job/testing-foo/10/injectedEnvVars/ > this is problematic, as some of our lab builds are attempting to run pyspark > as part of the build process, and the hard-coded checks for SPARK_TESTING in > the setup scripts are causing hard failures. 
> see: > https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/2440/HADOOP_VERSION=2.6.2,SCALAVER=2.11,SPARK_VERSION=2.2.0,label=centos/consoleFull > i would strongly suggest that we do the following: > * remove the SPARK_TESTING environment variable declaration in the jenkins > config > * add the environment variable to each spark build config in github: > https://github.com/databricks/spark-jenkins-configurations/ > * add the environment variable to SparkPullRequstBuilder and > NewSparkPullRequestBuilder -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22325) SPARK_TESTING env variable breaking non-spark builds on amplab jenkins
shane knapp created SPARK-22325: --- Summary: SPARK_TESTING env variable breaking non-spark builds on amplab jenkins Key: SPARK-22325 URL: https://issues.apache.org/jira/browse/SPARK-22325 Project: Spark Issue Type: Bug Components: Build, Project Infra Affects Versions: 2.2.0 Environment: riselab jenkins, all workers (ubuntu & centos) Reporter: shane knapp Priority: Critical in the riselab jenkins master config, the SPARK_TESTING environment variable is set to 1 and applied to all workers. see: https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/testing-foo/9/console (the 'echo 1' is actually 'echo $SPARK_TESTING') this is problematic, as some of our lab builds are attempting to run pyspark as part of the build process, and the hard-coded checks for SPARK_TESTING in the setup scripts are causing hard failures. see: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/2440/HADOOP_VERSION=2.6.2,SCALAVER=2.11,SPARK_VERSION=2.2.0,label=centos/consoleFull i would strongly suggest that we do the following: * remove the SPARK_TESTING environment variable declaration in the jenkins config * add the environment variable to each spark build config in github: https://github.com/databricks/spark-jenkins-configurations/ * add the environment variable to SparkPullRequstBuilder and NewSparkPullRequestBuilder -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-21750) Use arrow 0.6.0
[ https://issues.apache.org/jira/browse/SPARK-21750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-21750. - SPARK-22324 supersedes this. > Use arrow 0.6.0 > --- > > Key: SPARK-21750 > URL: https://issues.apache.org/jira/browse/SPARK-21750 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki >Priority: Minor > > Since [Arrow 0.6.0|http://arrow.apache.org/release/0.6.0.html] has been > released, use the latest one -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21750) Use arrow 0.6.0
[ https://issues.apache.org/jira/browse/SPARK-21750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16213016#comment-16213016 ] Dongjoon Hyun commented on SPARK-21750: --- Great! Thanks, [~bryanc]. I'll monitor SPARK-22324. > Use arrow 0.6.0 > --- > > Key: SPARK-21750 > URL: https://issues.apache.org/jira/browse/SPARK-21750 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki >Priority: Minor > > Since [Arrow 0.6.0|http://arrow.apache.org/release/0.6.0.html] has been > released, use the latest one -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-22296) CodeGenerator - failed to compile when constructor has scala.collection.mutable.Seq vs. scala.collection.Seq
[ https://issues.apache.org/jira/browse/SPARK-22296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16213015#comment-16213015 ] Randy Tidd edited comment on SPARK-22296 at 10/20/17 6:34 PM: -- Thank you I was just installing 2.2.0 and related components to try this but you beat me to it. Glad to hear it's fixed. Note that the example provided above by Liang-Chi Hsieh does not exhibit the problem in 2.1.0 or 2.1.1. was (Author: tiddman): Thank you I was just installing 2.2.0 and related components to try this but you beat me to it. Glad to hear it's fixed. > CodeGenerator - failed to compile when constructor has > scala.collection.mutable.Seq vs. scala.collection.Seq > > > Key: SPARK-22296 > URL: https://issues.apache.org/jira/browse/SPARK-22296 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 2.1.0 >Reporter: Randy Tidd > > This is with Scala 2.11. > We have a case class that has a constructor with 85 args, the last two of > which are: > var chargesInst : > scala.collection.mutable.Seq[ChargeInstitutional] = > scala.collection.mutable.Seq.empty[ChargeInstitutional], > var chargesProf : > scala.collection.mutable.Seq[ChargeProfessional] = > scala.collection.mutable.Seq.empty[ChargeProfessional] > A mutable Seq in a the constructor of a case class is probably poor form but > Scala allows it. When we run this job we get this error: > build 17-Oct-2017 05:30:502017-10-17 09:30:50 [Executor task launch > worker-1] ERROR o.a.s.s.c.e.codegen.CodeGenerator - failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 8217, Column 70: No applicable constructor/method found for actual parameters > "java.lang.String, java.lang.String, long, java.lang.String, long, long, > long, java.lang.String, long, long, double, scala.Option, scala.Option, > java.lang.String, java.lang.String, long, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, int, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, long, long, long, long, long, > scala.Option, scala.Option, scala.Option, scala.Option, scala.Option, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, long, java.lang.String, int, double, > double, java.lang.String, java.lang.String, java.lang.String, long, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, long, > long, long, long, java.lang.String, com.xyz.xyz.xyz.domain.Patient, > com.xyz.xyz.xyz.domain.Physician, scala.collection.Seq, scala.collection.Seq, > java.lang.String, long, java.lang.String, int, int, boolean, boolean, > scala.collection.Seq, boolean, scala.collection.Seq, boolean, > scala.collection.Seq, scala.collection.Seq"; candidates are: > "com.xyz.xyz.xyz.domain.Account(java.lang.String, java.lang.String, long, > java.lang.String, long, long, long, java.lang.String, long, long, double, > scala.Option, scala.Option, java.lang.String, java.lang.String, long, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, int, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, long, > long, long, long, long, scala.Option, scala.Option, 
scala.Option, > scala.Option, scala.Option, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, long, > java.lang.String, int, double, double, java.lang.String, java.lang.String, > java.lang.String, long, java.lang.String, java.lang.String, java.lang.String, > java.lang.String, long, long, long, long, java.lang.String, > com.xyz.xyz.xyz.domain.Patient, com.xyz.xyz.xyz.domain.Physician, > scala.collection.Seq, scala.collection.Seq, java.lang.String, long, > java.lang.String, int, int, boolean, boolean, scala.collection.Seq, boolean, > scala.collection.Seq, boolean, scala.collection.mutable.Seq, > scala.collection.mutable.Seq)" > The relevant lines are: > build 17-Oct-2017 05:30:50/* 093 */ private scala.collection.Seq > argValue84; > build 17-Oct-2017 05:30:50/* 094 */ private scala.collection.Seq > argValue85; > and > build 17-Oct-2017 05:30:54/* 8217 */ final > com.xyz.xyz.xyz.domain.Account value1 = false ? null : new > com.xyz.xyz.xyz.domain.Account(argValue2, argValue3
[jira] [Commented] (SPARK-22296) CodeGenerator - failed to compile when constructor has scala.collection.mutable.Seq vs. scala.collection.Seq
[ https://issues.apache.org/jira/browse/SPARK-22296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16213015#comment-16213015 ] Randy Tidd commented on SPARK-22296: Thank you I was just installing 2.2.0 and related components to try this but you beat me to it. Glad to hear it's fixed. > CodeGenerator - failed to compile when constructor has > scala.collection.mutable.Seq vs. scala.collection.Seq > > > Key: SPARK-22296 > URL: https://issues.apache.org/jira/browse/SPARK-22296 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 2.1.0 >Reporter: Randy Tidd > > This is with Scala 2.11. > We have a case class that has a constructor with 85 args, the last two of > which are: > var chargesInst : > scala.collection.mutable.Seq[ChargeInstitutional] = > scala.collection.mutable.Seq.empty[ChargeInstitutional], > var chargesProf : > scala.collection.mutable.Seq[ChargeProfessional] = > scala.collection.mutable.Seq.empty[ChargeProfessional] > A mutable Seq in a the constructor of a case class is probably poor form but > Scala allows it. When we run this job we get this error: > build 17-Oct-2017 05:30:502017-10-17 09:30:50 [Executor task launch > worker-1] ERROR o.a.s.s.c.e.codegen.CodeGenerator - failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 8217, Column 70: No applicable constructor/method found for actual parameters > "java.lang.String, java.lang.String, long, java.lang.String, long, long, > long, java.lang.String, long, long, double, scala.Option, scala.Option, > java.lang.String, java.lang.String, long, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, int, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, long, long, long, long, long, > scala.Option, scala.Option, scala.Option, scala.Option, scala.Option, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, long, java.lang.String, int, double, > double, java.lang.String, java.lang.String, java.lang.String, long, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, long, > long, long, long, java.lang.String, com.xyz.xyz.xyz.domain.Patient, > com.xyz.xyz.xyz.domain.Physician, scala.collection.Seq, scala.collection.Seq, > java.lang.String, long, java.lang.String, int, int, boolean, boolean, > scala.collection.Seq, boolean, scala.collection.Seq, boolean, > scala.collection.Seq, scala.collection.Seq"; candidates are: > "com.xyz.xyz.xyz.domain.Account(java.lang.String, java.lang.String, long, > java.lang.String, long, long, long, java.lang.String, long, long, double, > scala.Option, scala.Option, java.lang.String, java.lang.String, long, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, int, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, long, > long, long, long, long, scala.Option, scala.Option, scala.Option, > scala.Option, scala.Option, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, long, > java.lang.String, int, double, double, java.lang.String, java.lang.String, > java.lang.String, long, java.lang.String, 
java.lang.String, java.lang.String, > java.lang.String, long, long, long, long, java.lang.String, > com.xyz.xyz.xyz.domain.Patient, com.xyz.xyz.xyz.domain.Physician, > scala.collection.Seq, scala.collection.Seq, java.lang.String, long, > java.lang.String, int, int, boolean, boolean, scala.collection.Seq, boolean, > scala.collection.Seq, boolean, scala.collection.mutable.Seq, > scala.collection.mutable.Seq)" > The relevant lines are: > build 17-Oct-2017 05:30:50/* 093 */ private scala.collection.Seq > argValue84; > build 17-Oct-2017 05:30:50/* 094 */ private scala.collection.Seq > argValue85; > and > build 17-Oct-2017 05:30:54/* 8217 */ final > com.xyz.xyz.xyz.domain.Account value1 = false ? null : new > com.xyz.xyz.xyz.domain.Account(argValue2, argValue3, argValue4, argValue5, > argValue6, argValue7, argValue8, argValue9, argValue10, argValue11, > argValue12, argValue13, argValue14, argValue15, argValue16, argValue17, > argValue18, argValue19, argValue20, argValue21, argValue22, argValue23, > argValue24, argValue25, argValue26, argValue27, argValue
[jira] [Commented] (SPARK-21750) Use arrow 0.6.0
[ https://issues.apache.org/jira/browse/SPARK-21750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16213012#comment-16213012 ] Bryan Cutler commented on SPARK-21750: -- Thanks [~dongjoon], I opened SPARK-22324 under the Arrow epic so we can start discussing an upgrade to 0.8.0 which is slated for early Nov. > Use arrow 0.6.0 > --- > > Key: SPARK-21750 > URL: https://issues.apache.org/jira/browse/SPARK-21750 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki >Priority: Minor > > Since [Arrow 0.6.0|http://arrow.apache.org/release/0.6.0.html] has been > released, use the latest one -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22324) Upgrade Arrow to version 0.8.0
Bryan Cutler created SPARK-22324: Summary: Upgrade Arrow to version 0.8.0 Key: SPARK-22324 URL: https://issues.apache.org/jira/browse/SPARK-22324 Project: Spark Issue Type: Sub-task Components: PySpark, SQL Affects Versions: 2.3.0 Reporter: Bryan Cutler Arrow version 0.8.0 is slated for release in early November, but I'd like to start discussing now to help get all the work that's being done synced up. Along with upgrading the Arrow Java artifacts, pyarrow on our Jenkins test envs will need to be upgraded as well, which will take a fair amount of work and planning. One topic I'd like to discuss is whether pyarrow should be an installation requirement for pyspark, i.e. when a user pip installs pyspark, it will also install pyarrow. If not, then is there a minimum version that needs to be supported? We currently have 0.4.1 installed on Jenkins. There are a number of improvements and cleanups in the current code that can happen depending on what we decide (I'll link them all here later, but off the top of my head): * Decimal bug fix and improved support * Improved internal casting between pyarrow and pandas (can clean up some workarounds) * Better type checking when converting Spark types to Arrow * Timestamp conversion to microseconds (for Spark internal format) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22296) CodeGenerator - failed to compile when constructor has scala.collection.mutable.Seq vs. scala.collection.Seq
[ https://issues.apache.org/jira/browse/SPARK-22296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212999#comment-16212999 ] Dongjoon Hyun commented on SPARK-22296: --- [~tiddman]. Please refer the following in 2.2. This is resolved as mentioned in the above comments. {code} scala> case class Foo1(x: Int, s: String, seq: scala.collection.mutable.Seq[Int] = scala.collection.mutable.Seq.empty[Int]) defined class Foo1 scala> val ds1 = Seq(Foo1(1, "a"), Foo1(2, "b")).toDS ds1: org.apache.spark.sql.Dataset[Foo1] = [x: int, s: string ... 1 more field] scala> case class Foo2(x: Int, s: String) defined class Foo2 scala> val ds2 = Seq(Foo2(1, "aa"), Foo2(3, "cc")).toDS ds2: org.apache.spark.sql.Dataset[Foo2] = [x: int, s: string] scala> ds1.joinWith(ds2, ds1.col("x") === ds2.col("x")).collect() res0: Array[(Foo1, Foo2)] = Array((Foo1(1,a,ArrayBuffer()),Foo2(1,aa))) scala> spark.version res1: String = 2.2.0 {code} > CodeGenerator - failed to compile when constructor has > scala.collection.mutable.Seq vs. scala.collection.Seq > > > Key: SPARK-22296 > URL: https://issues.apache.org/jira/browse/SPARK-22296 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 2.1.0 >Reporter: Randy Tidd > > This is with Scala 2.11. > We have a case class that has a constructor with 85 args, the last two of > which are: > var chargesInst : > scala.collection.mutable.Seq[ChargeInstitutional] = > scala.collection.mutable.Seq.empty[ChargeInstitutional], > var chargesProf : > scala.collection.mutable.Seq[ChargeProfessional] = > scala.collection.mutable.Seq.empty[ChargeProfessional] > A mutable Seq in a the constructor of a case class is probably poor form but > Scala allows it. When we run this job we get this error: > build 17-Oct-2017 05:30:502017-10-17 09:30:50 [Executor task launch > worker-1] ERROR o.a.s.s.c.e.codegen.CodeGenerator - failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 8217, Column 70: No applicable constructor/method found for actual parameters > "java.lang.String, java.lang.String, long, java.lang.String, long, long, > long, java.lang.String, long, long, double, scala.Option, scala.Option, > java.lang.String, java.lang.String, long, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, int, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, long, long, long, long, long, > scala.Option, scala.Option, scala.Option, scala.Option, scala.Option, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, long, java.lang.String, int, double, > double, java.lang.String, java.lang.String, java.lang.String, long, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, long, > long, long, long, java.lang.String, com.xyz.xyz.xyz.domain.Patient, > com.xyz.xyz.xyz.domain.Physician, scala.collection.Seq, scala.collection.Seq, > java.lang.String, long, java.lang.String, int, int, boolean, boolean, > scala.collection.Seq, boolean, scala.collection.Seq, boolean, > scala.collection.Seq, scala.collection.Seq"; candidates are: > "com.xyz.xyz.xyz.domain.Account(java.lang.String, java.lang.String, long, > java.lang.String, long, long, long, java.lang.String, long, long, double, > scala.Option, scala.Option, java.lang.String, java.lang.String, long, > 
java.lang.String, java.lang.String, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, int, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, long, > long, long, long, long, scala.Option, scala.Option, scala.Option, > scala.Option, scala.Option, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, long, > java.lang.String, int, double, double, java.lang.String, java.lang.String, > java.lang.String, long, java.lang.String, java.lang.String, java.lang.String, > java.lang.String, long, long, long, long, java.lang.String, > com.xyz.xyz.xyz.domain.Patient, com.xyz.xyz.xyz.domain.Physician, > scala.collection.Seq, scala.collection.Seq, java.lang.String, long, > java.lang.String, int, int, boolean, boolean, scala.collection.Seq, boolean, > scala.collection.Seq, boolean, scala.collection.mutable.Seq, > scala.collection.mutable.Seq)" > The relevant lines are: > build 17-Oct-2017 05:30:50
[jira] [Comment Edited] (SPARK-22296) CodeGenerator - failed to compile when constructor has scala.collection.mutable.Seq vs. scala.collection.Seq
[ https://issues.apache.org/jira/browse/SPARK-22296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212983#comment-16212983 ] Randy Tidd edited comment on SPARK-22296 at 10/20/17 6:17 PM: -- The example above does not exhibit the problem. Here is a concise example that does. Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 17/10/20 14:09:04 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 17/10/20 14:09:08 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException Spark context Web UI available at http://10.10.43.134:4040 Spark context available as 'sc' (master = local[*], app id = local-1508522945163). Spark session available as 'spark'. Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.1.0 /_/ Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131) Type in expressions to have them evaluated. Type :help for more information. scala> case class Foo1(x: Int, s: String, seq: scala.collection.mutable.Seq[Int] = scala.collection.mutable.Seq.empty[Int]) defined class Foo1 scala> val ds1 = Seq(Foo1(1, "a"), Foo1(2, "b")).toDS ds1: org.apache.spark.sql.Dataset[Foo1] = [x: int, s: string ... 1 more field] scala> scala> case class Foo2(x: Int, s: String) defined class Foo2 scala> val ds2 = Seq(Foo2(1, "aa"), Foo2(3, "cc")).toDS ds2: org.apache.spark.sql.Dataset[Foo2] = [x: int, s: string] scala> scala> ds1.joinWith(ds2, ds1.col("x") === ds2.col("x")).collect() 17/10/20 14:09:14 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 263, Column 69: No applicable constructor/method found for actual parameters "int, java.lang.String, scala.collection.Seq"; candidates are: "$line14.$read$$iw$$iw$Foo1(int, java.lang.String, scala.collection.mutable.Seq)" /* 001 */ public java.lang.Object generate(Object[] references) { /* 002 */ return new SpecificSafeProjection(references); /* 003 */ } /* 004 */ /* 005 */ class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { /* 006 */ /* 007 */ private Object[] references; /* 008 */ private InternalRow mutableRow; /* 009 */ private $line14.$read$$iw$$iw$Foo1 argValue; /* 010 */ private $line16.$read$$iw$$iw$Foo2 argValue1; /* 011 */ private int argValue2; /* 012 */ private java.lang.String argValue3; /* 013 */ private scala.collection.Seq argValue4; /* 014 */ private boolean resultIsNull; /* 015 */ private java.lang.Object[] argValue5; /* 016 */ private boolean MapObjects_loopIsNull1; /* 017 */ private int MapObjects_loopValue0; /* 018 */ private int argValue6; /* 019 */ private java.lang.String argValue7; /* 020 */ private boolean isNull31; /* 021 */ private boolean value31; /* 022 */ private boolean isNull32; /* 023 */ private $line16.$read$$iw$$iw$Foo2 value32; /* 024 */ private boolean isNull33; /* 025 */ private $line16.$read$$iw$$iw$Foo2 value33; /* 026 */ /* 027 */ public SpecificSafeProjection(Object[] references) { /* 028 */ this.references = references; /* 029 */ mutableRow = (InternalRow) references[references.length - 1]; /* 030 */ /* 031 */ /* 032 */ /* 033 */ /* 034 */ /* 035 */ /* 036 */ /* 037 */ /* 038 */ /* 039 */ /* 040 */ /* 041 */ isNull31 = false; /* 042 */ value31 = false; /* 
043 */ isNull32 = false; /* 044 */ value32 = null; /* 045 */ isNull33 = false; /* 046 */ value33 = null; /* 047 */ /* 048 */ } /* 049 */ /* 050 */ public void initialize(int partitionIndex) { /* 051 */ /* 052 */ } /* 053 */ /* 054 */ /* 055 */ private void evalIfTrueExpr(InternalRow i) { /* 056 */ final $line16.$read$$iw$$iw$Foo2 value22 = null; /* 057 */ isNull32 = true; /* 058 */ value32 = value22; /* 059 */ } /* 060 */ /* 061 */ /* 062 */ private void apply_1(InternalRow i) { /* 063 */ /* 064 */ /* 065 */ resultIsNull = false; /* 066 */ if (!resultIsNull) { /* 067 */ /* 068 */ InternalRow value16 = i.getStruct(0, 3); /* 069 */ boolean isNull15 = false; /* 070 */ ArrayData value15 = null; /* 071 */ /* 072 */ /* 073 */ if (value16.isNullAt(2)) { /* 074 */ isNull15 = true; /* 075 */ } else { /* 076 */ value15 = value16.getArray(2); /* 077 */ } /* 078 */ ArrayData value14 = null; /* 079 */ /* 080 */ if (!isNull15) { /* 081 */ /* 082 */ Integer[] convertedArray = null; /* 083 */ int dataLength = value15.numElements(); /* 084 */ convertedArray = new Integer[dataLength]; /* 085 */ /*
[jira] [Commented] (SPARK-22296) CodeGenerator - failed to compile when constructor has scala.collection.mutable.Seq vs. scala.collection.Seq
[ https://issues.apache.org/jira/browse/SPARK-22296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212983#comment-16212983 ] Randy Tidd commented on SPARK-22296: The example above does not exhibit the problem. Here is a concise example that does. {quote}% spark-shell Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 17/10/20 14:09:04 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 17/10/20 14:09:08 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException Spark context Web UI available at http://10.10.43.134:4040 Spark context available as 'sc' (master = local[*], app id = local-1508522945163). Spark session available as 'spark'. Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.1.0 /_/ Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131) Type in expressions to have them evaluated. Type :help for more information. scala> case class Foo1(x: Int, s: String, seq: scala.collection.mutable.Seq[Int] = scala.collection.mutable.Seq.empty[Int]) defined class Foo1 scala> val ds1 = Seq(Foo1(1, "a"), Foo1(2, "b")).toDS ds1: org.apache.spark.sql.Dataset[Foo1] = [x: int, s: string ... 1 more field] scala> case class Foo2(x: Int, s: String) defined class Foo2 scala> val ds2 = Seq(Foo2(1, "aa"), Foo2(3, "cc")).toDS ds2: org.apache.spark.sql.Dataset[Foo2] = [x: int, s: string] scala> ds1.joinWith(ds2, ds1.col("x") === ds2.col("x")).collect() 17/10/20 14:09:14 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 263, Column 69: No applicable constructor/method found for actual parameters "int, java.lang.String, scala.collection.Seq"; candidates are: "$line14.$read$$iw$$iw$Foo1(int, java.lang.String, scala.collection.mutable.Seq)" /* 001 */ public java.lang.Object generate(Object[] references) { /* 002 */ return new SpecificSafeProjection(references); /* 003 */ } /* 004 */ /* 005 */ class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { /* 006 */ /* 007 */ private Object[] references; /* 008 */ private InternalRow mutableRow; /* 009 */ private $line14.$read$$iw$$iw$Foo1 argValue; /* 010 */ private $line16.$read$$iw$$iw$Foo2 argValue1; /* 011 */ private int argValue2; /* 012 */ private java.lang.String argValue3; /* 013 */ private scala.collection.Seq argValue4; /* 014 */ private boolean resultIsNull; /* 015 */ private java.lang.Object[] argValue5; /* 016 */ private boolean MapObjects_loopIsNull1; /* 017 */ private int MapObjects_loopValue0; /* 018 */ private int argValue6; /* 019 */ private java.lang.String argValue7; /* 020 */ private boolean isNull31; /* 021 */ private boolean value31; /* 022 */ private boolean isNull32; /* 023 */ private $line16.$read$$iw$$iw$Foo2 value32; /* 024 */ private boolean isNull33; /* 025 */ private $line16.$read$$iw$$iw$Foo2 value33; /* 026 */ /* 027 */ public SpecificSafeProjection(Object[] references) { /* 028 */ this.references = references; /* 029 */ mutableRow = (InternalRow) references[references.length - 1]; /* 030 */ /* 031 */ /* 032 */ /* 033 */ /* 034 */ /* 035 */ /* 036 */ /* 037 */ /* 038 */ /* 039 */ /* 040 */ /* 041 */ isNull31 = false; /* 042 */ value31 = false; /* 043 */ isNull32 = 
false; /* 044 */ value32 = null; /* 045 */ isNull33 = false; /* 046 */ value33 = null; /* 047 */ /* 048 */ } /* 049 */ /* 050 */ public void initialize(int partitionIndex) { /* 051 */ /* 052 */ } /* 053 */ /* 054 */ /* 055 */ private void evalIfTrueExpr(InternalRow i) { /* 056 */ final $line16.$read$$iw$$iw$Foo2 value22 = null; /* 057 */ isNull32 = true; /* 058 */ value32 = value22; /* 059 */ } /* 060 */ /* 061 */ /* 062 */ private void apply_1(InternalRow i) { /* 063 */ /* 064 */ /* 065 */ resultIsNull = false; /* 066 */ if (!resultIsNull) { /* 067 */ /* 068 */ InternalRow value16 = i.getStruct(0, 3); /* 069 */ boolean isNull15 = false; /* 070 */ ArrayData value15 = null; /* 071 */ /* 072 */ /* 073 */ if (value16.isNullAt(2)) { /* 074 */ isNull15 = true; /* 075 */ } else { /* 076 */ value15 = value16.getArray(2); /* 077 */ } /* 078 */ ArrayData value14 = null; /* 079 */ /* 080 */ if (!isNull15) { /* 081 */ /* 082 */ Integer[] convertedArray = null; /* 083 */ int dataLength = value15.numElements(); /* 084 */ convertedArray = new Integer[dataLength]; /* 085 */ /* 086 */ int loopIndex = 0; /* 087 */
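The root cause above is that the generated deserializer passes a scala.collection.Seq to a constructor that only accepts scala.collection.mutable.Seq. A minimal workaround sketch, offered here as an illustration rather than anything confirmed in this ticket, is to declare the encoded field as plain Seq so the constructor matches what the deserializer produces (class and value names below are hypothetical):
{code}
// Hypothetical workaround sketch: use scala.collection.Seq in the case class so the
// constructor accepts what Spark's generated deserializer builds.
import org.apache.spark.sql.SparkSession

object SeqFieldWorkaround {
  case class Foo1(x: Int, s: String, seq: Seq[Int] = Seq.empty[Int])
  case class Foo2(x: Int, s: String)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("seq-field-workaround").getOrCreate()
    import spark.implicits._

    val ds1 = Seq(Foo1(1, "a"), Foo1(2, "b")).toDS()
    val ds2 = Seq(Foo2(1, "aa"), Foo2(3, "cc")).toDS()

    // This join compiles the generated code successfully because Foo1's constructor
    // takes scala.collection.Seq, matching the deserialized value.
    ds1.joinWith(ds2, ds1.col("x") === ds2.col("x")).show()

    spark.stop()
  }
}
{code}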
[jira] [Commented] (SPARK-21926) Compatibility between ML Transformers and Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-21926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212962#comment-16212962 ] Joseph K. Bradley commented on SPARK-21926: --- One more report from the dev list: HashingTF and IDFModel fail with Structured Streaming: http://apache-spark-developers-list.1001551.n3.nabble.com/HashingTFModel-IDFModel-in-Structured-Streaming-td22680.html > Compatibility between ML Transformers and Structured Streaming > -- > > Key: SPARK-21926 > URL: https://issues.apache.org/jira/browse/SPARK-21926 > Project: Spark > Issue Type: Umbrella > Components: ML, Structured Streaming >Affects Versions: 2.2.0 >Reporter: Bago Amirbekian > > We've run into a few cases where ML components don't play nice with streaming > dataframes (for prediction). This ticket is meant to help aggregate these > known cases in one place and provide a place to discuss possible fixes. > Failing cases: > 1) VectorAssembler where one of the inputs is a VectorUDT column with no > metadata. > Possible fixes: > a) Re-design vectorUDT metadata to support missing metadata for some > elements. (This might be a good thing to do anyways SPARK-19141) > b) drop metadata in streaming context. > 2) OneHotEncoder where the input is a column with no metadata. > Possible fixes: > a) Make OneHotEncoder an estimator (SPARK-13030). > b) Allow user to set the cardinality of OneHotEncoder. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
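For readers unfamiliar with the failure mode being collected here, the pattern is simply calling transform on a feature transformer against a streaming DataFrame at prediction time. A minimal sketch under assumed names, source, and schema, not taken from any of the reports above:
{code}
// Illustrative sketch of the usage this umbrella tracks: an ML transformer applied to
// a streaming DataFrame. Source, schema, and sink below are examples only.
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object StreamingTransformSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("streaming-ml-sketch").getOrCreate()

    // The rate source yields a streaming DataFrame with (timestamp, value) columns.
    val streamDf = spark.readStream.format("rate").option("rowsPerSecond", "10").load()

    // VectorAssembler is one of the components tracked here; with purely numeric inputs
    // this works, while a VectorUDT input column without metadata fails (failing case 1).
    val assembler = new VectorAssembler().setInputCols(Array("value")).setOutputCol("features")
    val assembled = assembler.transform(streamDf)

    val query = assembled.writeStream.format("console").start()
    query.awaitTermination(10000L)
    spark.stop()
  }
}
{code}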
[jira] [Updated] (SPARK-21926) Compatibility between ML Transformers and Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-21926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-21926: -- Shepherd: Joseph K. Bradley > Compatibility between ML Transformers and Structured Streaming > -- > > Key: SPARK-21926 > URL: https://issues.apache.org/jira/browse/SPARK-21926 > Project: Spark > Issue Type: Umbrella > Components: ML, Structured Streaming >Affects Versions: 2.2.0 >Reporter: Bago Amirbekian > > We've run into a few cases where ML components don't play nice with streaming > dataframes (for prediction). This ticket is meant to help aggregate these > known cases in one place and provide a place to discuss possible fixes. > Failing cases: > 1) VectorAssembler where one of the inputs is a VectorUDT column with no > metadata. > Possible fixes: > a) Re-design vectorUDT metadata to support missing metadata for some > elements. (This might be a good thing to do anyways SPARK-19141) > b) drop metadata in streaming context. > 2) OneHotEncoder where the input is a column with no metadata. > Possible fixes: > a) Make OneHotEncoder an estimator (SPARK-13030). > b) Allow user to set the cardinality of OneHotEncoder. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21926) Compatibility between ML Transformers and Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-21926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-21926: -- Summary: Compatibility between ML Transformers and Structured Streaming (was: Some transformers in spark.ml.feature fail when trying to transform streaming dataframes) > Compatibility between ML Transformers and Structured Streaming > -- > > Key: SPARK-21926 > URL: https://issues.apache.org/jira/browse/SPARK-21926 > Project: Spark > Issue Type: Bug > Components: ML, Structured Streaming >Affects Versions: 2.2.0 >Reporter: Bago Amirbekian > > We've run into a few cases where ML components don't play nice with streaming > dataframes (for prediction). This ticket is meant to help aggregate these > known cases in one place and provide a place to discuss possible fixes. > Failing cases: > 1) VectorAssembler where one of the inputs is a VectorUDT column with no > metadata. > Possible fixes: > a) Re-design vectorUDT metadata to support missing metadata for some > elements. (This might be a good thing to do anyways SPARK-19141) > b) drop metadata in streaming context. > 2) OneHotEncoder where the input is a column with no metadata. > Possible fixes: > a) Make OneHotEncoder an estimator (SPARK-13030). > b) Allow user to set the cardinality of OneHotEncoder. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21926) Compatibility between ML Transformers and Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-21926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-21926: -- Issue Type: Umbrella (was: Bug) > Compatibility between ML Transformers and Structured Streaming > -- > > Key: SPARK-21926 > URL: https://issues.apache.org/jira/browse/SPARK-21926 > Project: Spark > Issue Type: Umbrella > Components: ML, Structured Streaming >Affects Versions: 2.2.0 >Reporter: Bago Amirbekian > > We've run into a few cases where ML components don't play nice with streaming > dataframes (for prediction). This ticket is meant to help aggregate these > known cases in one place and provide a place to discuss possible fixes. > Failing cases: > 1) VectorAssembler where one of the inputs is a VectorUDT column with no > metadata. > Possible fixes: > a) Re-design vectorUDT metadata to support missing metadata for some > elements. (This might be a good thing to do anyways SPARK-19141) > b) drop metadata in streaming context. > 2) OneHotEncoder where the input is a column with no metadata. > Possible fixes: > a) Make OneHotEncoder an estimator (SPARK-13030). > b) Allow user to set the cardinality of OneHotEncoder. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19007) Speedup and optimize the GradientBoostedTrees in the "data>memory" scene
[ https://issues.apache.org/jira/browse/SPARK-19007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212932#comment-16212932 ] Michael Mior commented on SPARK-19007: -- I see the following statement in the PR discussion, but I don't understand why this causes a problem. bq. it had to do with the fact that RDDs may be materialized later than checkpointer.update() gets called. > Speedup and optimize the GradientBoostedTrees in the "data>memory" scene > > > Key: SPARK-19007 > URL: https://issues.apache.org/jira/browse/SPARK-19007 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.0.1, 2.0.2, 2.1.0 > Environment: A CDH cluster consisting of 3 RedHat servers (120G > memory, 40 cores, 43TB disk per server). >Reporter: zhangdenghui >Priority: Minor > Original Estimate: 168h > Remaining Estimate: 168h > > Test data: 80G CTR training data from > criteolabs (http://criteolabs.wpengine.com/downloads/download-terabyte-click-logs/ > ). I used 1 of the 24 days' data. Some features needed to be replaced by newly > generated continuous features; the way to generate the new features follows the > approach described in the xgboost paper. > Resources allocated: spark on yarn, 20 executors, 8G memory and 2 cores per > executor. > Parameters: numIterations 10, maxDepth 8, the rest of the parameters are defaults. > I tested the GradientBoostedTrees algorithm in mllib using the 80G CTR data > mentioned above. > It took 1.5 hours in total, and I found many task failures after 6 or 7 GBT > rounds. Without these task failures and task retries it could be much faster, > which would save about half the time. I think it's caused by the RDD named > predError in the while loop of the boost method in > GradientBoostedTrees.scala: the lineage of predError keeps > growing after every GBT round, which then caused failures like this: > (ExecutorLostFailure (executor 6 exited caused by one of the running tasks) > Reason: Container killed by YARN for exceeding memory limits. 10.2 GB of 10 > GB physical memory used. Consider boosting > spark.yarn.executor.memoryOverhead.). > I tried boosting spark.yarn.executor.memoryOverhead but the memory it > needs is too much (even increasing the memory by half can't solve the problem), > so I think it's not a proper fix. > Setting the checkpoint interval smaller would cut the > lineage, but it increases IO cost a lot. > I tried another way to solve this problem: I persisted the RDD named > predError every round and used pre_predError to record the previous RDD and > unpersist it, because it's not needed anymore. > With this method it took about 45 min, with no task failures > and no additional memory. > So when the data is much larger than memory, my small improvement can > speed up GradientBoostedTrees by 1-2x. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19007) Speedup and optimize the GradientBoostedTrees in the "data>memory" scene
[ https://issues.apache.org/jira/browse/SPARK-19007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212924#comment-16212924 ] Michael Mior commented on SPARK-19007: -- [~josephkb] I'm confused about why it's necessary to cache more than one RDD in the queue. It seems like there should never be a need for data from previous iterations if there's enough memory to cache the previous iteration. And if there isn't, trying to cache even more data seems like it would just make things worse. > Speedup and optimize the GradientBoostedTrees in the "data>memory" scene > > > Key: SPARK-19007 > URL: https://issues.apache.org/jira/browse/SPARK-19007 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.0.1, 2.0.2, 2.1.0 > Environment: A CDH cluster consisting of 3 RedHat servers (120G > memory, 40 cores, 43TB disk per server). >Reporter: zhangdenghui >Priority: Minor > Original Estimate: 168h > Remaining Estimate: 168h > > Test data: 80G CTR training data from > criteolabs (http://criteolabs.wpengine.com/downloads/download-terabyte-click-logs/ > ). I used 1 of the 24 days' data. Some features needed to be replaced by newly > generated continuous features; the way to generate the new features follows the > approach described in the xgboost paper. > Resources allocated: spark on yarn, 20 executors, 8G memory and 2 cores per > executor. > Parameters: numIterations 10, maxDepth 8, the rest of the parameters are defaults. > I tested the GradientBoostedTrees algorithm in mllib using the 80G CTR data > mentioned above. > It took 1.5 hours in total, and I found many task failures after 6 or 7 GBT > rounds. Without these task failures and task retries it could be much faster, > which would save about half the time. I think it's caused by the RDD named > predError in the while loop of the boost method in > GradientBoostedTrees.scala: the lineage of predError keeps > growing after every GBT round, which then caused failures like this: > (ExecutorLostFailure (executor 6 exited caused by one of the running tasks) > Reason: Container killed by YARN for exceeding memory limits. 10.2 GB of 10 > GB physical memory used. Consider boosting > spark.yarn.executor.memoryOverhead.). > I tried boosting spark.yarn.executor.memoryOverhead but the memory it > needs is too much (even increasing the memory by half can't solve the problem), > so I think it's not a proper fix. > Setting the checkpoint interval smaller would cut the > lineage, but it increases IO cost a lot. > I tried another way to solve this problem: I persisted the RDD named > predError every round and used pre_predError to record the previous RDD and > unpersist it, because it's not needed anymore. > With this method it took about 45 min, with no task failures > and no additional memory. > So when the data is much larger than memory, my small improvement can > speed up GradientBoostedTrees by 1-2x. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
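The approach described in the ticket (persist the new predError each round, then unpersist the previous one once the new one is materialized) is a common pattern for keeping both the lineage and the cached footprint of an iterative job small. A minimal spark-shell-style sketch under assumed names, not the actual GradientBoostedTrees.scala code:
{code}
// Sketch of the persist/unpersist-previous pattern described above; the names
// predError and prevPredError mirror the description, but the code is illustrative.
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

def runRounds(initial: RDD[(Double, Double)], numIterations: Int)
             (updateOneRound: RDD[(Double, Double)] => RDD[(Double, Double)]): RDD[(Double, Double)] = {
  var predError = initial.persist(StorageLevel.MEMORY_AND_DISK)
  var prevPredError = predError
  var i = 0
  while (i < numIterations) {
    predError = updateOneRound(prevPredError).persist(StorageLevel.MEMORY_AND_DISK)
    // Materialize the new RDD before dropping the old one, so later actions read the
    // cached data instead of recomputing an ever-growing lineage.
    predError.count()
    prevPredError.unpersist(blocking = false)
    prevPredError = predError
    i += 1
  }
  predError
}
{code}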
[jira] [Comment Edited] (SPARK-22209) PySpark does not recognize imports from submodules
[ https://issues.apache.org/jira/browse/SPARK-22209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212890#comment-16212890 ] Bryan Cutler edited comment on SPARK-22209 at 10/20/17 5:01 PM: It does seem like a bug to me so it should be fixed, I suspect it is in Sparks cloudpickle module. Would you be able to submit a patch for this? If not, I can try to take a look. cc [~holden.ka...@gmail.com] was (Author: bryanc): It does seem like a bug to me so it should be fixed, I suspect it is in Sparks cloudpickle module. Would you be able to submit a patch for this? If not, I can try to take a look. > PySpark does not recognize imports from submodules > -- > > Key: SPARK-22209 > URL: https://issues.apache.org/jira/browse/SPARK-22209 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0, 2.3.0 > Environment: Anaconda 4.4.0, Python 3.6, Hadoop 2.7, CDH 5.3.3, JDK > 1.8, Centos 6 >Reporter: Joel Croteau >Priority: Minor > > Using submodule syntax inside a PySpark job seems to create issues. For > example, the following: > {code:python} > import scipy.sparse > from pyspark import SparkContext, SparkConf > def do_stuff(x): > y = scipy.sparse.dok_matrix((1, 1)) > y[0, 0] = x > return y[0, 0] > def init_context(): > conf = SparkConf().setAppName("Spark Test") > sc = SparkContext(conf=conf) > return sc > def main(): > sc = init_context() > data = sc.parallelize([1, 2, 3, 4]) > output_data = data.map(do_stuff) > print(output_data.collect()) > __name__ == '__main__' and main() > {code} > produces this error: > {noformat} > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087) > at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at 
org.apache.spark.rdd.RDD.withScope(RDD.scala:362) > at org.apache.spark.rdd.RDD.collect(RDD.scala:935) > at > org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:458) > at > org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:280) > at > py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.r
[jira] [Commented] (SPARK-22209) PySpark does not recognize imports from submodules
[ https://issues.apache.org/jira/browse/SPARK-22209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212890#comment-16212890 ] Bryan Cutler commented on SPARK-22209: -- It does seem like a bug to me so it should be fixed, I suspect it is in Sparks cloudpickle module. Would you be able to submit a patch for this? If not, I can try to take a look. > PySpark does not recognize imports from submodules > -- > > Key: SPARK-22209 > URL: https://issues.apache.org/jira/browse/SPARK-22209 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0, 2.3.0 > Environment: Anaconda 4.4.0, Python 3.6, Hadoop 2.7, CDH 5.3.3, JDK > 1.8, Centos 6 >Reporter: Joel Croteau >Priority: Minor > > Using submodule syntax inside a PySpark job seems to create issues. For > example, the following: > {code:python} > import scipy.sparse > from pyspark import SparkContext, SparkConf > def do_stuff(x): > y = scipy.sparse.dok_matrix((1, 1)) > y[0, 0] = x > return y[0, 0] > def init_context(): > conf = SparkConf().setAppName("Spark Test") > sc = SparkContext(conf=conf) > return sc > def main(): > sc = init_context() > data = sc.parallelize([1, 2, 3, 4]) > output_data = data.map(do_stuff) > print(output_data.collect()) > __name__ == '__main__' and main() > {code} > produces this error: > {noformat} > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087) > at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) > at org.apache.spark.rdd.RDD.collect(RDD.scala:935) > at > org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:458) > at > org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala) > at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:280) > at > py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:214) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.spark.api.python.PythonException: Traceback (most > recent call last): > File > "/home/matt/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", > line 177, in m
[jira] [Resolved] (SPARK-21055) Support grouping__id
[ https://issues.apache.org/jira/browse/SPARK-21055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-21055. - Resolution: Fixed Assignee: cen yuhai Fix Version/s: 2.3.0 > Support grouping__id > > > Key: SPARK-21055 > URL: https://issues.apache.org/jira/browse/SPARK-21055 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 > Environment: spark2.1.1 >Reporter: cen yuhai >Assignee: cen yuhai > Fix For: 2.3.0 > > > Now, spark doesn't support grouping__id, spark provide another function > grouping_id() to workaround. > If use grouping_id(), many scripts need to change and supporting > grouping__id is very easy, why not? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
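For context, the two spellings involved are compared below in a small spark-shell-style sketch; the data, table, and column names are examples only, and the Hive-style grouping__id spelling works once this fix (2.3.0) is in place:
{code}
// Illustrative comparison of grouping_id() and the Hive-style grouping__id spelling.
// Example data only; run in spark-shell so spark and the implicits are available.
Seq(("x", "p", 1), ("x", "q", 2), ("y", "p", 3)).toDF("a", "b", "v").createOrReplaceTempView("t")

// Built-in function that has been available as the workaround:
spark.sql("SELECT a, b, grouping_id() AS gid, count(*) AS cnt FROM t GROUP BY a, b WITH ROLLUP").show()

// Hive-compatible spelling that existing scripts use; supported after this fix:
spark.sql("SELECT a, b, grouping__id AS gid, count(*) AS cnt FROM t GROUP BY a, b WITH ROLLUP").show()
{code}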
[jira] [Commented] (SPARK-22323) Design doc for different types of pandas_udf
[ https://issues.apache.org/jira/browse/SPARK-22323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212778#comment-16212778 ] Apache Spark commented on SPARK-22323: -- User 'icexelloss' has created a pull request for this issue: https://github.com/apache/spark/pull/19544 > Design doc for different types of pandas_udf > > > Key: SPARK-22323 > URL: https://issues.apache.org/jira/browse/SPARK-22323 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Li Jin >Assignee: Apache Spark > Fix For: 2.3.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22323) Design doc for different types of pandas_udf
[ https://issues.apache.org/jira/browse/SPARK-22323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22323: Assignee: Apache Spark > Design doc for different types of pandas_udf > > > Key: SPARK-22323 > URL: https://issues.apache.org/jira/browse/SPARK-22323 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Li Jin >Assignee: Apache Spark > Fix For: 2.3.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22323) Design doc for different types of pandas_udf
[ https://issues.apache.org/jira/browse/SPARK-22323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22323: Assignee: (was: Apache Spark) > Design doc for different types of pandas_udf > > > Key: SPARK-22323 > URL: https://issues.apache.org/jira/browse/SPARK-22323 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Li Jin > Fix For: 2.3.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19606) Support constraints in spark-dispatcher
[ https://issues.apache.org/jira/browse/SPARK-19606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212770#comment-16212770 ] Apache Spark commented on SPARK-19606: -- User 'pmackles' has created a pull request for this issue: https://github.com/apache/spark/pull/19543 > Support constraints in spark-dispatcher > --- > > Key: SPARK-19606 > URL: https://issues.apache.org/jira/browse/SPARK-19606 > Project: Spark > Issue Type: New Feature > Components: Mesos >Affects Versions: 2.1.0 >Reporter: Philipp Hoffmann > > The `spark.mesos.constraints` configuration is ignored by the > spark-dispatcher. The constraints need to be passed in the Framework > information when registering with Mesos. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22323) Design doc for different types of pandas_udf
Li Jin created SPARK-22323: -- Summary: Design doc for different types of pandas_udf Key: SPARK-22323 URL: https://issues.apache.org/jira/browse/SPARK-22323 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 2.2.0 Reporter: Li Jin Fix For: 2.3.0 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21750) Use arrow 0.6.0
[ https://issues.apache.org/jira/browse/SPARK-21750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212741#comment-16212741 ] Dongjoon Hyun commented on SPARK-21750: --- Hi, [~kiszk]. Two more Arrow releases seem to be out. How about the Python side? Can we catch up some? - 0.7.1 (1 October 2017) - 0.7.0 (17 September 2017) > Use arrow 0.6.0 > --- > > Key: SPARK-21750 > URL: https://issues.apache.org/jira/browse/SPARK-21750 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki >Priority: Minor > > Since [Arrow 0.6.0|http://arrow.apache.org/release/0.6.0.html] has been > released, use the latest one -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21750) Use arrow 0.6.0
[ https://issues.apache.org/jira/browse/SPARK-21750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212740#comment-16212740 ] Apache Spark commented on SPARK-21750: -- User 'kiszk' has created a pull request for this issue: https://github.com/apache/spark/pull/18974 > Use arrow 0.6.0 > --- > > Key: SPARK-21750 > URL: https://issues.apache.org/jira/browse/SPARK-21750 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki >Priority: Minor > > Since [Arrow 0.6.0|http://arrow.apache.org/release/0.6.0.html] has been > released, use the latest one -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22322) Update FutureAction for compatibility with Scala 2.12 future
Sean Owen created SPARK-22322: - Summary: Update FutureAction for compatibility with Scala 2.12 future Key: SPARK-22322 URL: https://issues.apache.org/jira/browse/SPARK-22322 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 2.3.0 Reporter: Sean Owen Assignee: Sean Owen Priority: Minor The latest item on the road to Scala 2.12 is a change in Future, which adds new transform and transformWith methods. The current stub implementation in place doesn't work, as it ends up being used by Scala APIs. The change is simple: implement pass-through implementations in SimpleFutureAction and ComplexFutureAction, while making sure the method definitions compile in Scala 2.11 (where they'd be unused). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
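As an illustration of what "pass-through" means here, the sketch below shows a wrapper over a Scala Future delegating the new methods to the future it wraps. The class and member names are hypothetical, not Spark's actual FutureAction code, and the sketch assumes compilation against Scala 2.12, where transform and transformWith with these signatures exist:
{code}
// Illustrative pass-through sketch only (not Spark's FutureAction implementation).
// Assumes a Scala 2.12 build, where these Future overloads are available.
import scala.concurrent.{ExecutionContext, Future}
import scala.util.Try

class PassThroughFuture[T](underlying: Future[T]) {
  def transform[S](f: Try[T] => Try[S])(implicit ec: ExecutionContext): Future[S] =
    underlying.transform(f)

  def transformWith[S](f: Try[T] => Future[S])(implicit ec: ExecutionContext): Future[S] =
    underlying.transformWith(f)
}
{code}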
[jira] [Commented] (SPARK-22320) ORC should support VectorUDT/MatrixUDT
[ https://issues.apache.org/jira/browse/SPARK-22320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212640#comment-16212640 ] Dongjoon Hyun commented on SPARK-22320: --- Thank you, [~hyukjin.kwon]! > ORC should support VectorUDT/MatrixUDT > -- > > Key: SPARK-22320 > URL: https://issues.apache.org/jira/browse/SPARK-22320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.2, 2.2.0 >Reporter: zhengruifeng > > I save dataframe containing vectors in ORC format, when I read it back, the > format is changed. > {code} > scala> import org.apache.spark.ml.linalg._ > import org.apache.spark.ml.linalg._ > scala> val data = Seq((1,Vectors.dense(1.0,2.0)), (2,Vectors.sparse(8, > Array(4), Array(1.0 > data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((1,[1.0,2.0]), > (2,(8,[4],[1.0]))) > scala> val df = data.toDF("i", "vec") > df: org.apache.spark.sql.DataFrame = [i: int, vec: vector] > scala> df.schema > res0: org.apache.spark.sql.types.StructType = > StructType(StructField(i,IntegerType,false), > StructField(vec,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true)) > scala> df.write.orc("/tmp/123") > scala> val df2 = spark.sqlContext.read.orc("/tmp/123") > df2: org.apache.spark.sql.DataFrame = [i: int, vec: struct size: int ... 2 more fields>] > scala> df2.schema > res3: org.apache.spark.sql.types.StructType = > StructType(StructField(i,IntegerType,true), > StructField(vec,StructType(StructField(type,ByteType,true), > StructField(size,IntegerType,true), > StructField(indices,ArrayType(IntegerType,true),true), > StructField(values,ArrayType(DoubleType,true),true)),true)) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22320) ORC should support VectorUDT/MatrixUDT
[ https://issues.apache.org/jira/browse/SPARK-22320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212638#comment-16212638 ] Hyukjin Kwon commented on SPARK-22320: -- {quote} Is this ORC only issue? What about Parquet? {quote} This also happens in JSON: {code} import org.apache.spark.ml.linalg._ val data = Seq((1,Vectors.dense(1.0,2.0)), (2,Vectors.sparse(8, Array(4), Array(1.0)))) val df = data.toDF("i", "vec") df.write.json("/tmp/foo") spark.read.json("/tmp/foo").schema {code} {code} res1: org.apache.spark.sql.types.StructType = StructType(StructField(i,LongType,true), StructField(vec,StructType(StructField(indices,ArrayType(LongType,true),true), StructField(size,LongType,true), StructField(type,LongType,true), StructField(values,ArrayType(DoubleType,true),true)),true)) {code} Parquet seems fine: {code} import org.apache.spark.ml.linalg._ val data = Seq((1,Vectors.dense(1.0,2.0)), (2,Vectors.sparse(8, Array(4), Array(1.0)))) val df = data.toDF("i", "vec") df.write.parquet("/tmp/bar") spark.read.parquet("/tmp/bar").schema {code} {code} res1: org.apache.spark.sql.types.StructType = StructType(StructField(i,IntegerType,true), StructField(vec,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true)) {code} {quote} Is this a regression in master branch? Could you check official Spark 2.2.0? {quote} This does not look like a regression. I checked master too and it still happens. Will take a closer look soon. > ORC should support VectorUDT/MatrixUDT > -- > > Key: SPARK-22320 > URL: https://issues.apache.org/jira/browse/SPARK-22320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.2, 2.2.0 >Reporter: zhengruifeng > > I save dataframe containing vectors in ORC format, when I read it back, the > format is changed. > {code} > scala> import org.apache.spark.ml.linalg._ > import org.apache.spark.ml.linalg._ > scala> val data = Seq((1,Vectors.dense(1.0,2.0)), (2,Vectors.sparse(8, > Array(4), Array(1.0 > data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((1,[1.0,2.0]), > (2,(8,[4],[1.0]))) > scala> val df = data.toDF("i", "vec") > df: org.apache.spark.sql.DataFrame = [i: int, vec: vector] > scala> df.schema > res0: org.apache.spark.sql.types.StructType = > StructType(StructField(i,IntegerType,false), > StructField(vec,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true)) > scala> df.write.orc("/tmp/123") > scala> val df2 = spark.sqlContext.read.orc("/tmp/123") > df2: org.apache.spark.sql.DataFrame = [i: int, vec: struct size: int ... 2 more fields>] > scala> df2.schema > res3: org.apache.spark.sql.types.StructType = > StructType(StructField(i,IntegerType,true), > StructField(vec,StructType(StructField(type,ByteType,true), > StructField(size,IntegerType,true), > StructField(indices,ArrayType(IntegerType,true),true), > StructField(values,ArrayType(DoubleType,true),true)),true)) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
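Until UDT columns round-trip through ORC (and JSON), one possible workaround, offered only as a sketch rather than anything from this ticket, is to flatten the vector to a plain array column before writing; paths and names below are examples:
{code}
// Hypothetical workaround sketch (spark-shell style): store the vector as a plain
// array<double> column, which ORC round-trips without losing the schema.
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.udf

val toArray = udf((v: Vector) => v.toArray)
val toVector = udf((a: Seq[Double]) => Vectors.dense(a.toArray))

val data = Seq((1, Vectors.dense(1.0, 2.0)), (2, Vectors.sparse(8, Array(4), Array(1.0))))
val df = data.toDF("i", "vec")

// Example output path; flatten the UDT column before writing.
df.withColumn("vec", toArray($"vec")).write.orc("/tmp/vec_as_array")

// Read back and rebuild the vector column (sparse vectors come back densified).
val restored = spark.read.orc("/tmp/vec_as_array").withColumn("vec", toVector($"vec"))
restored.printSchema()
{code}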
[jira] [Commented] (SPARK-22320) ORC should support VectorUDT/MatrixUDT
[ https://issues.apache.org/jira/browse/SPARK-22320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212637#comment-16212637 ] Dongjoon Hyun commented on SPARK-22320: --- Thank you, [~podongfeng]. I checked that and updated this JIRA info. > ORC should support VectorUDT/MatrixUDT > -- > > Key: SPARK-22320 > URL: https://issues.apache.org/jira/browse/SPARK-22320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.2, 2.2.0 >Reporter: zhengruifeng > > I save dataframe containing vectors in ORC format, when I read it back, the > format is changed. > {code} > scala> import org.apache.spark.ml.linalg._ > import org.apache.spark.ml.linalg._ > scala> val data = Seq((1,Vectors.dense(1.0,2.0)), (2,Vectors.sparse(8, > Array(4), Array(1.0 > data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((1,[1.0,2.0]), > (2,(8,[4],[1.0]))) > scala> val df = data.toDF("i", "vec") > df: org.apache.spark.sql.DataFrame = [i: int, vec: vector] > scala> df.schema > res0: org.apache.spark.sql.types.StructType = > StructType(StructField(i,IntegerType,false), > StructField(vec,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true)) > scala> df.write.orc("/tmp/123") > scala> val df2 = spark.sqlContext.read.orc("/tmp/123") > df2: org.apache.spark.sql.DataFrame = [i: int, vec: struct size: int ... 2 more fields>] > scala> df2.schema > res3: org.apache.spark.sql.types.StructType = > StructType(StructField(i,IntegerType,true), > StructField(vec,StructType(StructField(type,ByteType,true), > StructField(size,IntegerType,true), > StructField(indices,ArrayType(IntegerType,true),true), > StructField(values,ArrayType(DoubleType,true),true)),true)) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22320) ORC should support VectorUDT/MatrixUDT
[ https://issues.apache.org/jira/browse/SPARK-22320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212629#comment-16212629 ] zhengruifeng commented on SPARK-22320: -- I tested on Spark 2.2.0. Parquet works fine. I have not tried MatrixUDT. > ORC should support VectorUDT/MatrixUDT > -- > > Key: SPARK-22320 > URL: https://issues.apache.org/jira/browse/SPARK-22320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.2, 2.2.0 >Reporter: zhengruifeng > > I save dataframe containing vectors in ORC format, when I read it back, the > format is changed. > {code} > scala> import org.apache.spark.ml.linalg._ > import org.apache.spark.ml.linalg._ > scala> val data = Seq((1,Vectors.dense(1.0,2.0)), (2,Vectors.sparse(8, > Array(4), Array(1.0 > data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((1,[1.0,2.0]), > (2,(8,[4],[1.0]))) > scala> val df = data.toDF("i", "vec") > df: org.apache.spark.sql.DataFrame = [i: int, vec: vector] > scala> df.schema > res0: org.apache.spark.sql.types.StructType = > StructType(StructField(i,IntegerType,false), > StructField(vec,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true)) > scala> df.write.orc("/tmp/123") > scala> val df2 = spark.sqlContext.read.orc("/tmp/123") > df2: org.apache.spark.sql.DataFrame = [i: int, vec: struct size: int ... 2 more fields>] > scala> df2.schema > res3: org.apache.spark.sql.types.StructType = > StructType(StructField(i,IntegerType,true), > StructField(vec,StructType(StructField(type,ByteType,true), > StructField(size,IntegerType,true), > StructField(indices,ArrayType(IntegerType,true),true), > StructField(values,ArrayType(DoubleType,true),true)),true)) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22320) ORC should support VectorUDT/MatrixUDT
[ https://issues.apache.org/jira/browse/SPARK-22320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212624#comment-16212624 ] zhengruifeng commented on SPARK-22320: -- I tested on Spark 2.2.0. Parquet works fine. I have not tried MatrixUDT. -- Original Message -- From: "Dongjoon Hyun (JIRA)" ; Sent: Friday, October 20, 2017 19:34 To: "ruifengz" ; Subject: [jira] [Commented] (SPARK-22320) ORC should support VectorUDT/MatrixUDT [ https://issues.apache.org/jira/browse/SPARK-22320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212507#comment-16212507 ] Dongjoon Hyun commented on SPARK-22320: --- Thank you for reporting, [~podongfeng]. - Is this ORC only issue? What about Parquet? - Is this a regression in master branch? Could you check official Spark 2.2.0? -- This message was sent by Atlassian JIRA (v6.4.14#64029) > ORC should support VectorUDT/MatrixUDT > -- > > Key: SPARK-22320 > URL: https://issues.apache.org/jira/browse/SPARK-22320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.2, 2.2.0 >Reporter: zhengruifeng > > I save dataframe containing vectors in ORC format, when I read it back, the > format is changed. > {code} > scala> import org.apache.spark.ml.linalg._ > import org.apache.spark.ml.linalg._ > scala> val data = Seq((1,Vectors.dense(1.0,2.0)), (2,Vectors.sparse(8, > Array(4), Array(1.0 > data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((1,[1.0,2.0]), > (2,(8,[4],[1.0]))) > scala> val df = data.toDF("i", "vec") > df: org.apache.spark.sql.DataFrame = [i: int, vec: vector] > scala> df.schema > res0: org.apache.spark.sql.types.StructType = > StructType(StructField(i,IntegerType,false), > StructField(vec,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true)) > scala> df.write.orc("/tmp/123") > scala> val df2 = spark.sqlContext.read.orc("/tmp/123") > df2: org.apache.spark.sql.DataFrame = [i: int, vec: struct size: int ... 2 more fields>] > scala> df2.schema > res3: org.apache.spark.sql.types.StructType = > StructType(StructField(i,IntegerType,true), > StructField(vec,StructType(StructField(type,ByteType,true), > StructField(size,IntegerType,true), > StructField(indices,ArrayType(IntegerType,true),true), > StructField(values,ArrayType(DoubleType,true),true)),true)) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22305) HDFSBackedStateStoreProvider fails with StackOverflowException when attempting to recover state
[ https://issues.apache.org/jira/browse/SPARK-22305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212608#comment-16212608 ] Yuval Itzchakov commented on SPARK-22305: - [~tdas] What do you think? > HDFSBackedStateStoreProvider fails with StackOverflowException when > attempting to recover state > --- > > Key: SPARK-22305 > URL: https://issues.apache.org/jira/browse/SPARK-22305 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Yuval Itzchakov > > Environment: > Spark: 2.2.0 > Java version: 1.8.0_112 > spark.sql.streaming.minBatchesToRetain: 100 > After an application failure due to OOM exceptions, restarting the > application with the existing state produces the following OOM: > {code:java} > java.io.IOException: com.google.protobuf.ServiceException: > java.lang.StackOverflowError > at > org.apache.hadoop.ipc.ProtobufHelper.getRemoteException(ProtobufHelper.java:47) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:260) > at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > at com.sun.proxy.$Proxy18.getBlockLocations(Unknown Source) > at > org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1240) > at > org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1227) > at > org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1215) > at > org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:303) > at > org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:269) > at org.apache.hadoop.hdfs.DFSInputStream.(DFSInputStream.java:261) > at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1540) > at > org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:304) > at > org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:299) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:312) > at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:767) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$readSnapshotFile(HDFSBackedStateStoreProvider.scala:405) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:296) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:295) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:295) > at > 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$4.apply(HDFSBackedStateStoreProvider.scala:297) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$4.apply(HDFSBackedStateStoreProvider.scala:296) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:296) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:295) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.execution.streamin
[jira] [Commented] (SPARK-22320) ORC should support VectorUDT/MatrixUDT
[ https://issues.apache.org/jira/browse/SPARK-22320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212601#comment-16212601 ] Dongjoon Hyun commented on SPARK-22320: --- Ping [~hyukjin.kwon] since you worked on SPARK-17765. > ORC should support VectorUDT/MatrixUDT > -- > > Key: SPARK-22320 > URL: https://issues.apache.org/jira/browse/SPARK-22320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.2, 2.2.0 >Reporter: zhengruifeng > > I save dataframe containing vectors in ORC format, when I read it back, the > format is changed. > {code} > scala> import org.apache.spark.ml.linalg._ > import org.apache.spark.ml.linalg._ > scala> val data = Seq((1,Vectors.dense(1.0,2.0)), (2,Vectors.sparse(8, > Array(4), Array(1.0 > data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((1,[1.0,2.0]), > (2,(8,[4],[1.0]))) > scala> val df = data.toDF("i", "vec") > df: org.apache.spark.sql.DataFrame = [i: int, vec: vector] > scala> df.schema > res0: org.apache.spark.sql.types.StructType = > StructType(StructField(i,IntegerType,false), > StructField(vec,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true)) > scala> df.write.orc("/tmp/123") > scala> val df2 = spark.sqlContext.read.orc("/tmp/123") > df2: org.apache.spark.sql.DataFrame = [i: int, vec: struct size: int ... 2 more fields>] > scala> df2.schema > res3: org.apache.spark.sql.types.StructType = > StructType(StructField(i,IntegerType,true), > StructField(vec,StructType(StructField(type,ByteType,true), > StructField(size,IntegerType,true), > StructField(indices,ArrayType(IntegerType,true),true), > StructField(values,ArrayType(DoubleType,true),true)),true)) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22320) ORC should support VectorUDT/MatrixUDT
[ https://issues.apache.org/jira/browse/SPARK-22320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212581#comment-16212581 ] Dongjoon Hyun commented on SPARK-22320: --- I checked that 2.2.0 and 2.1.2 has the same problem and 2.0.2 fails with the following exception. {code} scala> df.write.mode("overwrite").orc("/tmp/o3") 17/10/20 05:34:42 ERROR Utils: Aborting task java.lang.ClassCastException: org.apache.spark.ml.linalg.VectorUDT cannot be cast to org.apache.spark.sql.types.StructType at org.apache.spark.sql.hive.HiveInspectors$class.wrap(HiveInspectors.scala:558) {code} > ORC should support VectorUDT/MatrixUDT > -- > > Key: SPARK-22320 > URL: https://issues.apache.org/jira/browse/SPARK-22320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.2, 2.2.0 >Reporter: zhengruifeng > > I save dataframe containing vectors in ORC format, when I read it back, the > format is changed. > {code} > scala> import org.apache.spark.ml.linalg._ > import org.apache.spark.ml.linalg._ > scala> val data = Seq((1,Vectors.dense(1.0,2.0)), (2,Vectors.sparse(8, > Array(4), Array(1.0 > data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((1,[1.0,2.0]), > (2,(8,[4],[1.0]))) > scala> val df = data.toDF("i", "vec") > df: org.apache.spark.sql.DataFrame = [i: int, vec: vector] > scala> df.schema > res0: org.apache.spark.sql.types.StructType = > StructType(StructField(i,IntegerType,false), > StructField(vec,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true)) > scala> df.write.orc("/tmp/123") > scala> val df2 = spark.sqlContext.read.orc("/tmp/123") > df2: org.apache.spark.sql.DataFrame = [i: int, vec: struct size: int ... 2 more fields>] > scala> df2.schema > res3: org.apache.spark.sql.types.StructType = > StructType(StructField(i,IntegerType,true), > StructField(vec,StructType(StructField(type,ByteType,true), > StructField(size,IntegerType,true), > StructField(indices,ArrayType(IntegerType,true),true), > StructField(values,ArrayType(DoubleType,true),true)),true)) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22320) ORC should support VectorUDT/MatrixUDT
[ https://issues.apache.org/jira/browse/SPARK-22320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-22320: -- Affects Version/s: 2.0.2 > ORC should support VectorUDT/MatrixUDT > -- > > Key: SPARK-22320 > URL: https://issues.apache.org/jira/browse/SPARK-22320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.2, 2.2.0 >Reporter: zhengruifeng > > I save dataframe containing vectors in ORC format, when I read it back, the > format is changed. > {code} > scala> import org.apache.spark.ml.linalg._ > import org.apache.spark.ml.linalg._ > scala> val data = Seq((1,Vectors.dense(1.0,2.0)), (2,Vectors.sparse(8, > Array(4), Array(1.0 > data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((1,[1.0,2.0]), > (2,(8,[4],[1.0]))) > scala> val df = data.toDF("i", "vec") > df: org.apache.spark.sql.DataFrame = [i: int, vec: vector] > scala> df.schema > res0: org.apache.spark.sql.types.StructType = > StructType(StructField(i,IntegerType,false), > StructField(vec,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true)) > scala> df.write.orc("/tmp/123") > scala> val df2 = spark.sqlContext.read.orc("/tmp/123") > df2: org.apache.spark.sql.DataFrame = [i: int, vec: struct size: int ... 2 more fields>] > scala> df2.schema > res3: org.apache.spark.sql.types.StructType = > StructType(StructField(i,IntegerType,true), > StructField(vec,StructType(StructField(type,ByteType,true), > StructField(size,IntegerType,true), > StructField(indices,ArrayType(IntegerType,true),true), > StructField(values,ArrayType(DoubleType,true),true)),true)) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22320) ORC should support VectorUDT/MatrixUDT
[ https://issues.apache.org/jira/browse/SPARK-22320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-22320: -- Affects Version/s: 2.1.2 > ORC should support VectorUDT/MatrixUDT > -- > > Key: SPARK-22320 > URL: https://issues.apache.org/jira/browse/SPARK-22320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.2, 2.2.0 >Reporter: zhengruifeng > > I save dataframe containing vectors in ORC format, when I read it back, the > format is changed. > {code} > scala> import org.apache.spark.ml.linalg._ > import org.apache.spark.ml.linalg._ > scala> val data = Seq((1,Vectors.dense(1.0,2.0)), (2,Vectors.sparse(8, > Array(4), Array(1.0 > data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((1,[1.0,2.0]), > (2,(8,[4],[1.0]))) > scala> val df = data.toDF("i", "vec") > df: org.apache.spark.sql.DataFrame = [i: int, vec: vector] > scala> df.schema > res0: org.apache.spark.sql.types.StructType = > StructType(StructField(i,IntegerType,false), > StructField(vec,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true)) > scala> df.write.orc("/tmp/123") > scala> val df2 = spark.sqlContext.read.orc("/tmp/123") > df2: org.apache.spark.sql.DataFrame = [i: int, vec: struct size: int ... 2 more fields>] > scala> df2.schema > res3: org.apache.spark.sql.types.StructType = > StructType(StructField(i,IntegerType,true), > StructField(vec,StructType(StructField(type,ByteType,true), > StructField(size,IntegerType,true), > StructField(indices,ArrayType(IntegerType,true),true), > StructField(values,ArrayType(DoubleType,true),true)),true)) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22320) ORC should support VectorUDT/MatrixUDT
[ https://issues.apache.org/jira/browse/SPARK-22320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-22320: -- Affects Version/s: (was: 2.3.0) > ORC should support VectorUDT/MatrixUDT > -- > > Key: SPARK-22320 > URL: https://issues.apache.org/jira/browse/SPARK-22320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: zhengruifeng > > I save dataframe containing vectors in ORC format, when I read it back, the > format is changed. > {code} > scala> import org.apache.spark.ml.linalg._ > import org.apache.spark.ml.linalg._ > scala> val data = Seq((1,Vectors.dense(1.0,2.0)), (2,Vectors.sparse(8, > Array(4), Array(1.0 > data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((1,[1.0,2.0]), > (2,(8,[4],[1.0]))) > scala> val df = data.toDF("i", "vec") > df: org.apache.spark.sql.DataFrame = [i: int, vec: vector] > scala> df.schema > res0: org.apache.spark.sql.types.StructType = > StructType(StructField(i,IntegerType,false), > StructField(vec,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true)) > scala> df.write.orc("/tmp/123") > scala> val df2 = spark.sqlContext.read.orc("/tmp/123") > df2: org.apache.spark.sql.DataFrame = [i: int, vec: struct size: int ... 2 more fields>] > scala> df2.schema > res3: org.apache.spark.sql.types.StructType = > StructType(StructField(i,IntegerType,true), > StructField(vec,StructType(StructField(type,ByteType,true), > StructField(size,IntegerType,true), > StructField(indices,ArrayType(IntegerType,true),true), > StructField(values,ArrayType(DoubleType,true),true)),true)) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22320) ORC should support VectorUDT/MatrixUDT
[ https://issues.apache.org/jira/browse/SPARK-22320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-22320: -- Affects Version/s: 2.2.0 > ORC should support VectorUDT/MatrixUDT > -- > > Key: SPARK-22320 > URL: https://issues.apache.org/jira/browse/SPARK-22320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: zhengruifeng > > I save dataframe containing vectors in ORC format, when I read it back, the > format is changed. > {code} > scala> import org.apache.spark.ml.linalg._ > import org.apache.spark.ml.linalg._ > scala> val data = Seq((1,Vectors.dense(1.0,2.0)), (2,Vectors.sparse(8, > Array(4), Array(1.0 > data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((1,[1.0,2.0]), > (2,(8,[4],[1.0]))) > scala> val df = data.toDF("i", "vec") > df: org.apache.spark.sql.DataFrame = [i: int, vec: vector] > scala> df.schema > res0: org.apache.spark.sql.types.StructType = > StructType(StructField(i,IntegerType,false), > StructField(vec,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true)) > scala> df.write.orc("/tmp/123") > scala> val df2 = spark.sqlContext.read.orc("/tmp/123") > df2: org.apache.spark.sql.DataFrame = [i: int, vec: struct size: int ... 2 more fields>] > scala> df2.schema > res3: org.apache.spark.sql.types.StructType = > StructType(StructField(i,IntegerType,true), > StructField(vec,StructType(StructField(type,ByteType,true), > StructField(size,IntegerType,true), > StructField(indices,ArrayType(IntegerType,true),true), > StructField(values,ArrayType(DoubleType,true),true)),true)) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22321) Improve Logging in the mesos scheduler
Stavros Kontopoulos created SPARK-22321: --- Summary: Improve Logging in the mesos scheduler Key: SPARK-22321 URL: https://issues.apache.org/jira/browse/SPARK-22321 Project: Spark Issue Type: Improvement Components: Mesos Affects Versions: 2.2.0 Reporter: Stavros Kontopoulos There are certain places where logging could be improved, for example when offers are rescinded or when tasks are not launched due to unmet constraints in MesosCoarseGrainedSchedulerBackend. There is valuable information at the Mesos layer that we could expose in order to get a better understanding, in one place, of what is happening. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22321) Improve logging in the mesos scheduler
[ https://issues.apache.org/jira/browse/SPARK-22321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated SPARK-22321: Summary: Improve logging in the mesos scheduler (was: Improve Logging in the mesos scheduler) > Improve logging in the mesos scheduler > -- > > Key: SPARK-22321 > URL: https://issues.apache.org/jira/browse/SPARK-22321 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 2.2.0 >Reporter: Stavros Kontopoulos > > There are certain places where logging must be improved. For example when > offers are rescinded or tasks were not launched due to unmet constraints in > MesosCoarseGrainedSchedulerBackend. There is valuable info at the mesos layer > we could expose to get a better understanding of what is happening in one > place. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
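A minimal sketch of the kind of logging SPARK-22321 asks for, assuming the code lives inside Spark's Mesos module (where the private Logging trait is visible); the offerRescinded callback is part of the Mesos Scheduler API, but the helper name and exact insertion points are illustrative only, not taken from the actual patch.
{code}
// Hypothetical sketch only: extra logging around rescinded offers and declined
// offers, in the spirit of SPARK-22321. The surrounding class and the offer
// matching logic are assumed, not copied from Spark.
import org.apache.mesos.Protos.{Offer, OfferID}
import org.apache.mesos.SchedulerDriver
import org.apache.spark.internal.Logging

trait MesosOfferLoggingSketch extends Logging {
  // Mesos invokes this callback when a previously received offer is rescinded.
  def offerRescinded(driver: SchedulerDriver, offerId: OfferID): Unit = {
    logInfo(s"Offer ${offerId.getValue} was rescinded by the Mesos master")
  }

  // Called from the (assumed) offer-matching loop when constraints are not met.
  def logUnmetConstraints(offer: Offer, reason: String): Unit = {
    logInfo(s"Declining offer ${offer.getId.getValue} from agent " +
      s"${offer.getHostname}: unmet constraints ($reason)")
  }
}
{code}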
[jira] [Updated] (SPARK-18935) Use Mesos "Dynamic Reservation" resource for Spark
[ https://issues.apache.org/jira/browse/SPARK-18935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated SPARK-18935: Component/s: Mesos > Use Mesos "Dynamic Reservation" resource for Spark > -- > > Key: SPARK-18935 > URL: https://issues.apache.org/jira/browse/SPARK-18935 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.0.0, 2.0.1, 2.0.2 >Reporter: jackyoh > > I'm running spark on Apache Mesos > Please follow these steps to reproduce the issue: > 1. First, run Mesos resource reserve: > curl -i -d slaveId=c24d1cfb-79f3-4b07-9f8b-c7b19543a333-S0 -d > resources='[{"name":"cpus","type":"SCALAR","scalar":{"value":20},"role":"spark","reservation":{"principal":""}},{"name":"mem","type":"SCALAR","scalar":{"value":4096},"role":"spark","reservation":{"principal":""}}]' > -X POST http://192.168.1.118:5050/master/reserve > 2. Then run spark-submit command: > ./spark-submit --class org.apache.spark.examples.SparkPi --master > mesos://192.168.1.118:5050 --conf spark.mesos.role=spark > ../examples/jars/spark-examples_2.11-2.0.2.jar 1 > And the console will keep loging same warning message as shown below: > 16/12/19 22:33:28 WARN TaskSchedulerImpl: Initial job has not accepted any > resources; check your cluster UI to ensure that workers are registered and > have sufficient resources -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21492) Memory leak in SortMergeJoin
[ https://issues.apache.org/jira/browse/SPARK-21492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212522#comment-16212522 ] Bartosz Mścichowski commented on SPARK-21492: - Here's a script that exposes memory leak during SortMergeJoin in Spark 2.2.0, maybe it will be helpful. Memory leak happens when the following code is executed in spark-shell (a local one). {{--conf spark.sql.autoBroadcastJoinThreshold=-1}} may be needed to ensure proper join type. {noformat} import org.apache.spark.sql.types._ import org.apache.spark.sql._ import org.apache.spark.sql.functions._ val table1Key = "t1_key" val table1Value = "t1_value" val table2Key = "t2_key" val table2Value = "t2_value" val table1Schema = StructType(List( StructField(table1Key, IntegerType), StructField(table1Value, DoubleType) )); val table2Schema = StructType(List( StructField(table2Key, IntegerType), StructField(table2Value, DoubleType) )); val table1 = spark.sqlContext.createDataFrame( rowRDD = spark.sparkContext.parallelize(Seq( Row(1, 2.0) )), schema = table1Schema ); val table2 = spark.sqlContext.createDataFrame( rowRDD = spark.sparkContext.parallelize(Seq( Row(1, 4.0) )), schema = table2Schema ); val t1 = table1.repartition(col(table1Key)).groupBy(table1Key).avg() val t2 = table2.repartition(col(table2Key)).groupBy(table2Key).avg() val joinedDF = t1 join t2 where t1(table1Key) === t2(table2Key) joinedDF.explain() // == Physical Plan == // *SortMergeJoin [t1_key#2], [t2_key#9], Inner // :- *Sort [t1_key#2 ASC NULLS FIRST], false, 0 // : +- *HashAggregate(keys=[t1_key#2], functions=[avg(cast(t1_key#2 as bigint)), avg(t1_value#3)]) // : +- *HashAggregate(keys=[t1_key#2], functions=[partial_avg(cast(t1_key#2 as bigint)), partial_avg(t1_value#3)]) // :+- Exchange hashpartitioning(t1_key#2, 200) // : +- *Filter isnotnull(t1_key#2) // : +- Scan ExistingRDD[t1_key#2,t1_value#3] // +- *Sort [t2_key#9 ASC NULLS FIRST], false, 0 //+- *HashAggregate(keys=[t2_key#9], functions=[avg(cast(t2_key#9 as bigint)), avg(t2_value#10)]) // +- *HashAggregate(keys=[t2_key#9], functions=[partial_avg(cast(t2_key#9 as bigint)), partial_avg(t2_value#10)]) // +- Exchange hashpartitioning(t2_key#9, 200) // +- *Filter isnotnull(t2_key#9) //+- Scan ExistingRDD[t2_key#9,t2_value#10] joinedDF.show() // The 'show' action yields a lot of: // 17/10/19 08:17:39 WARN executor.Executor: Managed memory leak detected; size = 4194304 bytes, TID = 8 // 17/10/19 08:17:39 WARN executor.Executor: Managed memory leak detected; size = 4194304 bytes, TID = 9 // 17/10/19 08:17:39 WARN executor.Executor: Managed memory leak detected; size = 4194304 bytes, TID = 11 {noformat} > Memory leak in SortMergeJoin > > > Key: SPARK-21492 > URL: https://issues.apache.org/jira/browse/SPARK-21492 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Zhan Zhang > > In SortMergeJoin, if the iterator is not exhausted, there will be memory leak > caused by the Sort. The memory is not released until the task end, and cannot > be used by other operators causing performance drop or OOM. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
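Not part of the ticket, but one way to sidestep the leaking SortMergeJoin path while the fix is pending is to broadcast the smaller side explicitly. This is a hedged suggestion only, reusing the t1/t2 frames and key names from the reproduction script above, and it is only sensible when one side is small enough to broadcast.
{code}
// Hedged workaround sketch (not from the ticket): force a broadcast hash join so
// the sort-based merge join, and the sorter buffers it holds, are never used.
import org.apache.spark.sql.functions.broadcast

val broadcastJoinedDF = t1.join(broadcast(t2), t1(table1Key) === t2(table2Key))
broadcastJoinedDF.explain()  // should show BroadcastHashJoin instead of SortMergeJoin
broadcastJoinedDF.show()
{code}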
[jira] [Commented] (SPARK-22284) Code of class \"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection\" grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-22284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212513#comment-16212513 ] Ben commented on SPARK-22284: - [~kiszk], happy to help, and thanks for your assistance. So, if I understand correctly, there was a similar ticket and it was solved, but my case is more complicated, right? There are some cases in the files where the structures may indeed get very complex. Would this case also be solvable? > Code of class > \"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection\" > grows beyond 64 KB > -- > > Key: SPARK-22284 > URL: https://issues.apache.org/jira/browse/SPARK-22284 > Project: Spark > Issue Type: Bug > Components: Optimizer, PySpark, SQL >Affects Versions: 2.1.0 >Reporter: Ben > Attachments: 64KB Error.log > > > I am using pySpark 2.1.0 in a production environment, and trying to join two > DataFrames, one of which is very large and has complex nested structures. > Basically, I load both DataFrames and cache them. > Then, in the large DataFrame, I extract 3 nested values and save them as > direct columns. > Finally, I join on these three columns with the smaller DataFrame. > This would be a short code for this: > {code} > dataFrame.read..cache() > dataFrameSmall.read...cache() > dataFrame = dataFrame.selectExpr(['*','nested.Value1 AS > Value1','nested.Value2 AS Value2','nested.Value3 AS Value3']) > dataFrame = dataFrame.dropDuplicates().join(dataFrameSmall, > ['Value1','Value2',Value3']) > dataFrame.count() > {code} > And this is the error I get when it gets to the count(): > {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 11 in > stage 7.0 failed 4 times, most recent failure: Lost task 11.3 in stage 7.0 > (TID 11234, somehost.com, executor 10): > java.util.concurrent.ExecutionException: java.lang.Exception: failed to > compile: org.codehaus.janino.JaninoRuntimeException: Code of method > \"apply_1$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V\" > of class > \"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection\" > grows beyond 64 KB > {code} > I have seen many tickets with similar issues here, but no proper solution. > Most of the fixes are until Spark 2.1.0 so I don't know if running it on > Spark 2.2.0 would fix it. In any case I cannot change the version of Spark > since it is in production. > I have also tried setting > {code:java} > spark.sql.codegen.wholeStage=false > {code} > but still the same error. > The job worked well up to now, also with large datasets, but apparently this > batch got larger, and that is the only thing that changed. Is there any > workaround for this? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
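For reference, the whole-stage-codegen switch the reporter already tried can also be set at runtime on an existing session. The reporter's job is PySpark; the Scala form is shown here only for consistency with the other snippets in this digest, and it is a sketch, not a confirmed fix for this ticket.
{code}
// Disable whole-stage code generation for the current SparkSession (the same key
// the reporter set; in PySpark the call is spark.conf.set with the same arguments,
// or --conf spark.sql.codegen.wholeStage=false on spark-submit).
spark.conf.set("spark.sql.codegen.wholeStage", "false")
{code}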
[jira] [Commented] (SPARK-22320) ORC should support VectorUDT/MatrixUDT
[ https://issues.apache.org/jira/browse/SPARK-22320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212507#comment-16212507 ] Dongjoon Hyun commented on SPARK-22320: --- Thank you for reporting, [~podongfeng]. - Is this ORC only issue? What about Parquet? - Is this a regression in master branch? Could you check official Spark 2.2.0? > ORC should support VectorUDT/MatrixUDT > -- > > Key: SPARK-22320 > URL: https://issues.apache.org/jira/browse/SPARK-22320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: zhengruifeng > > I save dataframe containing vectors in ORC format, when I read it back, the > format is changed. > {code} > scala> import org.apache.spark.ml.linalg._ > import org.apache.spark.ml.linalg._ > scala> val data = Seq((1,Vectors.dense(1.0,2.0)), (2,Vectors.sparse(8, > Array(4), Array(1.0 > data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((1,[1.0,2.0]), > (2,(8,[4],[1.0]))) > scala> val df = data.toDF("i", "vec") > df: org.apache.spark.sql.DataFrame = [i: int, vec: vector] > scala> df.schema > res0: org.apache.spark.sql.types.StructType = > StructType(StructField(i,IntegerType,false), > StructField(vec,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true)) > scala> df.write.orc("/tmp/123") > scala> val df2 = spark.sqlContext.read.orc("/tmp/123") > df2: org.apache.spark.sql.DataFrame = [i: int, vec: struct size: int ... 2 more fields>] > scala> df2.schema > res3: org.apache.spark.sql.types.StructType = > StructType(StructField(i,IntegerType,true), > StructField(vec,StructType(StructField(type,ByteType,true), > StructField(size,IntegerType,true), > StructField(indices,ArrayType(IntegerType,true),true), > StructField(values,ArrayType(DoubleType,true),true)),true)) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
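A hedged sketch of the cross-check asked about in the comment above: round-trip the same dataframe through Parquet and compare the schema. It reuses the df built in the reproduction; the output path is arbitrary, and the expected result is stated as an expectation to verify, not a confirmed fact from this ticket.
{code}
// Sketch of the Parquet comparison suggested above, reusing df from the repro.
df.write.parquet("/tmp/vec_parquet_check")
val dfParquet = spark.read.parquet("/tmp/vec_parquet_check")
dfParquet.schema  // expectation to verify: the 'vector' UDT survives the Parquet
                  // round-trip, unlike the ORC round-trip shown in the report
{code}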
[jira] [Updated] (SPARK-22320) ORC should support VectorUDT/MatrixUDT
[ https://issues.apache.org/jira/browse/SPARK-22320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-22320: - Description: I save dataframe containing vectors in ORC format, when I read it back, the format is changed. {code} scala> import org.apache.spark.ml.linalg._ import org.apache.spark.ml.linalg._ scala> val data = Seq((1,Vectors.dense(1.0,2.0)), (2,Vectors.sparse(8, Array(4), Array(1.0 data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((1,[1.0,2.0]), (2,(8,[4],[1.0]))) scala> val df = data.toDF("i", "vec") df: org.apache.spark.sql.DataFrame = [i: int, vec: vector] scala> df.schema res0: org.apache.spark.sql.types.StructType = StructType(StructField(i,IntegerType,false), StructField(vec,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true)) scala> df.write.orc("/tmp/123") scala> val df2 = spark.sqlContext.read.orc("/tmp/123") df2: org.apache.spark.sql.DataFrame = [i: int, vec: struct] scala> df2.schema res3: org.apache.spark.sql.types.StructType = StructType(StructField(i,IntegerType,true), StructField(vec,StructType(StructField(type,ByteType,true), StructField(size,IntegerType,true), StructField(indices,ArrayType(IntegerType,true),true), StructField(values,ArrayType(DoubleType,true),true)),true)) {code} was: I save dataframe containing {{ml.Vector}}s in ORC format, when I read it back, the format is changed. {code} scala> import org.apache.spark.ml.linalg._ import org.apache.spark.ml.linalg._ scala> val data = Seq((1,Vectors.dense(1.0,2.0)), (2,Vectors.sparse(8, Array(4), Array(1.0 data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((1,[1.0,2.0]), (2,(8,[4],[1.0]))) scala> val df = data.toDF("i", "vec") df: org.apache.spark.sql.DataFrame = [i: int, vec: vector] scala> df.schema res0: org.apache.spark.sql.types.StructType = StructType(StructField(i,IntegerType,false), StructField(vec,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true)) scala> df.write.orc("/tmp/123") scala> val df2 = spark.sqlContext.read.orc("/tmp/123") df2: org.apache.spark.sql.DataFrame = [i: int, vec: struct] scala> df2.schema res3: org.apache.spark.sql.types.StructType = StructType(StructField(i,IntegerType,true), StructField(vec,StructType(StructField(type,ByteType,true), StructField(size,IntegerType,true), StructField(indices,ArrayType(IntegerType,true),true), StructField(values,ArrayType(DoubleType,true),true)),true)) {code} > ORC should support VectorUDT/MatrixUDT > -- > > Key: SPARK-22320 > URL: https://issues.apache.org/jira/browse/SPARK-22320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: zhengruifeng > > I save dataframe containing vectors in ORC format, when I read it back, the > format is changed. > {code} > scala> import org.apache.spark.ml.linalg._ > import org.apache.spark.ml.linalg._ > scala> val data = Seq((1,Vectors.dense(1.0,2.0)), (2,Vectors.sparse(8, > Array(4), Array(1.0 > data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((1,[1.0,2.0]), > (2,(8,[4],[1.0]))) > scala> val df = data.toDF("i", "vec") > df: org.apache.spark.sql.DataFrame = [i: int, vec: vector] > scala> df.schema > res0: org.apache.spark.sql.types.StructType = > StructType(StructField(i,IntegerType,false), > StructField(vec,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true)) > scala> df.write.orc("/tmp/123") > scala> val df2 = spark.sqlContext.read.orc("/tmp/123") > df2: org.apache.spark.sql.DataFrame = [i: int, vec: struct size: int ... 
2 more fields>] > scala> df2.schema > res3: org.apache.spark.sql.types.StructType = > StructType(StructField(i,IntegerType,true), > StructField(vec,StructType(StructField(type,ByteType,true), > StructField(size,IntegerType,true), > StructField(indices,ArrayType(IntegerType,true),true), > StructField(values,ArrayType(DoubleType,true),true)),true)) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22320) ORC should support VectorUDT/MatrixUDT
[ https://issues.apache.org/jira/browse/SPARK-22320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-22320: - Summary: ORC should support VectorUDT/MatrixUDT (was: ORC should support VectorUDF/MatrixUDT) > ORC should support VectorUDT/MatrixUDT > -- > > Key: SPARK-22320 > URL: https://issues.apache.org/jira/browse/SPARK-22320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: zhengruifeng > > I save dataframe containing {{ml.Vector}}s in ORC format, when I read it > back, the format is changed. > {code} > scala> import org.apache.spark.ml.linalg._ > import org.apache.spark.ml.linalg._ > scala> val data = Seq((1,Vectors.dense(1.0,2.0)), (2,Vectors.sparse(8, > Array(4), Array(1.0 > data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((1,[1.0,2.0]), > (2,(8,[4],[1.0]))) > scala> val df = data.toDF("i", "vec") > df: org.apache.spark.sql.DataFrame = [i: int, vec: vector] > scala> df.schema > res0: org.apache.spark.sql.types.StructType = > StructType(StructField(i,IntegerType,false), > StructField(vec,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true)) > scala> df.write.orc("/tmp/123") > scala> val df2 = spark.sqlContext.read.orc("/tmp/123") > df2: org.apache.spark.sql.DataFrame = [i: int, vec: struct size: int ... 2 more fields>] > scala> df2.schema > res3: org.apache.spark.sql.types.StructType = > StructType(StructField(i,IntegerType,true), > StructField(vec,StructType(StructField(type,ByteType,true), > StructField(size,IntegerType,true), > StructField(indices,ArrayType(IntegerType,true),true), > StructField(values,ArrayType(DoubleType,true),true)),true)) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22320) ORC should support VectorUDF/MatrixUDT
[ https://issues.apache.org/jira/browse/SPARK-22320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-22320: - Description: I save dataframe containing {{ml.Vector}}s in ORC format, when I read it back, the format is changed. {code} scala> import org.apache.spark.ml.linalg._ import org.apache.spark.ml.linalg._ scala> val data = Seq((1,Vectors.dense(1.0,2.0)), (2,Vectors.sparse(8, Array(4), Array(1.0 data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((1,[1.0,2.0]), (2,(8,[4],[1.0]))) scala> val df = data.toDF("i", "vec") df: org.apache.spark.sql.DataFrame = [i: int, vec: vector] scala> df.schema res0: org.apache.spark.sql.types.StructType = StructType(StructField(i,IntegerType,false), StructField(vec,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true)) scala> df.write.orc("/tmp/123") scala> val df2 = spark.sqlContext.read.orc("/tmp/123") df2: org.apache.spark.sql.DataFrame = [i: int, vec: struct] scala> df2.schema res3: org.apache.spark.sql.types.StructType = StructType(StructField(i,IntegerType,true), StructField(vec,StructType(StructField(type,ByteType,true), StructField(size,IntegerType,true), StructField(indices,ArrayType(IntegerType,true),true), StructField(values,ArrayType(DoubleType,true),true)),true)) {code} was: I save dataframe containing ((ml.Vector}}s in ORC format, when I read it back, the format is changed. {code} scala> import org.apache.spark.ml.linalg._ import org.apache.spark.ml.linalg._ scala> val data = Seq((1,Vectors.dense(1.0,2.0)), (2,Vectors.sparse(8, Array(4), Array(1.0 data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((1,[1.0,2.0]), (2,(8,[4],[1.0]))) scala> val df = data.toDF("i", "vec") df: org.apache.spark.sql.DataFrame = [i: int, vec: vector] scala> df.schema res0: org.apache.spark.sql.types.StructType = StructType(StructField(i,IntegerType,false), StructField(vec,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true)) scala> df.write.orc("/tmp/123") scala> val df2 = spark.sqlContext.read.orc("/tmp/123") df2: org.apache.spark.sql.DataFrame = [i: int, vec: struct] scala> df2.schema res3: org.apache.spark.sql.types.StructType = StructType(StructField(i,IntegerType,true), StructField(vec,StructType(StructField(type,ByteType,true), StructField(size,IntegerType,true), StructField(indices,ArrayType(IntegerType,true),true), StructField(values,ArrayType(DoubleType,true),true)),true)) {code} > ORC should support VectorUDF/MatrixUDT > -- > > Key: SPARK-22320 > URL: https://issues.apache.org/jira/browse/SPARK-22320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: zhengruifeng > > I save dataframe containing {{ml.Vector}}s in ORC format, when I read it > back, the format is changed. > {code} > scala> import org.apache.spark.ml.linalg._ > import org.apache.spark.ml.linalg._ > scala> val data = Seq((1,Vectors.dense(1.0,2.0)), (2,Vectors.sparse(8, > Array(4), Array(1.0 > data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((1,[1.0,2.0]), > (2,(8,[4],[1.0]))) > scala> val df = data.toDF("i", "vec") > df: org.apache.spark.sql.DataFrame = [i: int, vec: vector] > scala> df.schema > res0: org.apache.spark.sql.types.StructType = > StructType(StructField(i,IntegerType,false), > StructField(vec,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true)) > scala> df.write.orc("/tmp/123") > scala> val df2 = spark.sqlContext.read.orc("/tmp/123") > df2: org.apache.spark.sql.DataFrame = [i: int, vec: struct size: int ... 
2 more fields>] > scala> df2.schema > res3: org.apache.spark.sql.types.StructType = > StructType(StructField(i,IntegerType,true), > StructField(vec,StructType(StructField(type,ByteType,true), > StructField(size,IntegerType,true), > StructField(indices,ArrayType(IntegerType,true),true), > StructField(values,ArrayType(DoubleType,true),true)),true)) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22320) ORC should support VectorUDF/MatrixUDT
zhengruifeng created SPARK-22320: Summary: ORC should support VectorUDF/MatrixUDT Key: SPARK-22320 URL: https://issues.apache.org/jira/browse/SPARK-22320 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: zhengruifeng I save dataframe containing ((ml.Vector}}s in ORC format, when I read it back, the format is changed. {code} scala> import org.apache.spark.ml.linalg._ import org.apache.spark.ml.linalg._ scala> val data = Seq((1,Vectors.dense(1.0,2.0)), (2,Vectors.sparse(8, Array(4), Array(1.0 data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((1,[1.0,2.0]), (2,(8,[4],[1.0]))) scala> val df = data.toDF("i", "vec") df: org.apache.spark.sql.DataFrame = [i: int, vec: vector] scala> df.schema res0: org.apache.spark.sql.types.StructType = StructType(StructField(i,IntegerType,false), StructField(vec,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true)) scala> df.write.orc("/tmp/123") scala> val df2 = spark.sqlContext.read.orc("/tmp/123") df2: org.apache.spark.sql.DataFrame = [i: int, vec: struct] scala> df2.schema res3: org.apache.spark.sql.types.StructType = StructType(StructField(i,IntegerType,true), StructField(vec,StructType(StructField(type,ByteType,true), StructField(size,IntegerType,true), StructField(indices,ArrayType(IntegerType,true),true), StructField(values,ArrayType(DoubleType,true),true)),true)) {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22229) SPIP: RDMA Accelerated Shuffle Engine
[ https://issues.apache.org/jira/browse/SPARK-9?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212462#comment-16212462 ] Rob Vesse commented on SPARK-9: --- [~yuvaldeg], thanks for providing additional clarifications If a library is considered to be a standard part of the platform software then it should fall under the foundations standards [platform|http://www.apache.org/legal/resolved.html#platform] resolution that licensing of the platform does generally not affect the software running upon it. And if there are other Apache projects already depending on this that provides a precedent that Spark can rely on. > SPIP: RDMA Accelerated Shuffle Engine > - > > Key: SPARK-9 > URL: https://issues.apache.org/jira/browse/SPARK-9 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Yuval Degani > Attachments: > SPARK-9_SPIP_RDMA_Accelerated_Shuffle_Engine_Rev_1.0.pdf > > > An RDMA-accelerated shuffle engine can provide enormous performance benefits > to shuffle-intensive Spark jobs, as demonstrated in the “SparkRDMA” plugin > open-source project ([https://github.com/Mellanox/SparkRDMA]). > Using RDMA for shuffle improves CPU utilization significantly and reduces I/O > processing overhead by bypassing the kernel and networking stack as well as > avoiding memory copies entirely. Those valuable CPU cycles are then consumed > directly by the actual Spark workloads, and help reducing the job runtime > significantly. > This performance gain is demonstrated with both industry standard HiBench > TeraSort (shows 1.5x speedup in sorting) as well as shuffle intensive > customer applications. > SparkRDMA will be presented at Spark Summit 2017 in Dublin > ([https://spark-summit.org/eu-2017/events/accelerating-shuffle-a-tailor-made-rdma-solution-for-apache-spark/]). > Please see attached proposal document for more information. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22296) CodeGenerator - failed to compile when constructor has scala.collection.mutable.Seq vs. scala.collection.Seq
[ https://issues.apache.org/jira/browse/SPARK-22296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-22296. --- Resolution: Duplicate > CodeGenerator - failed to compile when constructor has > scala.collection.mutable.Seq vs. scala.collection.Seq > > > Key: SPARK-22296 > URL: https://issues.apache.org/jira/browse/SPARK-22296 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 2.1.0 >Reporter: Randy Tidd > > This is with Scala 2.11. > We have a case class that has a constructor with 85 args, the last two of > which are: > var chargesInst : > scala.collection.mutable.Seq[ChargeInstitutional] = > scala.collection.mutable.Seq.empty[ChargeInstitutional], > var chargesProf : > scala.collection.mutable.Seq[ChargeProfessional] = > scala.collection.mutable.Seq.empty[ChargeProfessional] > A mutable Seq in a the constructor of a case class is probably poor form but > Scala allows it. When we run this job we get this error: > build 17-Oct-2017 05:30:502017-10-17 09:30:50 [Executor task launch > worker-1] ERROR o.a.s.s.c.e.codegen.CodeGenerator - failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 8217, Column 70: No applicable constructor/method found for actual parameters > "java.lang.String, java.lang.String, long, java.lang.String, long, long, > long, java.lang.String, long, long, double, scala.Option, scala.Option, > java.lang.String, java.lang.String, long, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, int, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, long, long, long, long, long, > scala.Option, scala.Option, scala.Option, scala.Option, scala.Option, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, long, java.lang.String, int, double, > double, java.lang.String, java.lang.String, java.lang.String, long, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, long, > long, long, long, java.lang.String, com.xyz.xyz.xyz.domain.Patient, > com.xyz.xyz.xyz.domain.Physician, scala.collection.Seq, scala.collection.Seq, > java.lang.String, long, java.lang.String, int, int, boolean, boolean, > scala.collection.Seq, boolean, scala.collection.Seq, boolean, > scala.collection.Seq, scala.collection.Seq"; candidates are: > "com.xyz.xyz.xyz.domain.Account(java.lang.String, java.lang.String, long, > java.lang.String, long, long, long, java.lang.String, long, long, double, > scala.Option, scala.Option, java.lang.String, java.lang.String, long, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, int, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, long, > long, long, long, long, scala.Option, scala.Option, scala.Option, > scala.Option, scala.Option, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, long, > java.lang.String, int, double, double, java.lang.String, java.lang.String, > java.lang.String, long, java.lang.String, java.lang.String, java.lang.String, > java.lang.String, long, long, long, long, java.lang.String, > com.xyz.xyz.xyz.domain.Patient, 
com.xyz.xyz.xyz.domain.Physician, > scala.collection.Seq, scala.collection.Seq, java.lang.String, long, > java.lang.String, int, int, boolean, boolean, scala.collection.Seq, boolean, > scala.collection.Seq, boolean, scala.collection.mutable.Seq, > scala.collection.mutable.Seq)" > The relevant lines are: > build 17-Oct-2017 05:30:50/* 093 */ private scala.collection.Seq > argValue84; > build 17-Oct-2017 05:30:50/* 094 */ private scala.collection.Seq > argValue85; > and > build 17-Oct-2017 05:30:54/* 8217 */ final > com.xyz.xyz.xyz.domain.Account value1 = false ? null : new > com.xyz.xyz.xyz.domain.Account(argValue2, argValue3, argValue4, argValue5, > argValue6, argValue7, argValue8, argValue9, argValue10, argValue11, > argValue12, argValue13, argValue14, argValue15, argValue16, argValue17, > argValue18, argValue19, argValue20, argValue21, argValue22, argValue23, > argValue24, argValue25, argValue26, argValue27, argValue28, argValue29, > argValue30, argValue31, argValue32, argValue33, argValue34, argValue35, > argValue36, argValue37, argValue38, argValue39, argValue40, a
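A hypothetical, trimmed-down version of the pattern described in SPARK-22296: a case class whose last constructor parameters are scala.collection.mutable.Seq fields with defaults, used as a Dataset element type. The class and field names are illustrative, not from the reporter's domain model, and this small form is only a sketch of the shape the report describes, not a verified reproduction.
{code}
import scala.collection.mutable
import spark.implicits._   // pre-imported in spark-shell

case class Charge(amount: Double)
case class Account(
    id: String,
    name: String,
    chargesInst: mutable.Seq[Charge] = mutable.Seq.empty[Charge],
    chargesProf: mutable.Seq[Charge] = mutable.Seq.empty[Charge])

// Building a Dataset[Account] exercises the encoder/codegen path that, per the
// report, resolves the fields as scala.collection.Seq and then fails to find a
// matching constructor when compiling the generated projection.
val ds = Seq(Account("a1", "first")).toDS()
ds.show()
{code}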
[jira] [Resolved] (SPARK-22307) NOT condition working incorrectly
[ https://issues.apache.org/jira/browse/SPARK-22307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-22307. --- Resolution: Not A Problem > NOT condition working incorrectly > - > > Key: SPARK-22307 > URL: https://issues.apache.org/jira/browse/SPARK-22307 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0, 2.1.1 >Reporter: Andrey Yakovenko > Attachments: Catalog.json.gz > > > Suggest test case: table with x record filtered by expression expr returns y > records (< x), not(expr) does not returns x-y records. Work around: > when(expr, false).otherwise(true) is working fine. > {code} > val ctg = spark.sqlContext.read.json("/user/Catalog.json") > scala> ctg.printSchema > root > |-- Id: string (nullable = true) > |-- Name: string (nullable = true) > |-- Parent: struct (nullable = true) > ||-- Id: string (nullable = true) > ||-- Name: string (nullable = true) > ||-- Parent: struct (nullable = true) > |||-- Id: string (nullable = true) > |||-- Name: string (nullable = true) > |||-- Parent: struct (nullable = true) > ||||-- Id: string (nullable = true) > ||||-- Name: string (nullable = true) > ||||-- Parent: string (nullable = true) > ||||-- SKU: string (nullable = true) > |||-- SKU: string (nullable = true) > ||-- SKU: string (nullable = true) > |-- SKU: string (nullable = true) > val col1 = expr("Id IN ('13MXIIAA4', '13MXIBAA4')) OR (Parent.Id IN > ('13MXIIAA4', '13MXIBAA4'))) OR (Parent.Parent.Id IN ('13MXIIAA4', > '13MXIBAA4'))) OR (Parent.Parent.Parent.Id IN ('13MXIIAA4', '13MXIBAA4')))") > col1: org.apache.spark.sql.Column = Id IN (13MXIIAA4, 13MXIBAA4)) OR > (Parent.Id IN (13MXIIAA4, 13MXIBAA4))) OR (Parent.Parent.Id IN (13MXIIAA4, > 13MXIBAA4))) OR (Parent.Parent.Parent.Id IN (13MXIIAA4, 13MXIBAA4))) > scala> ctg.count > res5: Long = 623 > scala> ctg.filter(col1).count > res2: Long = 2 > scala> ctg.filter(not(col1)).count > res3: Long = 4 > scala> ctg.filter(when(col1, false).otherwise(true)).count > res4: Long = 621 > {code} > Table is hierarchy like structure and has a records with different number of > levels filled up. I have a suspicion that due to partly filled hierarchy > condition return null/undefined/failed/nan some times (neither true or false). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
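The "Not A Problem" resolution comes down to SQL three-valued logic: for rows where the hierarchy is only partly filled, the IN comparison on a null parent yields NULL, and both expr and not(expr) evaluate to NULL, so a filter drops the row in both cases. A minimal sketch of that behaviour follows; column names loosely mirror the report, but the data and counts are made up for illustration.
{code}
import org.apache.spark.sql.functions._
import spark.implicits._   // pre-imported in spark-shell

val rows = Seq(
  ("13MXIIAA4", Some("p1")),   // matches the IN list
  ("other-id",  Some("p2")),   // does not match
  ("leaf-id",   None)          // no parent: the comparison yields NULL
).toDF("Id", "ParentId")

val cond = col("Id").isin("13MXIIAA4", "13MXIBAA4") ||
  col("ParentId").isin("13MXIIAA4", "13MXIBAA4")

rows.filter(cond).count()                                // 1
rows.filter(not(cond)).count()                           // 1: the NULL row is dropped too
rows.filter(when(cond, false).otherwise(true)).count()   // 2: NULL treated as "not matched"
{code}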
[jira] [Updated] (SPARK-21551) pyspark's collect fails when getaddrinfo is too slow
[ https://issues.apache.org/jira/browse/SPARK-21551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-21551: -- Fix Version/s: 2.1.3 > pyspark's collect fails when getaddrinfo is too slow > > > Key: SPARK-21551 > URL: https://issues.apache.org/jira/browse/SPARK-21551 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 >Reporter: peay >Assignee: peay >Priority: Critical > Fix For: 2.1.3, 2.2.1, 2.3.0 > > > Pyspark's {{RDD.collect}}, as well as {{DataFrame.toLocalIterator}} and > {{DataFrame.collect}} all work by starting an ephemeral server in the driver, > and having Python connect to it to download the data. > All three are implemented along the lines of: > {code} > port = self._jdf.collectToPython() > return list(_load_from_socket(port, BatchedSerializer(PickleSerializer( > {code} > The server has **a hardcoded timeout of 3 seconds** > (https://github.com/apache/spark/blob/e26dac5feb02033f980b1e69c9b0ff50869b6f9e/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L695) > -- i.e., the Python process has 3 seconds to connect to it from the very > moment the driver server starts. > In general, that seems fine, but I have been encountering frequent timeouts > leading to `Exception: could not open socket`. > After investigating a bit, it turns out that {{_load_from_socket}} makes a > call to {{getaddrinfo}}: > {code} > def _load_from_socket(port, serializer): > sock = None > # Support for both IPv4 and IPv6. > # On most of IPv6-ready systems, IPv6 will take precedence. > for res in socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, > socket.SOCK_STREAM): >.. connect .. > {code} > I am not sure why, but while most such calls to {{getaddrinfo}} on my machine > only take a couple milliseconds, about 10% of them take between 2 and 10 > seconds, leading to about 10% of jobs failing. I don't think we can always > expect {{getaddrinfo}} to return instantly. More generally, Python may > sometimes pause for a couple seconds, which may not leave enough time for the > process to connect to the server. > Especially since the server timeout is hardcoded, I think it would be best to > set a rather generous value (15 seconds?) to avoid such situations. > A {{getaddrinfo}} specific fix could avoid doing it every single time, or do > it before starting up the driver server. > > cc SPARK-677 [~davies] -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22309) Remove unused param in `LDAModel.getTopicDistributionMethod`
[ https://issues.apache.org/jira/browse/SPARK-22309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-22309. --- Resolution: Duplicate This is actually just an addition to your change in SPARK-20930 > Remove unused param in `LDAModel.getTopicDistributionMethod` > > > Key: SPARK-22309 > URL: https://issues.apache.org/jira/browse/SPARK-22309 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: zhengruifeng >Priority: Trivial > > 1, Param {{sc: SparkContext}} in {{LocalLDAModel.getTopicDistributionMethod}} > is redundant. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22318) spark stream Kafka hang at JavaStreamingContext.start, no spark job create
[ https://issues.apache.org/jira/browse/SPARK-22318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-22318. --- Resolution: Invalid There's no bug reported here. It's normal that the job waits at start() forever. You need to narrow it down much further; maybe there is no data arriving. > spark stream Kafka hang at JavaStreamingContext.start, no spark job create > -- > > Key: SPARK-22318 > URL: https://issues.apache.org/jira/browse/SPARK-22318 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.1.0 > Environment: OS:Red Hat Enterprise Linux Server release 6.5 > JRE:Oracle 1.8.0.144-b01 > spark-streaming_2.11:2.1.0 > spark-streaming-kafka-0-10_2.11:2.1.0 >Reporter: iceriver322 > > spark stream Kafka jar submitted by spark-submit to standalone spark cluster, > and running well for a few days. But recently, we find that no new job > generated for the stream, we tried to restart the job, and restart the > cluster, the stream just stuck at JavaStreamingContext.start, and WAITING > (on object monitor). Thread dump : > 2017-10-19 16:44:23 > Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.144-b01 mixed mode): > "Attach Listener" #82 daemon prio=9 os_prio=0 tid=0x7f76f0002000 > nid=0x3f80 waiting on condition [0x] >java.lang.Thread.State: RUNNABLE >Locked ownable synchronizers: > - None > "SparkUI-JettyScheduler" #81 daemon prio=5 os_prio=0 tid=0x7f76ac002800 > nid=0x3d5c waiting on condition [0x7f7693bfa000] >java.lang.Thread.State: TIMED_WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0xfa19f940> (a > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) > at > java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078) > at > java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093) > at > java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809) > at > java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) >Locked ownable synchronizers: > - None > "shuffle-server-3-4" #35 daemon prio=5 os_prio=0 tid=0x7f76a0041800 > nid=0x3d34 runnable [0x7f76911e5000] >java.lang.Thread.State: RUNNABLE > at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) > at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269) > at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93) > at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86) > - locked <0xf8ea3be8> (a > io.netty.channel.nio.SelectedSelectionKeySet) > - locked <0xf8ee3600> (a > java.util.Collections$UnmodifiableSet) > - locked <0xf8ea3ae0> (a sun.nio.ch.EPollSelectorImpl) > at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97) > at io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:760) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:401) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140) > at > io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144) > at java.lang.Thread.run(Thread.java:748) >Locked ownable 
synchronizers: > - None > "shuffle-server-3-3" #34 daemon prio=5 os_prio=0 tid=0x7f76a0040800 > nid=0x3d33 runnable [0x7f76912e6000] >java.lang.Thread.State: RUNNABLE > at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) > at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269) > at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93) > at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86) > - locked <0xfc2747c0> (a > io.netty.channel.nio.SelectedSelectionKeySet) > - locked <0xfc2874c0> (a > java.util.Collections$UnmodifiableSet) > - locked <0xfc2746c8> (a sun.nio.ch.EPollSelectorImpl) > at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97) > at io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:760) > at io.netty.channel.nio.NioEventLoop.run(
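For comparison with the setup described in SPARK-22318, here is a generic sketch of the usual driver flow for the 0-10 direct Kafka stream: create the context and stream, call start(), then block in awaitTermination() for the lifetime of the application. Broker addresses, group id and topic names are placeholders; nothing here is taken from the reporter's job, and it does not by itself explain the reported hang.
{code}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

val conf = new SparkConf().setAppName("kafka-stream-sketch")
val ssc = new StreamingContext(conf, Seconds(10))

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker1:9092",          // placeholder
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "example-group",                  // placeholder
  "auto.offset.reset" -> "latest")

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("example-topic"), kafkaParams))

// A trivial per-batch action so batches actually produce output.
stream.map(record => record.value).count().print()

ssc.start()             // starts the streaming machinery
ssc.awaitTermination()  // the driver then blocks here for the lifetime of the app
{code}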