[jira] [Commented] (SPARK-21055) Support grouping__id
[ https://issues.apache.org/jira/browse/SPARK-21055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16213758#comment-16213758 ] Apache Spark commented on SPARK-21055: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/19546 > Support grouping__id > > > Key: SPARK-21055 > URL: https://issues.apache.org/jira/browse/SPARK-21055 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 > Environment: spark2.1.1 >Reporter: cen yuhai >Assignee: cen yuhai > Fix For: 2.3.0 > > > Currently, Spark doesn't support grouping__id; Spark provides another function, > grouping_id(), as a workaround. > If only grouping_id() is available, many scripts need to change, and supporting > grouping__id is very easy, why not? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
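A minimal illustration of the grouping_id() workaround mentioned in the description (my own sketch, not code from the ticket, assuming a Spark 2.x spark-shell; the sales table and its columns are made up):
{code}
// Hypothetical data, only to show the supported grouping_id() built-in that
// Hive-style grouping__id scripts would have to be rewritten against.
Seq(("US", "web", 10), ("US", "app", 5), ("DE", "web", 7))
  .toDF("country", "channel", "sales")
  .createOrReplaceTempView("sales")

spark.sql("""
  SELECT country, channel, grouping_id() AS gid, sum(sales) AS total
  FROM sales
  GROUP BY country, channel GROUPING SETS ((country, channel), (country), ())
""").show()
{code}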
[jira] [Resolved] (SPARK-22326) Remove unnecessary hashCode and equals methods
[ https://issues.apache.org/jira/browse/SPARK-22326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-22326. - Resolution: Fixed Assignee: Zhenhua Wang Fix Version/s: 2.3.0 > Remove unnecessary hashCode and equals methods > -- > > Key: SPARK-22326 > URL: https://issues.apache.org/jira/browse/SPARK-22326 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Zhenhua Wang >Assignee: Zhenhua Wang > Fix For: 2.3.0 > > > Plan equality should be computed by canonicalized, so we can remove > unnecessary hashCode and equals methods. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
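As a rough illustration of the ticket's premise (plan equality going through canonicalization), here is a small sketch of my own, assuming a Spark 2.x shell; it is not code from the ticket or the pull request:
{code}
// Two structurally equivalent queries built over the same base Dataset.
val base = spark.range(0, 100)
val q1 = base.where("id % 2 = 0")
val q2 = base.where("id % 2 = 0")

// sameResult is answered from the canonicalized forms of the plans, so plan
// nodes do not need hand-written hashCode/equals overrides for this to work.
q1.queryExecution.optimizedPlan.sameResult(q2.queryExecution.optimizedPlan)  // expected: true
{code}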
[jira] [Commented] (SPARK-21750) Use arrow 0.6.0
[ https://issues.apache.org/jira/browse/SPARK-21750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16213702#comment-16213702 ] Kazuaki Ishizaki commented on SPARK-21750: -- Thank you. Good to hear that. > Use arrow 0.6.0 > --- > > Key: SPARK-21750 > URL: https://issues.apache.org/jira/browse/SPARK-21750 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki >Priority: Minor > > Since [Arrow 0.6.0|http://arrow.apache.org/release/0.6.0.html] has been > released, use the latest one -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22326) Remove unnecessary hashCode and equals methods
[ https://issues.apache.org/jira/browse/SPARK-22326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22326: Assignee: Apache Spark > Remove unnecessary hashCode and equals methods > -- > > Key: SPARK-22326 > URL: https://issues.apache.org/jira/browse/SPARK-22326 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Zhenhua Wang >Assignee: Apache Spark > > Plan equality should be computed by canonicalized, so we can remove > unnecessary hashCode and equals methods. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22326) Remove unnecessary hashCode and equals methods
[ https://issues.apache.org/jira/browse/SPARK-22326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22326: Assignee: (was: Apache Spark) > Remove unnecessary hashCode and equals methods > -- > > Key: SPARK-22326 > URL: https://issues.apache.org/jira/browse/SPARK-22326 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Zhenhua Wang > > Plan equality should be computed by canonicalized, so we can remove > unnecessary hashCode and equals methods. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22326) Remove unnecessary hashCode and equals methods
Zhenhua Wang created SPARK-22326: Summary: Remove unnecessary hashCode and equals methods Key: SPARK-22326 URL: https://issues.apache.org/jira/browse/SPARK-22326 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: Zhenhua Wang Plan equality should be computed by canonicalized, so we can remove unnecessary hashCode and equals methods. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22326) Remove unnecessary hashCode and equals methods
[ https://issues.apache.org/jira/browse/SPARK-22326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16213666#comment-16213666 ] Apache Spark commented on SPARK-22326: -- User 'wzhfy' has created a pull request for this issue: https://github.com/apache/spark/pull/19539 > Remove unnecessary hashCode and equals methods > -- > > Key: SPARK-22326 > URL: https://issues.apache.org/jira/browse/SPARK-22326 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Zhenhua Wang > > Plan equality should be computed by canonicalized, so we can remove > unnecessary hashCode and equals methods. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22307) NOT condition working incorrectly
[ https://issues.apache.org/jira/browse/SPARK-22307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16213432#comment-16213432 ] kevin yu commented on SPARK-22307: -- It is correct behavior based on SQL standards, as Marco said. Your query has 623 records: 617 records are null, 2 records are 'true', and 4 records are 'false'. So not(col1) returns 4. > NOT condition working incorrectly > - > > Key: SPARK-22307 > URL: https://issues.apache.org/jira/browse/SPARK-22307 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0, 2.1.1 >Reporter: Andrey Yakovenko > Attachments: Catalog.json.gz > > > Suggested test case: a table with x records filtered by expression expr returns y > records (< x), but not(expr) does not return x-y records. Workaround: > when(expr, false).otherwise(true) is working fine. > {code} > val ctg = spark.sqlContext.read.json("/user/Catalog.json") > scala> ctg.printSchema > root > |-- Id: string (nullable = true) > |-- Name: string (nullable = true) > |-- Parent: struct (nullable = true) > ||-- Id: string (nullable = true) > ||-- Name: string (nullable = true) > ||-- Parent: struct (nullable = true) > |||-- Id: string (nullable = true) > |||-- Name: string (nullable = true) > |||-- Parent: struct (nullable = true) > ||||-- Id: string (nullable = true) > ||||-- Name: string (nullable = true) > ||||-- Parent: string (nullable = true) > ||||-- SKU: string (nullable = true) > |||-- SKU: string (nullable = true) > ||-- SKU: string (nullable = true) > |-- SKU: string (nullable = true) > val col1 = expr("Id IN ('13MXIIAA4', '13MXIBAA4')) OR (Parent.Id IN > ('13MXIIAA4', '13MXIBAA4'))) OR (Parent.Parent.Id IN ('13MXIIAA4', > '13MXIBAA4'))) OR (Parent.Parent.Parent.Id IN ('13MXIIAA4', '13MXIBAA4')))") > col1: org.apache.spark.sql.Column = Id IN (13MXIIAA4, 13MXIBAA4)) OR > (Parent.Id IN (13MXIIAA4, 13MXIBAA4))) OR (Parent.Parent.Id IN (13MXIIAA4, > 13MXIBAA4))) OR (Parent.Parent.Parent.Id IN (13MXIIAA4, 13MXIBAA4))) > scala> ctg.count > res5: Long = 623 > scala> ctg.filter(col1).count > res2: Long = 2 > scala> ctg.filter(not(col1)).count > res3: Long = 4 > scala> ctg.filter(when(col1, false).otherwise(true)).count > res4: Long = 621 > {code} > The table is a hierarchy-like structure and has records with different numbers of > levels filled in. I have a suspicion that, due to the partly filled hierarchy, the > condition sometimes returns null/undefined/failed/NaN (neither true nor false). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
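For context, a minimal, self-contained sketch of the three-valued logic behind these counts (my own example, not the reporter's Catalog.json; it assumes a Spark 2.x spark-shell and made-up column names and values):
{code}
// NULL results from a predicate are dropped by both filter(cond) and
// filter(not(cond)), which is why the two counts don't add up to the total.
import org.apache.spark.sql.functions.{expr, not, when}

val df = Seq((Some(1), "a"), (Some(2), "b"), (None, "c"), (None, "d")).toDF("id", "name")
val cond = expr("id IN (1)")                           // NULL id => predicate evaluates to NULL

df.filter(cond).count()                                // 1 (rows where cond is TRUE)
df.filter(not(cond)).count()                           // 1 (rows where cond is FALSE; NULL rows dropped)
df.filter(when(cond, false).otherwise(true)).count()   // 3 (NULL treated as "not matched")
{code}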
[jira] [Assigned] (SPARK-21929) Support `ALTER TABLE table_name ADD COLUMNS(..)` for ORC data source
[ https://issues.apache.org/jira/browse/SPARK-21929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21929: Assignee: (was: Apache Spark) > Support `ALTER TABLE table_name ADD COLUMNS(..)` for ORC data source > > > Key: SPARK-21929 > URL: https://issues.apache.org/jira/browse/SPARK-21929 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Dongjoon Hyun > > SPARK-19261 implemented `ADD COLUMNS` at Spark 2.2, but ORC data source is > not supported due to its limit. > {code} > scala> sql("CREATE TABLE tab (c1 int, c2 int, c3 int) USING ORC PARTITIONED > BY (c3)") > scala> sql("ALTER TABLE tab ADD COLUMNS (c4 int)") > org.apache.spark.sql.AnalysisException: > ALTER ADD COLUMNS does not support datasource table with type ORC. > You must drop and re-create the table for adding the new columns. Tables: > `tab`; > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21929) Support `ALTER TABLE table_name ADD COLUMNS(..)` for ORC data source
[ https://issues.apache.org/jira/browse/SPARK-21929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16213272#comment-16213272 ] Apache Spark commented on SPARK-21929: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/19545 > Support `ALTER TABLE table_name ADD COLUMNS(..)` for ORC data source > > > Key: SPARK-21929 > URL: https://issues.apache.org/jira/browse/SPARK-21929 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Dongjoon Hyun > > SPARK-19261 implemented `ADD COLUMNS` at Spark 2.2, but ORC data source is > not supported due to its limit. > {code} > scala> sql("CREATE TABLE tab (c1 int, c2 int, c3 int) USING ORC PARTITIONED > BY (c3)") > scala> sql("ALTER TABLE tab ADD COLUMNS (c4 int)") > org.apache.spark.sql.AnalysisException: > ALTER ADD COLUMNS does not support datasource table with type ORC. > You must drop and re-create the table for adding the new columns. Tables: > `tab`; > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21929) Support `ALTER TABLE table_name ADD COLUMNS(..)` for ORC data source
[ https://issues.apache.org/jira/browse/SPARK-21929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21929: Assignee: Apache Spark > Support `ALTER TABLE table_name ADD COLUMNS(..)` for ORC data source > > > Key: SPARK-21929 > URL: https://issues.apache.org/jira/browse/SPARK-21929 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark > > SPARK-19261 implemented `ADD COLUMNS` at Spark 2.2, but ORC data source is > not supported due to its limit. > {code} > scala> sql("CREATE TABLE tab (c1 int, c2 int, c3 int) USING ORC PARTITIONED > BY (c3)") > scala> sql("ALTER TABLE tab ADD COLUMNS (c4 int)") > org.apache.spark.sql.AnalysisException: > ALTER ADD COLUMNS does not support datasource table with type ORC. > You must drop and re-create the table for adding the new columns. Tables: > `tab`; > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
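For contrast with the quoted failure, a small sketch (my own, assuming a Spark 2.2 spark-shell) of the same DDL against a data source where SPARK-19261 already supports ADD COLUMNS; the table names are made up:
{code}
// Works in 2.2 for Parquet-backed data source tables (SPARK-19261).
sql("CREATE TABLE tab_parquet (c1 INT, c2 INT, c3 INT) USING PARQUET PARTITIONED BY (c3)")
sql("ALTER TABLE tab_parquet ADD COLUMNS (c4 INT)")

// Fails in 2.2 for ORC with the AnalysisException quoted above; this ticket
// tracks lifting that restriction.
sql("CREATE TABLE tab_orc (c1 INT, c2 INT, c3 INT) USING ORC PARTITIONED BY (c3)")
sql("ALTER TABLE tab_orc ADD COLUMNS (c4 INT)")
{code}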
[jira] [Comment Edited] (SPARK-11844) can not read class org.apache.parquet.format.PageHeader: don't know what type: 13
[ https://issues.apache.org/jira/browse/SPARK-11844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16213216#comment-16213216 ] Cosmin Lehene edited comment on SPARK-11844 at 10/20/17 8:59 PM: - I'm seeing these with spark-2.2.1-SNAPSHOT {noformat} java.io.IOException: can not read class org.apache.parquet.format.PageHeader: don't know what type: 14 at org.apache.parquet.format.Util.read(Util.java:216) at org.apache.parquet.format.Util.readPageHeader(Util.java:65) at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readPageHeader(ParquetFileReader.java:835) at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:849) at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:700) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:270) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:225) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137) at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:166) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at org.apache.spark.scheduler.Task.run(Task.scala:108) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException: don't know what type: 14 at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.getTType(TCompactProtocol.java:806) at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.readFieldBegin(TCompactProtocol.java:500) at org.apache.parquet.format.InterningProtocol.readFieldBegin(InterningProtocol.java:158) at org.apache.parquet.format.PageHeader.read(PageHeader.java:828) at org.apache.parquet.format.Util.read(Util.java:213) ... 24 more {noformat} Also they seem accompanied by {noformat} java.io.IOException: can not read class org.apache.parquet.format.PageHeader: Required field 'uncompressed_page_size' was not found in serialized data! 
Struct: PageHeader(type:null, uncompressed_page_size:0, compressed_page_size:0) at org.apache.parquet.format.Util.read(Util.java:216) at org.apache.parquet.format.Util.readPageHeader(Util.java:65) at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readPageHeader(ParquetFileReader.java:835) at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:849) at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:700) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:270) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:225) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137) at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105) at org.apache.spark.sql.execution.datasources.FileSc
[jira] [Commented] (SPARK-11844) can not read class org.apache.parquet.format.PageHeader: don't know what type: 13
[ https://issues.apache.org/jira/browse/SPARK-11844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16213216#comment-16213216 ] Cosmin Lehene commented on SPARK-11844: --- I'm seeing these with spark-2.2.1-SNAPSHOT {noformat} java.io.IOException: can not read class org.apache.parquet.format.PageHeader: don't know what type: 14 at org.apache.parquet.format.Util.read(Util.java:216) at org.apache.parquet.format.Util.readPageHeader(Util.java:65) at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readPageHeader(ParquetFileReader.java:835) at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:849) at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:700) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:270) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:225) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137) at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:166) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at org.apache.spark.scheduler.Task.run(Task.scala:108) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException: don't know what type: 14 at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.getTType(TCompactProtocol.java:806) at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.readFieldBegin(TCompactProtocol.java:500) at org.apache.parquet.format.InterningProtocol.readFieldBegin(InterningProtocol.java:158) at org.apache.parquet.format.PageHeader.read(PageHeader.java:828) at org.apache.parquet.format.Util.read(Util.java:213) ... 
24 more {noformat} > can not read class org.apache.parquet.format.PageHeader: don't know what > type: 13 > - > > Key: SPARK-11844 > URL: https://issues.apache.org/jira/browse/SPARK-11844 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Priority: Minor > > I got the following error once when I was running a query > {code} > java.io.IOException: can not read class org.apache.parquet.format.PageHeader: > don't know what type: 13 > at org.apache.parquet.format.Util.read(Util.java:216) > at org.apache.parquet.format.Util.readPageHeader(Util.java:65) > at > org.apache.parquet.hadoop.ParquetFileReader$Chunk.readPageHeader(ParquetFileReader.java:534) > at > org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:546) > at > org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:496) > at > org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:127) > at > org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:208) > at > org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201) > at > org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:168) > a
[jira] [Comment Edited] (SPARK-22323) Design doc for different types of pandas_udf
[ https://issues.apache.org/jira/browse/SPARK-22323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16213182#comment-16213182 ] Bryan Cutler edited comment on SPARK-22323 at 10/20/17 8:30 PM: Is this meant to be a user doc? was (Author: bryanc): I this meant to be a user doc? > Design doc for different types of pandas_udf > > > Key: SPARK-22323 > URL: https://issues.apache.org/jira/browse/SPARK-22323 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Li Jin > Fix For: 2.3.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22323) Design doc for different types of pandas_udf
[ https://issues.apache.org/jira/browse/SPARK-22323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16213182#comment-16213182 ] Bryan Cutler commented on SPARK-22323: -- Is this meant to be a user doc? > Design doc for different types of pandas_udf > > > Key: SPARK-22323 > URL: https://issues.apache.org/jira/browse/SPARK-22323 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Li Jin > Fix For: 2.3.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22323) Design doc for different types of pandas_udf
[ https://issues.apache.org/jira/browse/SPARK-22323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16213167#comment-16213167 ] Li Jin commented on SPARK-22323: Bryan I think Spark-1 is more about types between spark/arrow, like time stamp caveats. This is only meant to cover different types of pandas udf. > Design doc for different types of pandas_udf > > > Key: SPARK-22323 > URL: https://issues.apache.org/jira/browse/SPARK-22323 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Li Jin > Fix For: 2.3.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22323) Design doc for different types of pandas_udf
[ https://issues.apache.org/jira/browse/SPARK-22323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16213151#comment-16213151 ] Bryan Cutler commented on SPARK-22323: -- Should I close SPARK-1 since it looks like the docs will be covered here? > Design doc for different types of pandas_udf > > > Key: SPARK-22323 > URL: https://issues.apache.org/jira/browse/SPARK-22323 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Li Jin > Fix For: 2.3.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Deleted] (SPARK-22325) SPARK_TESTING env variable breaking non-spark builds on amplab jenkins
[ https://issues.apache.org/jira/browse/SPARK-22325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin deleted SPARK-22325: > SPARK_TESTING env variable breaking non-spark builds on amplab jenkins > -- > > Key: SPARK-22325 > URL: https://issues.apache.org/jira/browse/SPARK-22325 > Project: Spark > Issue Type: Bug > Environment: riselab jenkins, all workers (ubuntu & centos) >Reporter: shane knapp >Priority: Critical > > in the riselab jenkins master config, the SPARK_TESTING environment variable > is set to 1 and applied to all workers. > see: > https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/testing-foo/9/console > (the 'echo 1' is actually 'echo $SPARK_TESTING') > and: > https://amplab.cs.berkeley.edu/jenkins/job/testing-foo/10/injectedEnvVars/ > this is problematic, as some of our lab builds are attempting to run pyspark > as part of the build process, and the hard-coded checks for SPARK_TESTING in > the setup scripts are causing hard failures. > see: > https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/2440/HADOOP_VERSION=2.6.2,SCALAVER=2.11,SPARK_VERSION=2.2.0,label=centos/consoleFull > i would strongly suggest that we do the following: > * remove the SPARK_TESTING environment variable declaration in the jenkins > config > * add the environment variable to each spark build config in github: > https://github.com/databricks/spark-jenkins-configurations/ > * add the environment variable to SparkPullRequstBuilder and > NewSparkPullRequestBuilder -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22325) SPARK_TESTING env variable breaking non-spark builds on amplab jenkins
[ https://issues.apache.org/jira/browse/SPARK-22325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shane knapp updated SPARK-22325: Description: in the riselab jenkins master config, the SPARK_TESTING environment variable is set to 1 and applied to all workers. see: https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/testing-foo/9/console (the 'echo 1' is actually 'echo $SPARK_TESTING') and: https://amplab.cs.berkeley.edu/jenkins/job/testing-foo/10/injectedEnvVars/ this is problematic, as some of our lab builds are attempting to run pyspark as part of the build process, and the hard-coded checks for SPARK_TESTING in the setup scripts are causing hard failures. see: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/2440/HADOOP_VERSION=2.6.2,SCALAVER=2.11,SPARK_VERSION=2.2.0,label=centos/consoleFull i would strongly suggest that we do the following: * remove the SPARK_TESTING environment variable declaration in the jenkins config * add the environment variable to each spark build config in github: https://github.com/databricks/spark-jenkins-configurations/ * add the environment variable to SparkPullRequstBuilder and NewSparkPullRequestBuilder was: in the riselab jenkins master config, the SPARK_TESTING environment variable is set to 1 and applied to all workers. see: https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/testing-foo/9/console (the 'echo 1' is actually 'echo $SPARK_TESTING') this is problematic, as some of our lab builds are attempting to run pyspark as part of the build process, and the hard-coded checks for SPARK_TESTING in the setup scripts are causing hard failures. see: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/2440/HADOOP_VERSION=2.6.2,SCALAVER=2.11,SPARK_VERSION=2.2.0,label=centos/consoleFull i would strongly suggest that we do the following: * remove the SPARK_TESTING environment variable declaration in the jenkins config * add the environment variable to each spark build config in github: https://github.com/databricks/spark-jenkins-configurations/ * add the environment variable to SparkPullRequstBuilder and NewSparkPullRequestBuilder > SPARK_TESTING env variable breaking non-spark builds on amplab jenkins > -- > > Key: SPARK-22325 > URL: https://issues.apache.org/jira/browse/SPARK-22325 > Project: Spark > Issue Type: Bug > Components: Build, Project Infra >Affects Versions: 2.2.0 > Environment: riselab jenkins, all workers (ubuntu & centos) >Reporter: shane knapp >Priority: Critical > > in the riselab jenkins master config, the SPARK_TESTING environment variable > is set to 1 and applied to all workers. > see: > https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/testing-foo/9/console > (the 'echo 1' is actually 'echo $SPARK_TESTING') > and: > https://amplab.cs.berkeley.edu/jenkins/job/testing-foo/10/injectedEnvVars/ > this is problematic, as some of our lab builds are attempting to run pyspark > as part of the build process, and the hard-coded checks for SPARK_TESTING in > the setup scripts are causing hard failures. 
> see: > https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/2440/HADOOP_VERSION=2.6.2,SCALAVER=2.11,SPARK_VERSION=2.2.0,label=centos/consoleFull > i would strongly suggest that we do the following: > * remove the SPARK_TESTING environment variable declaration in the jenkins > config > * add the environment variable to each spark build config in github: > https://github.com/databricks/spark-jenkins-configurations/ > * add the environment variable to SparkPullRequstBuilder and > NewSparkPullRequestBuilder -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22325) SPARK_TESTING env variable breaking non-spark builds on amplab jenkins
shane knapp created SPARK-22325: --- Summary: SPARK_TESTING env variable breaking non-spark builds on amplab jenkins Key: SPARK-22325 URL: https://issues.apache.org/jira/browse/SPARK-22325 Project: Spark Issue Type: Bug Components: Build, Project Infra Affects Versions: 2.2.0 Environment: riselab jenkins, all workers (ubuntu & centos) Reporter: shane knapp Priority: Critical in the riselab jenkins master config, the SPARK_TESTING environment variable is set to 1 and applied to all workers. see: https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/testing-foo/9/console (the 'echo 1' is actually 'echo $SPARK_TESTING') this is problematic, as some of our lab builds are attempting to run pyspark as part of the build process, and the hard-coded checks for SPARK_TESTING in the setup scripts are causing hard failures. see: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/2440/HADOOP_VERSION=2.6.2,SCALAVER=2.11,SPARK_VERSION=2.2.0,label=centos/consoleFull i would strongly suggest that we do the following: * remove the SPARK_TESTING environment variable declaration in the jenkins config * add the environment variable to each spark build config in github: https://github.com/databricks/spark-jenkins-configurations/ * add the environment variable to SparkPullRequstBuilder and NewSparkPullRequestBuilder -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-21750) Use arrow 0.6.0
[ https://issues.apache.org/jira/browse/SPARK-21750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-21750. - SPARK-22324 supersedes this. > Use arrow 0.6.0 > --- > > Key: SPARK-21750 > URL: https://issues.apache.org/jira/browse/SPARK-21750 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki >Priority: Minor > > Since [Arrow 0.6.0|http://arrow.apache.org/release/0.6.0.html] has been > released, use the latest one -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21750) Use arrow 0.6.0
[ https://issues.apache.org/jira/browse/SPARK-21750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16213016#comment-16213016 ] Dongjoon Hyun commented on SPARK-21750: --- Great! Thanks, [~bryanc]. I'll monitor SPARK-22324. > Use arrow 0.6.0 > --- > > Key: SPARK-21750 > URL: https://issues.apache.org/jira/browse/SPARK-21750 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki >Priority: Minor > > Since [Arrow 0.6.0|http://arrow.apache.org/release/0.6.0.html] has been > released, use the latest one -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-22296) CodeGenerator - failed to compile when constructor has scala.collection.mutable.Seq vs. scala.collection.Seq
[ https://issues.apache.org/jira/browse/SPARK-22296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16213015#comment-16213015 ] Randy Tidd edited comment on SPARK-22296 at 10/20/17 6:34 PM: -- Thank you I was just installing 2.2.0 and related components to try this but you beat me to it. Glad to hear it's fixed. Note that the example provided above by Liang-Chi Hsieh does not exhibit the problem in 2.1.0 or 2.1.1. was (Author: tiddman): Thank you I was just installing 2.2.0 and related components to try this but you beat me to it. Glad to hear it's fixed. > CodeGenerator - failed to compile when constructor has > scala.collection.mutable.Seq vs. scala.collection.Seq > > > Key: SPARK-22296 > URL: https://issues.apache.org/jira/browse/SPARK-22296 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 2.1.0 >Reporter: Randy Tidd > > This is with Scala 2.11. > We have a case class that has a constructor with 85 args, the last two of > which are: > var chargesInst : > scala.collection.mutable.Seq[ChargeInstitutional] = > scala.collection.mutable.Seq.empty[ChargeInstitutional], > var chargesProf : > scala.collection.mutable.Seq[ChargeProfessional] = > scala.collection.mutable.Seq.empty[ChargeProfessional] > A mutable Seq in a the constructor of a case class is probably poor form but > Scala allows it. When we run this job we get this error: > build 17-Oct-2017 05:30:502017-10-17 09:30:50 [Executor task launch > worker-1] ERROR o.a.s.s.c.e.codegen.CodeGenerator - failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 8217, Column 70: No applicable constructor/method found for actual parameters > "java.lang.String, java.lang.String, long, java.lang.String, long, long, > long, java.lang.String, long, long, double, scala.Option, scala.Option, > java.lang.String, java.lang.String, long, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, int, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, long, long, long, long, long, > scala.Option, scala.Option, scala.Option, scala.Option, scala.Option, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, long, java.lang.String, int, double, > double, java.lang.String, java.lang.String, java.lang.String, long, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, long, > long, long, long, java.lang.String, com.xyz.xyz.xyz.domain.Patient, > com.xyz.xyz.xyz.domain.Physician, scala.collection.Seq, scala.collection.Seq, > java.lang.String, long, java.lang.String, int, int, boolean, boolean, > scala.collection.Seq, boolean, scala.collection.Seq, boolean, > scala.collection.Seq, scala.collection.Seq"; candidates are: > "com.xyz.xyz.xyz.domain.Account(java.lang.String, java.lang.String, long, > java.lang.String, long, long, long, java.lang.String, long, long, double, > scala.Option, scala.Option, java.lang.String, java.lang.String, long, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, int, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, long, > long, long, long, long, scala.Option, scala.Option, 
scala.Option, > scala.Option, scala.Option, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, long, > java.lang.String, int, double, double, java.lang.String, java.lang.String, > java.lang.String, long, java.lang.String, java.lang.String, java.lang.String, > java.lang.String, long, long, long, long, java.lang.String, > com.xyz.xyz.xyz.domain.Patient, com.xyz.xyz.xyz.domain.Physician, > scala.collection.Seq, scala.collection.Seq, java.lang.String, long, > java.lang.String, int, int, boolean, boolean, scala.collection.Seq, boolean, > scala.collection.Seq, boolean, scala.collection.mutable.Seq, > scala.collection.mutable.Seq)" > The relevant lines are: > build 17-Oct-2017 05:30:50/* 093 */ private scala.collection.Seq > argValue84; > build 17-Oct-2017 05:30:50/* 094 */ private scala.collection.Seq > argValue85; > and > build 17-Oct-2017 05:30:54/* 8217 */ final > com.xyz.xyz.xyz.domain.Account value1 = false ? null : new > com.xyz.xyz.xyz.domain.Account(argValue2, argValue3
[jira] [Commented] (SPARK-22296) CodeGenerator - failed to compile when constructor has scala.collection.mutable.Seq vs. scala.collection.Seq
[ https://issues.apache.org/jira/browse/SPARK-22296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16213015#comment-16213015 ] Randy Tidd commented on SPARK-22296: Thank you I was just installing 2.2.0 and related components to try this but you beat me to it. Glad to hear it's fixed. > CodeGenerator - failed to compile when constructor has > scala.collection.mutable.Seq vs. scala.collection.Seq > > > Key: SPARK-22296 > URL: https://issues.apache.org/jira/browse/SPARK-22296 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 2.1.0 >Reporter: Randy Tidd > > This is with Scala 2.11. > We have a case class that has a constructor with 85 args, the last two of > which are: > var chargesInst : > scala.collection.mutable.Seq[ChargeInstitutional] = > scala.collection.mutable.Seq.empty[ChargeInstitutional], > var chargesProf : > scala.collection.mutable.Seq[ChargeProfessional] = > scala.collection.mutable.Seq.empty[ChargeProfessional] > A mutable Seq in a the constructor of a case class is probably poor form but > Scala allows it. When we run this job we get this error: > build 17-Oct-2017 05:30:502017-10-17 09:30:50 [Executor task launch > worker-1] ERROR o.a.s.s.c.e.codegen.CodeGenerator - failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 8217, Column 70: No applicable constructor/method found for actual parameters > "java.lang.String, java.lang.String, long, java.lang.String, long, long, > long, java.lang.String, long, long, double, scala.Option, scala.Option, > java.lang.String, java.lang.String, long, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, int, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, long, long, long, long, long, > scala.Option, scala.Option, scala.Option, scala.Option, scala.Option, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, long, java.lang.String, int, double, > double, java.lang.String, java.lang.String, java.lang.String, long, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, long, > long, long, long, java.lang.String, com.xyz.xyz.xyz.domain.Patient, > com.xyz.xyz.xyz.domain.Physician, scala.collection.Seq, scala.collection.Seq, > java.lang.String, long, java.lang.String, int, int, boolean, boolean, > scala.collection.Seq, boolean, scala.collection.Seq, boolean, > scala.collection.Seq, scala.collection.Seq"; candidates are: > "com.xyz.xyz.xyz.domain.Account(java.lang.String, java.lang.String, long, > java.lang.String, long, long, long, java.lang.String, long, long, double, > scala.Option, scala.Option, java.lang.String, java.lang.String, long, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, int, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, long, > long, long, long, long, scala.Option, scala.Option, scala.Option, > scala.Option, scala.Option, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, long, > java.lang.String, int, double, double, java.lang.String, java.lang.String, > java.lang.String, long, java.lang.String, 
java.lang.String, java.lang.String, > java.lang.String, long, long, long, long, java.lang.String, > com.xyz.xyz.xyz.domain.Patient, com.xyz.xyz.xyz.domain.Physician, > scala.collection.Seq, scala.collection.Seq, java.lang.String, long, > java.lang.String, int, int, boolean, boolean, scala.collection.Seq, boolean, > scala.collection.Seq, boolean, scala.collection.mutable.Seq, > scala.collection.mutable.Seq)" > The relevant lines are: > build 17-Oct-2017 05:30:50/* 093 */ private scala.collection.Seq > argValue84; > build 17-Oct-2017 05:30:50/* 094 */ private scala.collection.Seq > argValue85; > and > build 17-Oct-2017 05:30:54/* 8217 */ final > com.xyz.xyz.xyz.domain.Account value1 = false ? null : new > com.xyz.xyz.xyz.domain.Account(argValue2, argValue3, argValue4, argValue5, > argValue6, argValue7, argValue8, argValue9, argValue10, argValue11, > argValue12, argValue13, argValue14, argValue15, argValue16, argValue17, > argValue18, argValue19, argValue20, argValue21, argValue22, argValue23, > argValue24, argValue25, argValue26, argValue27, argValue
[jira] [Commented] (SPARK-21750) Use arrow 0.6.0
[ https://issues.apache.org/jira/browse/SPARK-21750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16213012#comment-16213012 ] Bryan Cutler commented on SPARK-21750: -- Thanks [~dongjoon], I opened SPARK-22324 under the Arrow epic so we can start discussing an upgrade to 0.8.0 which is slated for early Nov. > Use arrow 0.6.0 > --- > > Key: SPARK-21750 > URL: https://issues.apache.org/jira/browse/SPARK-21750 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki >Priority: Minor > > Since [Arrow 0.6.0|http://arrow.apache.org/release/0.6.0.html] has been > released, use the latest one -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22324) Upgrade Arrow to version 0.8.0
Bryan Cutler created SPARK-22324: Summary: Upgrade Arrow to version 0.8.0 Key: SPARK-22324 URL: https://issues.apache.org/jira/browse/SPARK-22324 Project: Spark Issue Type: Sub-task Components: PySpark, SQL Affects Versions: 2.3.0 Reporter: Bryan Cutler Arrow version 0.8.0 is slated for release in early November, but I'd like to start discussing now to help get all the work that's being done synced up. Along with upgrading the Arrow Java artifacts, pyarrow on our Jenkins test envs will need to be upgraded as well, which will take a fair amount of work and planning. One topic I'd like to discuss is whether pyarrow should be an installation requirement for pyspark, i.e. when a user pip installs pyspark, it will also install pyarrow. If not, then is there a minimum version that needs to be supported? We currently have 0.4.1 installed on Jenkins. There are a number of improvements and cleanups in the current code that can happen depending on what we decide (I'll link them all here later, but off the top of my head): * Decimal bug fix and improved support * Improved internal casting between pyarrow and pandas (can clean up some workarounds) * Better type checking when converting Spark types to Arrow * Timestamp conversion to microseconds (for Spark internal format) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22296) CodeGenerator - failed to compile when constructor has scala.collection.mutable.Seq vs. scala.collection.Seq
[ https://issues.apache.org/jira/browse/SPARK-22296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212999#comment-16212999 ] Dongjoon Hyun commented on SPARK-22296: --- [~tiddman]. Please refer the following in 2.2. This is resolved as mentioned in the above comments. {code} scala> case class Foo1(x: Int, s: String, seq: scala.collection.mutable.Seq[Int] = scala.collection.mutable.Seq.empty[Int]) defined class Foo1 scala> val ds1 = Seq(Foo1(1, "a"), Foo1(2, "b")).toDS ds1: org.apache.spark.sql.Dataset[Foo1] = [x: int, s: string ... 1 more field] scala> case class Foo2(x: Int, s: String) defined class Foo2 scala> val ds2 = Seq(Foo2(1, "aa"), Foo2(3, "cc")).toDS ds2: org.apache.spark.sql.Dataset[Foo2] = [x: int, s: string] scala> ds1.joinWith(ds2, ds1.col("x") === ds2.col("x")).collect() res0: Array[(Foo1, Foo2)] = Array((Foo1(1,a,ArrayBuffer()),Foo2(1,aa))) scala> spark.version res1: String = 2.2.0 {code} > CodeGenerator - failed to compile when constructor has > scala.collection.mutable.Seq vs. scala.collection.Seq > > > Key: SPARK-22296 > URL: https://issues.apache.org/jira/browse/SPARK-22296 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 2.1.0 >Reporter: Randy Tidd > > This is with Scala 2.11. > We have a case class that has a constructor with 85 args, the last two of > which are: > var chargesInst : > scala.collection.mutable.Seq[ChargeInstitutional] = > scala.collection.mutable.Seq.empty[ChargeInstitutional], > var chargesProf : > scala.collection.mutable.Seq[ChargeProfessional] = > scala.collection.mutable.Seq.empty[ChargeProfessional] > A mutable Seq in a the constructor of a case class is probably poor form but > Scala allows it. When we run this job we get this error: > build 17-Oct-2017 05:30:502017-10-17 09:30:50 [Executor task launch > worker-1] ERROR o.a.s.s.c.e.codegen.CodeGenerator - failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 8217, Column 70: No applicable constructor/method found for actual parameters > "java.lang.String, java.lang.String, long, java.lang.String, long, long, > long, java.lang.String, long, long, double, scala.Option, scala.Option, > java.lang.String, java.lang.String, long, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, int, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, long, long, long, long, long, > scala.Option, scala.Option, scala.Option, scala.Option, scala.Option, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, long, java.lang.String, int, double, > double, java.lang.String, java.lang.String, java.lang.String, long, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, long, > long, long, long, java.lang.String, com.xyz.xyz.xyz.domain.Patient, > com.xyz.xyz.xyz.domain.Physician, scala.collection.Seq, scala.collection.Seq, > java.lang.String, long, java.lang.String, int, int, boolean, boolean, > scala.collection.Seq, boolean, scala.collection.Seq, boolean, > scala.collection.Seq, scala.collection.Seq"; candidates are: > "com.xyz.xyz.xyz.domain.Account(java.lang.String, java.lang.String, long, > java.lang.String, long, long, long, java.lang.String, long, long, double, > scala.Option, scala.Option, java.lang.String, java.lang.String, long, > 
java.lang.String, java.lang.String, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, int, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, long, > long, long, long, long, scala.Option, scala.Option, scala.Option, > scala.Option, scala.Option, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, long, > java.lang.String, int, double, double, java.lang.String, java.lang.String, > java.lang.String, long, java.lang.String, java.lang.String, java.lang.String, > java.lang.String, long, long, long, long, java.lang.String, > com.xyz.xyz.xyz.domain.Patient, com.xyz.xyz.xyz.domain.Physician, > scala.collection.Seq, scala.collection.Seq, java.lang.String, long, > java.lang.String, int, int, boolean, boolean, scala.collection.Seq, boolean, > scala.collection.Seq, boolean, scala.collection.mutable.Seq, > scala.collection.mutable.Seq)" > The relevant lines are: > build 17-Oct-2017 05:30:50
[jira] [Comment Edited] (SPARK-22296) CodeGenerator - failed to compile when constructor has scala.collection.mutable.Seq vs. scala.collection.Seq
[ https://issues.apache.org/jira/browse/SPARK-22296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212983#comment-16212983 ] Randy Tidd edited comment on SPARK-22296 at 10/20/17 6:17 PM: -- The example above does not exhibit the problem. Here is a concise example that does. Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 17/10/20 14:09:04 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 17/10/20 14:09:08 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException Spark context Web UI available at http://10.10.43.134:4040 Spark context available as 'sc' (master = local[*], app id = local-1508522945163). Spark session available as 'spark'. Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.1.0 /_/ Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131) Type in expressions to have them evaluated. Type :help for more information. scala> case class Foo1(x: Int, s: String, seq: scala.collection.mutable.Seq[Int] = scala.collection.mutable.Seq.empty[Int]) defined class Foo1 scala> val ds1 = Seq(Foo1(1, "a"), Foo1(2, "b")).toDS ds1: org.apache.spark.sql.Dataset[Foo1] = [x: int, s: string ... 1 more field] scala> scala> case class Foo2(x: Int, s: String) defined class Foo2 scala> val ds2 = Seq(Foo2(1, "aa"), Foo2(3, "cc")).toDS ds2: org.apache.spark.sql.Dataset[Foo2] = [x: int, s: string] scala> scala> ds1.joinWith(ds2, ds1.col("x") === ds2.col("x")).collect() 17/10/20 14:09:14 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 263, Column 69: No applicable constructor/method found for actual parameters "int, java.lang.String, scala.collection.Seq"; candidates are: "$line14.$read$$iw$$iw$Foo1(int, java.lang.String, scala.collection.mutable.Seq)" /* 001 */ public java.lang.Object generate(Object[] references) { /* 002 */ return new SpecificSafeProjection(references); /* 003 */ } /* 004 */ /* 005 */ class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { /* 006 */ /* 007 */ private Object[] references; /* 008 */ private InternalRow mutableRow; /* 009 */ private $line14.$read$$iw$$iw$Foo1 argValue; /* 010 */ private $line16.$read$$iw$$iw$Foo2 argValue1; /* 011 */ private int argValue2; /* 012 */ private java.lang.String argValue3; /* 013 */ private scala.collection.Seq argValue4; /* 014 */ private boolean resultIsNull; /* 015 */ private java.lang.Object[] argValue5; /* 016 */ private boolean MapObjects_loopIsNull1; /* 017 */ private int MapObjects_loopValue0; /* 018 */ private int argValue6; /* 019 */ private java.lang.String argValue7; /* 020 */ private boolean isNull31; /* 021 */ private boolean value31; /* 022 */ private boolean isNull32; /* 023 */ private $line16.$read$$iw$$iw$Foo2 value32; /* 024 */ private boolean isNull33; /* 025 */ private $line16.$read$$iw$$iw$Foo2 value33; /* 026 */ /* 027 */ public SpecificSafeProjection(Object[] references) { /* 028 */ this.references = references; /* 029 */ mutableRow = (InternalRow) references[references.length - 1]; /* 030 */ /* 031 */ /* 032 */ /* 033 */ /* 034 */ /* 035 */ /* 036 */ /* 037 */ /* 038 */ /* 039 */ /* 040 */ /* 041 */ isNull31 = false; /* 042 */ value31 = false; /* 
043 */ isNull32 = false; /* 044 */ value32 = null; /* 045 */ isNull33 = false; /* 046 */ value33 = null; /* 047 */ /* 048 */ } /* 049 */ /* 050 */ public void initialize(int partitionIndex) { /* 051 */ /* 052 */ } /* 053 */ /* 054 */ /* 055 */ private void evalIfTrueExpr(InternalRow i) { /* 056 */ final $line16.$read$$iw$$iw$Foo2 value22 = null; /* 057 */ isNull32 = true; /* 058 */ value32 = value22; /* 059 */ } /* 060 */ /* 061 */ /* 062 */ private void apply_1(InternalRow i) { /* 063 */ /* 064 */ /* 065 */ resultIsNull = false; /* 066 */ if (!resultIsNull) { /* 067 */ /* 068 */ InternalRow value16 = i.getStruct(0, 3); /* 069 */ boolean isNull15 = false; /* 070 */ ArrayData value15 = null; /* 071 */ /* 072 */ /* 073 */ if (value16.isNullAt(2)) { /* 074 */ isNull15 = true; /* 075 */ } else { /* 076 */ value15 = value16.getArray(2); /* 077 */ } /* 078 */ ArrayData value14 = null; /* 079 */ /* 080 */ if (!isNull15) { /* 081 */ /* 082 */ Integer[] convertedArray = null; /* 083 */ int dataLength = value15.numElements(); /* 084 */ convertedArray = new Integer[dataLength]; /* 085 */ /*
[jira] [Commented] (SPARK-22296) CodeGenerator - failed to compile when constructor has scala.collection.mutable.Seq vs. scala.collection.Seq
[ https://issues.apache.org/jira/browse/SPARK-22296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212983#comment-16212983 ] Randy Tidd commented on SPARK-22296: The example above does not exhibit the problem. Here is a concise example that does. {quote}% spark-shell Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 17/10/20 14:09:04 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 17/10/20 14:09:08 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException Spark context Web UI available at http://10.10.43.134:4040 Spark context available as 'sc' (master = local[*], app id = local-1508522945163). Spark session available as 'spark'. Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.1.0 /_/ Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131) Type in expressions to have them evaluated. Type :help for more information. scala> case class Foo1(x: Int, s: String, seq: scala.collection.mutable.Seq[Int] = scala.collection.mutable.Seq.empty[Int]) defined class Foo1 scala> val ds1 = Seq(Foo1(1, "a"), Foo1(2, "b")).toDS ds1: org.apache.spark.sql.Dataset[Foo1] = [x: int, s: string ... 1 more field] scala> case class Foo2(x: Int, s: String) defined class Foo2 scala> val ds2 = Seq(Foo2(1, "aa"), Foo2(3, "cc")).toDS ds2: org.apache.spark.sql.Dataset[Foo2] = [x: int, s: string] scala> ds1.joinWith(ds2, ds1.col("x") === ds2.col("x")).collect() 17/10/20 14:09:14 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 263, Column 69: No applicable constructor/method found for actual parameters "int, java.lang.String, scala.collection.Seq"; candidates are: "$line14.$read$$iw$$iw$Foo1(int, java.lang.String, scala.collection.mutable.Seq)" /* 001 */ public java.lang.Object generate(Object[] references) { /* 002 */ return new SpecificSafeProjection(references); /* 003 */ } /* 004 */ /* 005 */ class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { /* 006 */ /* 007 */ private Object[] references; /* 008 */ private InternalRow mutableRow; /* 009 */ private $line14.$read$$iw$$iw$Foo1 argValue; /* 010 */ private $line16.$read$$iw$$iw$Foo2 argValue1; /* 011 */ private int argValue2; /* 012 */ private java.lang.String argValue3; /* 013 */ private scala.collection.Seq argValue4; /* 014 */ private boolean resultIsNull; /* 015 */ private java.lang.Object[] argValue5; /* 016 */ private boolean MapObjects_loopIsNull1; /* 017 */ private int MapObjects_loopValue0; /* 018 */ private int argValue6; /* 019 */ private java.lang.String argValue7; /* 020 */ private boolean isNull31; /* 021 */ private boolean value31; /* 022 */ private boolean isNull32; /* 023 */ private $line16.$read$$iw$$iw$Foo2 value32; /* 024 */ private boolean isNull33; /* 025 */ private $line16.$read$$iw$$iw$Foo2 value33; /* 026 */ /* 027 */ public SpecificSafeProjection(Object[] references) { /* 028 */ this.references = references; /* 029 */ mutableRow = (InternalRow) references[references.length - 1]; /* 030 */ /* 031 */ /* 032 */ /* 033 */ /* 034 */ /* 035 */ /* 036 */ /* 037 */ /* 038 */ /* 039 */ /* 040 */ /* 041 */ isNull31 = false; /* 042 */ value31 = false; /* 043 */ isNull32 = 
false; /* 044 */ value32 = null; /* 045 */ isNull33 = false; /* 046 */ value33 = null; /* 047 */ /* 048 */ } /* 049 */ /* 050 */ public void initialize(int partitionIndex) { /* 051 */ /* 052 */ } /* 053 */ /* 054 */ /* 055 */ private void evalIfTrueExpr(InternalRow i) { /* 056 */ final $line16.$read$$iw$$iw$Foo2 value22 = null; /* 057 */ isNull32 = true; /* 058 */ value32 = value22; /* 059 */ } /* 060 */ /* 061 */ /* 062 */ private void apply_1(InternalRow i) { /* 063 */ /* 064 */ /* 065 */ resultIsNull = false; /* 066 */ if (!resultIsNull) { /* 067 */ /* 068 */ InternalRow value16 = i.getStruct(0, 3); /* 069 */ boolean isNull15 = false; /* 070 */ ArrayData value15 = null; /* 071 */ /* 072 */ /* 073 */ if (value16.isNullAt(2)) { /* 074 */ isNull15 = true; /* 075 */ } else { /* 076 */ value15 = value16.getArray(2); /* 077 */ } /* 078 */ ArrayData value14 = null; /* 079 */ /* 080 */ if (!isNull15) { /* 081 */ /* 082 */ Integer[] convertedArray = null; /* 083 */ int dataLength = value15.numElements(); /* 084 */ convertedArray = new Integer[dataLength]; /* 085 */ /* 086 */ int loopIndex = 0; /* 087 */
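The root cause above is that the generated deserializer passes a scala.collection.Seq to a constructor that only accepts scala.collection.mutable.Seq. A minimal workaround sketch, offered here as an illustration rather than anything confirmed in this ticket, is to declare the encoded field as plain Seq so the constructor matches what the deserializer produces (class and value names below are hypothetical):
{code}
// Hypothetical workaround sketch: use scala.collection.Seq in the case class so the
// constructor accepts what Spark's generated deserializer builds.
import org.apache.spark.sql.SparkSession

object SeqFieldWorkaround {
  case class Foo1(x: Int, s: String, seq: Seq[Int] = Seq.empty[Int])
  case class Foo2(x: Int, s: String)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("seq-field-workaround").getOrCreate()
    import spark.implicits._

    val ds1 = Seq(Foo1(1, "a"), Foo1(2, "b")).toDS()
    val ds2 = Seq(Foo2(1, "aa"), Foo2(3, "cc")).toDS()

    // This join compiles the generated code successfully because Foo1's constructor
    // takes scala.collection.Seq, matching the deserialized value.
    ds1.joinWith(ds2, ds1.col("x") === ds2.col("x")).show()

    spark.stop()
  }
}
{code}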
[jira] [Commented] (SPARK-21926) Compatibility between ML Transformers and Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-21926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212962#comment-16212962 ] Joseph K. Bradley commented on SPARK-21926: --- One more report from the dev list: HashingTF and IDFModel fail with Structured Streaming: http://apache-spark-developers-list.1001551.n3.nabble.com/HashingTFModel-IDFModel-in-Structured-Streaming-td22680.html > Compatibility between ML Transformers and Structured Streaming > -- > > Key: SPARK-21926 > URL: https://issues.apache.org/jira/browse/SPARK-21926 > Project: Spark > Issue Type: Umbrella > Components: ML, Structured Streaming >Affects Versions: 2.2.0 >Reporter: Bago Amirbekian > > We've run into a few cases where ML components don't play nice with streaming > dataframes (for prediction). This ticket is meant to help aggregate these > known cases in one place and provide a place to discuss possible fixes. > Failing cases: > 1) VectorAssembler where one of the inputs is a VectorUDT column with no > metadata. > Possible fixes: > a) Re-design vectorUDT metadata to support missing metadata for some > elements. (This might be a good thing to do anyways SPARK-19141) > b) drop metadata in streaming context. > 2) OneHotEncoder where the input is a column with no metadata. > Possible fixes: > a) Make OneHotEncoder an estimator (SPARK-13030). > b) Allow user to set the cardinality of OneHotEncoder. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
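For readers unfamiliar with the failure mode being collected here, the pattern is simply calling transform on a feature transformer against a streaming DataFrame at prediction time. A minimal sketch under assumed names, source, and schema, not taken from any of the reports above:
{code}
// Illustrative sketch of the usage this umbrella tracks: an ML transformer applied to
// a streaming DataFrame. Source, schema, and sink below are examples only.
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object StreamingTransformSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("streaming-ml-sketch").getOrCreate()

    // The rate source yields a streaming DataFrame with (timestamp, value) columns.
    val streamDf = spark.readStream.format("rate").option("rowsPerSecond", "10").load()

    // VectorAssembler is one of the components tracked here; with purely numeric inputs
    // this works, while a VectorUDT input column without metadata fails (failing case 1).
    val assembler = new VectorAssembler().setInputCols(Array("value")).setOutputCol("features")
    val assembled = assembler.transform(streamDf)

    val query = assembled.writeStream.format("console").start()
    query.awaitTermination(10000L)
    spark.stop()
  }
}
{code}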
[jira] [Updated] (SPARK-21926) Compatibility between ML Transformers and Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-21926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-21926: -- Shepherd: Joseph K. Bradley > Compatibility between ML Transformers and Structured Streaming > -- > > Key: SPARK-21926 > URL: https://issues.apache.org/jira/browse/SPARK-21926 > Project: Spark > Issue Type: Umbrella > Components: ML, Structured Streaming >Affects Versions: 2.2.0 >Reporter: Bago Amirbekian > > We've run into a few cases where ML components don't play nice with streaming > dataframes (for prediction). This ticket is meant to help aggregate these > known cases in one place and provide a place to discuss possible fixes. > Failing cases: > 1) VectorAssembler where one of the inputs is a VectorUDT column with no > metadata. > Possible fixes: > a) Re-design vectorUDT metadata to support missing metadata for some > elements. (This might be a good thing to do anyways SPARK-19141) > b) drop metadata in streaming context. > 2) OneHotEncoder where the input is a column with no metadata. > Possible fixes: > a) Make OneHotEncoder an estimator (SPARK-13030). > b) Allow user to set the cardinality of OneHotEncoder. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21926) Compatibility between ML Transformers and Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-21926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-21926: -- Summary: Compatibility between ML Transformers and Structured Streaming (was: Some transformers in spark.ml.feature fail when trying to transform streaming dataframes) > Compatibility between ML Transformers and Structured Streaming > -- > > Key: SPARK-21926 > URL: https://issues.apache.org/jira/browse/SPARK-21926 > Project: Spark > Issue Type: Bug > Components: ML, Structured Streaming >Affects Versions: 2.2.0 >Reporter: Bago Amirbekian > > We've run into a few cases where ML components don't play nice with streaming > dataframes (for prediction). This ticket is meant to help aggregate these > known cases in one place and provide a place to discuss possible fixes. > Failing cases: > 1) VectorAssembler where one of the inputs is a VectorUDT column with no > metadata. > Possible fixes: > a) Re-design vectorUDT metadata to support missing metadata for some > elements. (This might be a good thing to do anyways SPARK-19141) > b) drop metadata in streaming context. > 2) OneHotEncoder where the input is a column with no metadata. > Possible fixes: > a) Make OneHotEncoder an estimator (SPARK-13030). > b) Allow user to set the cardinality of OneHotEncoder. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21926) Compatibility between ML Transformers and Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-21926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-21926: -- Issue Type: Umbrella (was: Bug) > Compatibility between ML Transformers and Structured Streaming > -- > > Key: SPARK-21926 > URL: https://issues.apache.org/jira/browse/SPARK-21926 > Project: Spark > Issue Type: Umbrella > Components: ML, Structured Streaming >Affects Versions: 2.2.0 >Reporter: Bago Amirbekian > > We've run into a few cases where ML components don't play nice with streaming > dataframes (for prediction). This ticket is meant to help aggregate these > known cases in one place and provide a place to discuss possible fixes. > Failing cases: > 1) VectorAssembler where one of the inputs is a VectorUDT column with no > metadata. > Possible fixes: > a) Re-design vectorUDT metadata to support missing metadata for some > elements. (This might be a good thing to do anyways SPARK-19141) > b) drop metadata in streaming context. > 2) OneHotEncoder where the input is a column with no metadata. > Possible fixes: > a) Make OneHotEncoder an estimator (SPARK-13030). > b) Allow user to set the cardinality of OneHotEncoder. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19007) Speedup and optimize the GradientBoostedTrees in the "data>memory" scene
[ https://issues.apache.org/jira/browse/SPARK-19007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212932#comment-16212932 ] Michael Mior commented on SPARK-19007: -- I see the following statement in the PR discussion, but I don't understand why this causes a problem. bq. it had to do with the fact that RDDs may be materialized later than checkpointer.update() gets called. > Speedup and optimize the GradientBoostedTrees in the "data>memory" scene > > > Key: SPARK-19007 > URL: https://issues.apache.org/jira/browse/SPARK-19007 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.0.1, 2.0.2, 2.1.0 > Environment: A CDH cluster consisting of 3 RedHat servers (120G > memory, 40 cores, 43TB disk per server). >Reporter: zhangdenghui >Priority: Minor > Original Estimate: 168h > Remaining Estimate: 168h > > Test data: 80G CTR training data from > criteolabs (http://criteolabs.wpengine.com/downloads/download-terabyte-click-logs/ > ). I used 1 of the 24 days' data. Some features needed to be replaced by newly > generated continuous features; the way to generate the new features follows the > approach described in the xgboost paper. > Resources allocated: spark on yarn, 20 executors, 8G memory and 2 cores per > executor. > Parameters: numIterations 10, maxDepth 8, the rest of the parameters are defaults. > I tested the GradientBoostedTrees algorithm in mllib using the 80G CTR data > mentioned above. > It took 1.5 hours in total, and I found many task failures after 6 or 7 GBT > rounds. Without these task failures and task retries it could be much faster, > which would save about half the time. I think it's caused by the RDD named > predError in the while loop of the boost method in > GradientBoostedTrees.scala: the lineage of predError keeps > growing after every GBT round, which then caused failures like this: > (ExecutorLostFailure (executor 6 exited caused by one of the running tasks) > Reason: Container killed by YARN for exceeding memory limits. 10.2 GB of 10 > GB physical memory used. Consider boosting > spark.yarn.executor.memoryOverhead.). > I tried boosting spark.yarn.executor.memoryOverhead but the memory it > needs is too much (even increasing the memory by half can't solve the problem), > so I think it's not a proper fix. > Setting the checkpoint interval smaller would cut the > lineage, but it increases IO cost a lot. > I tried another way to solve this problem: I persisted the RDD named > predError every round and used pre_predError to record the previous RDD and > unpersist it, because it's not needed anymore. > With this method it took about 45 min, with no task failures > and no additional memory. > So when the data is much larger than memory, my small improvement can > speed up GradientBoostedTrees by 1-2x. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19007) Speedup and optimize the GradientBoostedTrees in the "data>memory" scene
[ https://issues.apache.org/jira/browse/SPARK-19007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212924#comment-16212924 ] Michael Mior commented on SPARK-19007: -- [~josephkb] I'm confused about why it's necessary to cache more than one RDD in the queue. It seems like there should never be a need for data from previous iterations if there's enough memory to cache the previous iteration. And if there isn't, trying to cache even more data seems like it would just make things worse. > Speedup and optimize the GradientBoostedTrees in the "data>memory" scene > > > Key: SPARK-19007 > URL: https://issues.apache.org/jira/browse/SPARK-19007 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.0.1, 2.0.2, 2.1.0 > Environment: A CDH cluster consisting of 3 RedHat servers (120G > memory, 40 cores, 43TB disk per server). >Reporter: zhangdenghui >Priority: Minor > Original Estimate: 168h > Remaining Estimate: 168h > > Test data: 80G CTR training data from > criteolabs (http://criteolabs.wpengine.com/downloads/download-terabyte-click-logs/ > ). I used 1 of the 24 days' data. Some features needed to be replaced by newly > generated continuous features; the way to generate the new features follows the > approach described in the xgboost paper. > Resources allocated: spark on yarn, 20 executors, 8G memory and 2 cores per > executor. > Parameters: numIterations 10, maxDepth 8, the rest of the parameters are defaults. > I tested the GradientBoostedTrees algorithm in mllib using the 80G CTR data > mentioned above. > It took 1.5 hours in total, and I found many task failures after 6 or 7 GBT > rounds. Without these task failures and task retries it could be much faster, > which would save about half the time. I think it's caused by the RDD named > predError in the while loop of the boost method in > GradientBoostedTrees.scala: the lineage of predError keeps > growing after every GBT round, which then caused failures like this: > (ExecutorLostFailure (executor 6 exited caused by one of the running tasks) > Reason: Container killed by YARN for exceeding memory limits. 10.2 GB of 10 > GB physical memory used. Consider boosting > spark.yarn.executor.memoryOverhead.). > I tried boosting spark.yarn.executor.memoryOverhead but the memory it > needs is too much (even increasing the memory by half can't solve the problem), > so I think it's not a proper fix. > Setting the checkpoint interval smaller would cut the > lineage, but it increases IO cost a lot. > I tried another way to solve this problem: I persisted the RDD named > predError every round and used pre_predError to record the previous RDD and > unpersist it, because it's not needed anymore. > With this method it took about 45 min, with no task failures > and no additional memory. > So when the data is much larger than memory, my small improvement can > speed up GradientBoostedTrees by 1-2x. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
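The approach described in the ticket (persist the new predError each round, then unpersist the previous one once the new one is materialized) is a common pattern for keeping both the lineage and the cached footprint of an iterative job small. A minimal spark-shell-style sketch under assumed names, not the actual GradientBoostedTrees.scala code:
{code}
// Sketch of the persist/unpersist-previous pattern described above; the names
// predError and prevPredError mirror the description, but the code is illustrative.
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

def runRounds(initial: RDD[(Double, Double)], numIterations: Int)
             (updateOneRound: RDD[(Double, Double)] => RDD[(Double, Double)]): RDD[(Double, Double)] = {
  var predError = initial.persist(StorageLevel.MEMORY_AND_DISK)
  var prevPredError = predError
  var i = 0
  while (i < numIterations) {
    predError = updateOneRound(prevPredError).persist(StorageLevel.MEMORY_AND_DISK)
    // Materialize the new RDD before dropping the old one, so later actions read the
    // cached data instead of recomputing an ever-growing lineage.
    predError.count()
    prevPredError.unpersist(blocking = false)
    prevPredError = predError
    i += 1
  }
  predError
}
{code}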
[jira] [Comment Edited] (SPARK-22209) PySpark does not recognize imports from submodules
[ https://issues.apache.org/jira/browse/SPARK-22209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212890#comment-16212890 ] Bryan Cutler edited comment on SPARK-22209 at 10/20/17 5:01 PM: It does seem like a bug to me so it should be fixed, I suspect it is in Sparks cloudpickle module. Would you be able to submit a patch for this? If not, I can try to take a look. cc [~holden.ka...@gmail.com] was (Author: bryanc): It does seem like a bug to me so it should be fixed, I suspect it is in Sparks cloudpickle module. Would you be able to submit a patch for this? If not, I can try to take a look. > PySpark does not recognize imports from submodules > -- > > Key: SPARK-22209 > URL: https://issues.apache.org/jira/browse/SPARK-22209 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0, 2.3.0 > Environment: Anaconda 4.4.0, Python 3.6, Hadoop 2.7, CDH 5.3.3, JDK > 1.8, Centos 6 >Reporter: Joel Croteau >Priority: Minor > > Using submodule syntax inside a PySpark job seems to create issues. For > example, the following: > {code:python} > import scipy.sparse > from pyspark import SparkContext, SparkConf > def do_stuff(x): > y = scipy.sparse.dok_matrix((1, 1)) > y[0, 0] = x > return y[0, 0] > def init_context(): > conf = SparkConf().setAppName("Spark Test") > sc = SparkContext(conf=conf) > return sc > def main(): > sc = init_context() > data = sc.parallelize([1, 2, 3, 4]) > output_data = data.map(do_stuff) > print(output_data.collect()) > __name__ == '__main__' and main() > {code} > produces this error: > {noformat} > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087) > at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at 
org.apache.spark.rdd.RDD.withScope(RDD.scala:362) > at org.apache.spark.rdd.RDD.collect(RDD.scala:935) > at > org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:458) > at > org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:280) > at > py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.r
[jira] [Commented] (SPARK-22209) PySpark does not recognize imports from submodules
[ https://issues.apache.org/jira/browse/SPARK-22209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212890#comment-16212890 ] Bryan Cutler commented on SPARK-22209: -- It does seem like a bug to me so it should be fixed, I suspect it is in Sparks cloudpickle module. Would you be able to submit a patch for this? If not, I can try to take a look. > PySpark does not recognize imports from submodules > -- > > Key: SPARK-22209 > URL: https://issues.apache.org/jira/browse/SPARK-22209 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0, 2.3.0 > Environment: Anaconda 4.4.0, Python 3.6, Hadoop 2.7, CDH 5.3.3, JDK > 1.8, Centos 6 >Reporter: Joel Croteau >Priority: Minor > > Using submodule syntax inside a PySpark job seems to create issues. For > example, the following: > {code:python} > import scipy.sparse > from pyspark import SparkContext, SparkConf > def do_stuff(x): > y = scipy.sparse.dok_matrix((1, 1)) > y[0, 0] = x > return y[0, 0] > def init_context(): > conf = SparkConf().setAppName("Spark Test") > sc = SparkContext(conf=conf) > return sc > def main(): > sc = init_context() > data = sc.parallelize([1, 2, 3, 4]) > output_data = data.map(do_stuff) > print(output_data.collect()) > __name__ == '__main__' and main() > {code} > produces this error: > {noformat} > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087) > at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) > at org.apache.spark.rdd.RDD.collect(RDD.scala:935) > at > org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:458) > at > org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala) > at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:280) > at > py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:214) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.spark.api.python.PythonException: Traceback (most > recent call last): > File > "/home/matt/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", > line 177, in m
[jira] [Resolved] (SPARK-21055) Support grouping__id
[ https://issues.apache.org/jira/browse/SPARK-21055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-21055. - Resolution: Fixed Assignee: cen yuhai Fix Version/s: 2.3.0 > Support grouping__id > > > Key: SPARK-21055 > URL: https://issues.apache.org/jira/browse/SPARK-21055 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 > Environment: spark2.1.1 >Reporter: cen yuhai >Assignee: cen yuhai > Fix For: 2.3.0 > > > Now, spark doesn't support grouping__id, spark provide another function > grouping_id() to workaround. > If use grouping_id(), many scripts need to change and supporting > grouping__id is very easy, why not? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
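For context, the two spellings involved are compared below in a small spark-shell-style sketch; the data, table, and column names are examples only, and the Hive-style grouping__id spelling works once this fix (2.3.0) is in place:
{code}
// Illustrative comparison of grouping_id() and the Hive-style grouping__id spelling.
// Example data only; run in spark-shell so spark and the implicits are available.
Seq(("x", "p", 1), ("x", "q", 2), ("y", "p", 3)).toDF("a", "b", "v").createOrReplaceTempView("t")

// Built-in function that has been available as the workaround:
spark.sql("SELECT a, b, grouping_id() AS gid, count(*) AS cnt FROM t GROUP BY a, b WITH ROLLUP").show()

// Hive-compatible spelling that existing scripts use; supported after this fix:
spark.sql("SELECT a, b, grouping__id AS gid, count(*) AS cnt FROM t GROUP BY a, b WITH ROLLUP").show()
{code}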
[jira] [Commented] (SPARK-22323) Design doc for different types of pandas_udf
[ https://issues.apache.org/jira/browse/SPARK-22323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212778#comment-16212778 ] Apache Spark commented on SPARK-22323: -- User 'icexelloss' has created a pull request for this issue: https://github.com/apache/spark/pull/19544 > Design doc for different types of pandas_udf > > > Key: SPARK-22323 > URL: https://issues.apache.org/jira/browse/SPARK-22323 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Li Jin >Assignee: Apache Spark > Fix For: 2.3.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22323) Design doc for different types of pandas_udf
[ https://issues.apache.org/jira/browse/SPARK-22323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22323: Assignee: Apache Spark > Design doc for different types of pandas_udf > > > Key: SPARK-22323 > URL: https://issues.apache.org/jira/browse/SPARK-22323 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Li Jin >Assignee: Apache Spark > Fix For: 2.3.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22323) Design doc for different types of pandas_udf
[ https://issues.apache.org/jira/browse/SPARK-22323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22323: Assignee: (was: Apache Spark) > Design doc for different types of pandas_udf > > > Key: SPARK-22323 > URL: https://issues.apache.org/jira/browse/SPARK-22323 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Li Jin > Fix For: 2.3.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19606) Support constraints in spark-dispatcher
[ https://issues.apache.org/jira/browse/SPARK-19606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212770#comment-16212770 ] Apache Spark commented on SPARK-19606: -- User 'pmackles' has created a pull request for this issue: https://github.com/apache/spark/pull/19543 > Support constraints in spark-dispatcher > --- > > Key: SPARK-19606 > URL: https://issues.apache.org/jira/browse/SPARK-19606 > Project: Spark > Issue Type: New Feature > Components: Mesos >Affects Versions: 2.1.0 >Reporter: Philipp Hoffmann > > The `spark.mesos.constraints` configuration is ignored by the > spark-dispatcher. The constraints need to be passed in the Framework > information when registering with Mesos. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22323) Design doc for different types of pandas_udf
Li Jin created SPARK-22323: -- Summary: Design doc for different types of pandas_udf Key: SPARK-22323 URL: https://issues.apache.org/jira/browse/SPARK-22323 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 2.2.0 Reporter: Li Jin Fix For: 2.3.0 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21750) Use arrow 0.6.0
[ https://issues.apache.org/jira/browse/SPARK-21750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212741#comment-16212741 ] Dongjoon Hyun commented on SPARK-21750: --- Hi, [~kiszk]. Two more Arrow releases seem to be out. How about the Python side? Can we catch up some? - 0.7.1 (1 October 2017) - 0.7.0 (17 September 2017) > Use arrow 0.6.0 > --- > > Key: SPARK-21750 > URL: https://issues.apache.org/jira/browse/SPARK-21750 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki >Priority: Minor > > Since [Arrow 0.6.0|http://arrow.apache.org/release/0.6.0.html] has been > released, use the latest one -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21750) Use arrow 0.6.0
[ https://issues.apache.org/jira/browse/SPARK-21750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212740#comment-16212740 ] Apache Spark commented on SPARK-21750: -- User 'kiszk' has created a pull request for this issue: https://github.com/apache/spark/pull/18974 > Use arrow 0.6.0 > --- > > Key: SPARK-21750 > URL: https://issues.apache.org/jira/browse/SPARK-21750 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki >Priority: Minor > > Since [Arrow 0.6.0|http://arrow.apache.org/release/0.6.0.html] has been > released, use the latest one -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22322) Update FutureAction for compatibility with Scala 2.12 future
Sean Owen created SPARK-22322: - Summary: Update FutureAction for compatibility with Scala 2.12 future Key: SPARK-22322 URL: https://issues.apache.org/jira/browse/SPARK-22322 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 2.3.0 Reporter: Sean Owen Assignee: Sean Owen Priority: Minor The latest item on the road to Scala 2.12 is a change in Future, which adds new transform and transformWith methods. The current stub implementation in place doesn't work, as it ends up being used by Scala APIs. The change is simple: implement pass-through implementations in SimpleFutureAction and ComplexFutureAction, while making sure the method definitions compile in Scala 2.11 (where they'd be unused). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
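As an illustration of what "pass-through" means here, the sketch below shows a wrapper over a Scala Future delegating the new methods to the future it wraps. The class and member names are hypothetical, not Spark's actual FutureAction code, and the sketch assumes compilation against Scala 2.12, where transform and transformWith with these signatures exist:
{code}
// Illustrative pass-through sketch only (not Spark's FutureAction implementation).
// Assumes a Scala 2.12 build, where these Future overloads are available.
import scala.concurrent.{ExecutionContext, Future}
import scala.util.Try

class PassThroughFuture[T](underlying: Future[T]) {
  def transform[S](f: Try[T] => Try[S])(implicit ec: ExecutionContext): Future[S] =
    underlying.transform(f)

  def transformWith[S](f: Try[T] => Future[S])(implicit ec: ExecutionContext): Future[S] =
    underlying.transformWith(f)
}
{code}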
[jira] [Commented] (SPARK-22320) ORC should support VectorUDT/MatrixUDT
[ https://issues.apache.org/jira/browse/SPARK-22320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212640#comment-16212640 ] Dongjoon Hyun commented on SPARK-22320: --- Thank you, [~hyukjin.kwon]! > ORC should support VectorUDT/MatrixUDT > -- > > Key: SPARK-22320 > URL: https://issues.apache.org/jira/browse/SPARK-22320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.2, 2.2.0 >Reporter: zhengruifeng > > I save dataframe containing vectors in ORC format, when I read it back, the > format is changed. > {code} > scala> import org.apache.spark.ml.linalg._ > import org.apache.spark.ml.linalg._ > scala> val data = Seq((1,Vectors.dense(1.0,2.0)), (2,Vectors.sparse(8, > Array(4), Array(1.0 > data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((1,[1.0,2.0]), > (2,(8,[4],[1.0]))) > scala> val df = data.toDF("i", "vec") > df: org.apache.spark.sql.DataFrame = [i: int, vec: vector] > scala> df.schema > res0: org.apache.spark.sql.types.StructType = > StructType(StructField(i,IntegerType,false), > StructField(vec,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true)) > scala> df.write.orc("/tmp/123") > scala> val df2 = spark.sqlContext.read.orc("/tmp/123") > df2: org.apache.spark.sql.DataFrame = [i: int, vec: struct size: int ... 2 more fields>] > scala> df2.schema > res3: org.apache.spark.sql.types.StructType = > StructType(StructField(i,IntegerType,true), > StructField(vec,StructType(StructField(type,ByteType,true), > StructField(size,IntegerType,true), > StructField(indices,ArrayType(IntegerType,true),true), > StructField(values,ArrayType(DoubleType,true),true)),true)) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22320) ORC should support VectorUDT/MatrixUDT
[ https://issues.apache.org/jira/browse/SPARK-22320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212638#comment-16212638 ] Hyukjin Kwon commented on SPARK-22320: -- {quote} Is this ORC only issue? What about Parquet? {quote} This also happens in JSON: {code} import org.apache.spark.ml.linalg._ val data = Seq((1,Vectors.dense(1.0,2.0)), (2,Vectors.sparse(8, Array(4), Array(1.0)))) val df = data.toDF("i", "vec") df.write.json("/tmp/foo") spark.read.json("/tmp/foo").schema {code} {code} res1: org.apache.spark.sql.types.StructType = StructType(StructField(i,LongType,true), StructField(vec,StructType(StructField(indices,ArrayType(LongType,true),true), StructField(size,LongType,true), StructField(type,LongType,true), StructField(values,ArrayType(DoubleType,true),true)),true)) {code} Parquet seems fine: {code} import org.apache.spark.ml.linalg._ val data = Seq((1,Vectors.dense(1.0,2.0)), (2,Vectors.sparse(8, Array(4), Array(1.0)))) val df = data.toDF("i", "vec") df.write.parquet("/tmp/bar") spark.read.parquet("/tmp/bar").schema {code} {code} res1: org.apache.spark.sql.types.StructType = StructType(StructField(i,IntegerType,true), StructField(vec,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true)) {code} {quote} Is this a regression in master branch? Could you check official Spark 2.2.0? {quote} This does not look like a regression. I checked master too and it still happens. Will take a closer look soon. > ORC should support VectorUDT/MatrixUDT > -- > > Key: SPARK-22320 > URL: https://issues.apache.org/jira/browse/SPARK-22320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.2, 2.2.0 >Reporter: zhengruifeng > > I save dataframe containing vectors in ORC format, when I read it back, the > format is changed. > {code} > scala> import org.apache.spark.ml.linalg._ > import org.apache.spark.ml.linalg._ > scala> val data = Seq((1,Vectors.dense(1.0,2.0)), (2,Vectors.sparse(8, > Array(4), Array(1.0 > data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((1,[1.0,2.0]), > (2,(8,[4],[1.0]))) > scala> val df = data.toDF("i", "vec") > df: org.apache.spark.sql.DataFrame = [i: int, vec: vector] > scala> df.schema > res0: org.apache.spark.sql.types.StructType = > StructType(StructField(i,IntegerType,false), > StructField(vec,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true)) > scala> df.write.orc("/tmp/123") > scala> val df2 = spark.sqlContext.read.orc("/tmp/123") > df2: org.apache.spark.sql.DataFrame = [i: int, vec: struct size: int ... 2 more fields>] > scala> df2.schema > res3: org.apache.spark.sql.types.StructType = > StructType(StructField(i,IntegerType,true), > StructField(vec,StructType(StructField(type,ByteType,true), > StructField(size,IntegerType,true), > StructField(indices,ArrayType(IntegerType,true),true), > StructField(values,ArrayType(DoubleType,true),true)),true)) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
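Until UDT columns round-trip through ORC (and JSON), one possible workaround, offered only as a sketch rather than anything from this ticket, is to flatten the vector to a plain array column before writing; paths and names below are examples:
{code}
// Hypothetical workaround sketch (spark-shell style): store the vector as a plain
// array<double> column, which ORC round-trips without losing the schema.
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.udf

val toArray = udf((v: Vector) => v.toArray)
val toVector = udf((a: Seq[Double]) => Vectors.dense(a.toArray))

val data = Seq((1, Vectors.dense(1.0, 2.0)), (2, Vectors.sparse(8, Array(4), Array(1.0))))
val df = data.toDF("i", "vec")

// Example output path; flatten the UDT column before writing.
df.withColumn("vec", toArray($"vec")).write.orc("/tmp/vec_as_array")

// Read back and rebuild the vector column (sparse vectors come back densified).
val restored = spark.read.orc("/tmp/vec_as_array").withColumn("vec", toVector($"vec"))
restored.printSchema()
{code}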
[jira] [Commented] (SPARK-22320) ORC should support VectorUDT/MatrixUDT
[ https://issues.apache.org/jira/browse/SPARK-22320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212637#comment-16212637 ] Dongjoon Hyun commented on SPARK-22320: --- Thank you, [~podongfeng]. I checked that and updated this JIRA info. > ORC should support VectorUDT/MatrixUDT > -- > > Key: SPARK-22320 > URL: https://issues.apache.org/jira/browse/SPARK-22320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.2, 2.2.0 >Reporter: zhengruifeng > > I save dataframe containing vectors in ORC format, when I read it back, the > format is changed. > {code} > scala> import org.apache.spark.ml.linalg._ > import org.apache.spark.ml.linalg._ > scala> val data = Seq((1,Vectors.dense(1.0,2.0)), (2,Vectors.sparse(8, > Array(4), Array(1.0 > data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((1,[1.0,2.0]), > (2,(8,[4],[1.0]))) > scala> val df = data.toDF("i", "vec") > df: org.apache.spark.sql.DataFrame = [i: int, vec: vector] > scala> df.schema > res0: org.apache.spark.sql.types.StructType = > StructType(StructField(i,IntegerType,false), > StructField(vec,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true)) > scala> df.write.orc("/tmp/123") > scala> val df2 = spark.sqlContext.read.orc("/tmp/123") > df2: org.apache.spark.sql.DataFrame = [i: int, vec: struct size: int ... 2 more fields>] > scala> df2.schema > res3: org.apache.spark.sql.types.StructType = > StructType(StructField(i,IntegerType,true), > StructField(vec,StructType(StructField(type,ByteType,true), > StructField(size,IntegerType,true), > StructField(indices,ArrayType(IntegerType,true),true), > StructField(values,ArrayType(DoubleType,true),true)),true)) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22320) ORC should support VectorUDT/MatrixUDT
[ https://issues.apache.org/jira/browse/SPARK-22320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212629#comment-16212629 ] zhengruifeng commented on SPARK-22320: -- I tested on Spark 2.2.0. Parquet works fine. I have not tried MatrixUDT. > ORC should support VectorUDT/MatrixUDT > -- > > Key: SPARK-22320 > URL: https://issues.apache.org/jira/browse/SPARK-22320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.2, 2.2.0 >Reporter: zhengruifeng > > I save dataframe containing vectors in ORC format, when I read it back, the > format is changed. > {code} > scala> import org.apache.spark.ml.linalg._ > import org.apache.spark.ml.linalg._ > scala> val data = Seq((1,Vectors.dense(1.0,2.0)), (2,Vectors.sparse(8, > Array(4), Array(1.0 > data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((1,[1.0,2.0]), > (2,(8,[4],[1.0]))) > scala> val df = data.toDF("i", "vec") > df: org.apache.spark.sql.DataFrame = [i: int, vec: vector] > scala> df.schema > res0: org.apache.spark.sql.types.StructType = > StructType(StructField(i,IntegerType,false), > StructField(vec,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true)) > scala> df.write.orc("/tmp/123") > scala> val df2 = spark.sqlContext.read.orc("/tmp/123") > df2: org.apache.spark.sql.DataFrame = [i: int, vec: struct size: int ... 2 more fields>] > scala> df2.schema > res3: org.apache.spark.sql.types.StructType = > StructType(StructField(i,IntegerType,true), > StructField(vec,StructType(StructField(type,ByteType,true), > StructField(size,IntegerType,true), > StructField(indices,ArrayType(IntegerType,true),true), > StructField(values,ArrayType(DoubleType,true),true)),true)) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22320) ORC should support VectorUDT/MatrixUDT
[ https://issues.apache.org/jira/browse/SPARK-22320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212624#comment-16212624 ] zhengruifeng commented on SPARK-22320: -- I tested on Spark 2.2.0. Parquet works fine. I have not tried MatrixUDT. -- Original Message -- From: "Dongjoon Hyun (JIRA)" ; Sent: Friday, October 20, 2017 19:34 To: "ruifengz" ; Subject: [jira] [Commented] (SPARK-22320) ORC should support VectorUDT/MatrixUDT [ https://issues.apache.org/jira/browse/SPARK-22320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212507#comment-16212507 ] Dongjoon Hyun commented on SPARK-22320: --- Thank you for reporting, [~podongfeng]. - Is this ORC only issue? What about Parquet? - Is this a regression in master branch? Could you check official Spark 2.2.0? -- This message was sent by Atlassian JIRA (v6.4.14#64029) > ORC should support VectorUDT/MatrixUDT > -- > > Key: SPARK-22320 > URL: https://issues.apache.org/jira/browse/SPARK-22320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.2, 2.2.0 >Reporter: zhengruifeng > > I save dataframe containing vectors in ORC format, when I read it back, the > format is changed. > {code} > scala> import org.apache.spark.ml.linalg._ > import org.apache.spark.ml.linalg._ > scala> val data = Seq((1,Vectors.dense(1.0,2.0)), (2,Vectors.sparse(8, > Array(4), Array(1.0 > data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((1,[1.0,2.0]), > (2,(8,[4],[1.0]))) > scala> val df = data.toDF("i", "vec") > df: org.apache.spark.sql.DataFrame = [i: int, vec: vector] > scala> df.schema > res0: org.apache.spark.sql.types.StructType = > StructType(StructField(i,IntegerType,false), > StructField(vec,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true)) > scala> df.write.orc("/tmp/123") > scala> val df2 = spark.sqlContext.read.orc("/tmp/123") > df2: org.apache.spark.sql.DataFrame = [i: int, vec: struct size: int ... 2 more fields>] > scala> df2.schema > res3: org.apache.spark.sql.types.StructType = > StructType(StructField(i,IntegerType,true), > StructField(vec,StructType(StructField(type,ByteType,true), > StructField(size,IntegerType,true), > StructField(indices,ArrayType(IntegerType,true),true), > StructField(values,ArrayType(DoubleType,true),true)),true)) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22305) HDFSBackedStateStoreProvider fails with StackOverflowException when attempting to recover state
[ https://issues.apache.org/jira/browse/SPARK-22305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212608#comment-16212608 ] Yuval Itzchakov commented on SPARK-22305: - [~tdas] What do you think? > HDFSBackedStateStoreProvider fails with StackOverflowException when > attempting to recover state > --- > > Key: SPARK-22305 > URL: https://issues.apache.org/jira/browse/SPARK-22305 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Yuval Itzchakov > > Environment: > Spark: 2.2.0 > Java version: 1.8.0_112 > spark.sql.streaming.minBatchesToRetain: 100 > After an application failure due to OOM exceptions, restarting the > application with the existing state produces the following OOM: > {code:java} > java.io.IOException: com.google.protobuf.ServiceException: > java.lang.StackOverflowError > at > org.apache.hadoop.ipc.ProtobufHelper.getRemoteException(ProtobufHelper.java:47) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:260) > at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > at com.sun.proxy.$Proxy18.getBlockLocations(Unknown Source) > at > org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1240) > at > org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1227) > at > org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1215) > at > org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:303) > at > org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:269) > at org.apache.hadoop.hdfs.DFSInputStream.(DFSInputStream.java:261) > at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1540) > at > org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:304) > at > org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:299) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:312) > at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:767) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$readSnapshotFile(HDFSBackedStateStoreProvider.scala:405) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:296) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:295) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:295) > at > 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$4.apply(HDFSBackedStateStoreProvider.scala:297) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$4.apply(HDFSBackedStateStoreProvider.scala:296) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:296) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:295) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.execution.streamin
[jira] [Commented] (SPARK-22320) ORC should support VectorUDT/MatrixUDT
[ https://issues.apache.org/jira/browse/SPARK-22320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212601#comment-16212601 ] Dongjoon Hyun commented on SPARK-22320: --- Ping [~hyukjin.kwon] since you worked on SPARK-17765. > ORC should support VectorUDT/MatrixUDT > -- > > Key: SPARK-22320 > URL: https://issues.apache.org/jira/browse/SPARK-22320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.2, 2.2.0 >Reporter: zhengruifeng > > I save dataframe containing vectors in ORC format, when I read it back, the > format is changed. > {code} > scala> import org.apache.spark.ml.linalg._ > import org.apache.spark.ml.linalg._ > scala> val data = Seq((1,Vectors.dense(1.0,2.0)), (2,Vectors.sparse(8, > Array(4), Array(1.0 > data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((1,[1.0,2.0]), > (2,(8,[4],[1.0]))) > scala> val df = data.toDF("i", "vec") > df: org.apache.spark.sql.DataFrame = [i: int, vec: vector] > scala> df.schema > res0: org.apache.spark.sql.types.StructType = > StructType(StructField(i,IntegerType,false), > StructField(vec,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true)) > scala> df.write.orc("/tmp/123") > scala> val df2 = spark.sqlContext.read.orc("/tmp/123") > df2: org.apache.spark.sql.DataFrame = [i: int, vec: struct size: int ... 2 more fields>] > scala> df2.schema > res3: org.apache.spark.sql.types.StructType = > StructType(StructField(i,IntegerType,true), > StructField(vec,StructType(StructField(type,ByteType,true), > StructField(size,IntegerType,true), > StructField(indices,ArrayType(IntegerType,true),true), > StructField(values,ArrayType(DoubleType,true),true)),true)) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22320) ORC should support VectorUDT/MatrixUDT
[ https://issues.apache.org/jira/browse/SPARK-22320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212581#comment-16212581 ] Dongjoon Hyun commented on SPARK-22320: --- I checked that 2.2.0 and 2.1.2 has the same problem and 2.0.2 fails with the following exception. {code} scala> df.write.mode("overwrite").orc("/tmp/o3") 17/10/20 05:34:42 ERROR Utils: Aborting task java.lang.ClassCastException: org.apache.spark.ml.linalg.VectorUDT cannot be cast to org.apache.spark.sql.types.StructType at org.apache.spark.sql.hive.HiveInspectors$class.wrap(HiveInspectors.scala:558) {code} > ORC should support VectorUDT/MatrixUDT > -- > > Key: SPARK-22320 > URL: https://issues.apache.org/jira/browse/SPARK-22320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.2, 2.2.0 >Reporter: zhengruifeng > > I save dataframe containing vectors in ORC format, when I read it back, the > format is changed. > {code} > scala> import org.apache.spark.ml.linalg._ > import org.apache.spark.ml.linalg._ > scala> val data = Seq((1,Vectors.dense(1.0,2.0)), (2,Vectors.sparse(8, > Array(4), Array(1.0 > data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((1,[1.0,2.0]), > (2,(8,[4],[1.0]))) > scala> val df = data.toDF("i", "vec") > df: org.apache.spark.sql.DataFrame = [i: int, vec: vector] > scala> df.schema > res0: org.apache.spark.sql.types.StructType = > StructType(StructField(i,IntegerType,false), > StructField(vec,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true)) > scala> df.write.orc("/tmp/123") > scala> val df2 = spark.sqlContext.read.orc("/tmp/123") > df2: org.apache.spark.sql.DataFrame = [i: int, vec: struct size: int ... 2 more fields>] > scala> df2.schema > res3: org.apache.spark.sql.types.StructType = > StructType(StructField(i,IntegerType,true), > StructField(vec,StructType(StructField(type,ByteType,true), > StructField(size,IntegerType,true), > StructField(indices,ArrayType(IntegerType,true),true), > StructField(values,ArrayType(DoubleType,true),true)),true)) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22320) ORC should support VectorUDT/MatrixUDT
[ https://issues.apache.org/jira/browse/SPARK-22320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-22320: -- Affects Version/s: 2.0.2 > ORC should support VectorUDT/MatrixUDT > -- > > Key: SPARK-22320 > URL: https://issues.apache.org/jira/browse/SPARK-22320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.2, 2.2.0 >Reporter: zhengruifeng > > I save dataframe containing vectors in ORC format, when I read it back, the > format is changed. > {code} > scala> import org.apache.spark.ml.linalg._ > import org.apache.spark.ml.linalg._ > scala> val data = Seq((1,Vectors.dense(1.0,2.0)), (2,Vectors.sparse(8, > Array(4), Array(1.0 > data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((1,[1.0,2.0]), > (2,(8,[4],[1.0]))) > scala> val df = data.toDF("i", "vec") > df: org.apache.spark.sql.DataFrame = [i: int, vec: vector] > scala> df.schema > res0: org.apache.spark.sql.types.StructType = > StructType(StructField(i,IntegerType,false), > StructField(vec,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true)) > scala> df.write.orc("/tmp/123") > scala> val df2 = spark.sqlContext.read.orc("/tmp/123") > df2: org.apache.spark.sql.DataFrame = [i: int, vec: struct size: int ... 2 more fields>] > scala> df2.schema > res3: org.apache.spark.sql.types.StructType = > StructType(StructField(i,IntegerType,true), > StructField(vec,StructType(StructField(type,ByteType,true), > StructField(size,IntegerType,true), > StructField(indices,ArrayType(IntegerType,true),true), > StructField(values,ArrayType(DoubleType,true),true)),true)) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22320) ORC should support VectorUDT/MatrixUDT
[ https://issues.apache.org/jira/browse/SPARK-22320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-22320: -- Affects Version/s: 2.1.2 > ORC should support VectorUDT/MatrixUDT > -- > > Key: SPARK-22320 > URL: https://issues.apache.org/jira/browse/SPARK-22320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.2, 2.2.0 >Reporter: zhengruifeng > > I save dataframe containing vectors in ORC format, when I read it back, the > format is changed. > {code} > scala> import org.apache.spark.ml.linalg._ > import org.apache.spark.ml.linalg._ > scala> val data = Seq((1,Vectors.dense(1.0,2.0)), (2,Vectors.sparse(8, > Array(4), Array(1.0 > data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((1,[1.0,2.0]), > (2,(8,[4],[1.0]))) > scala> val df = data.toDF("i", "vec") > df: org.apache.spark.sql.DataFrame = [i: int, vec: vector] > scala> df.schema > res0: org.apache.spark.sql.types.StructType = > StructType(StructField(i,IntegerType,false), > StructField(vec,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true)) > scala> df.write.orc("/tmp/123") > scala> val df2 = spark.sqlContext.read.orc("/tmp/123") > df2: org.apache.spark.sql.DataFrame = [i: int, vec: struct size: int ... 2 more fields>] > scala> df2.schema > res3: org.apache.spark.sql.types.StructType = > StructType(StructField(i,IntegerType,true), > StructField(vec,StructType(StructField(type,ByteType,true), > StructField(size,IntegerType,true), > StructField(indices,ArrayType(IntegerType,true),true), > StructField(values,ArrayType(DoubleType,true),true)),true)) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22320) ORC should support VectorUDT/MatrixUDT
[ https://issues.apache.org/jira/browse/SPARK-22320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-22320: -- Affects Version/s: (was: 2.3.0) > ORC should support VectorUDT/MatrixUDT > -- > > Key: SPARK-22320 > URL: https://issues.apache.org/jira/browse/SPARK-22320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: zhengruifeng > > I save dataframe containing vectors in ORC format, when I read it back, the > format is changed. > {code} > scala> import org.apache.spark.ml.linalg._ > import org.apache.spark.ml.linalg._ > scala> val data = Seq((1,Vectors.dense(1.0,2.0)), (2,Vectors.sparse(8, > Array(4), Array(1.0 > data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((1,[1.0,2.0]), > (2,(8,[4],[1.0]))) > scala> val df = data.toDF("i", "vec") > df: org.apache.spark.sql.DataFrame = [i: int, vec: vector] > scala> df.schema > res0: org.apache.spark.sql.types.StructType = > StructType(StructField(i,IntegerType,false), > StructField(vec,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true)) > scala> df.write.orc("/tmp/123") > scala> val df2 = spark.sqlContext.read.orc("/tmp/123") > df2: org.apache.spark.sql.DataFrame = [i: int, vec: struct size: int ... 2 more fields>] > scala> df2.schema > res3: org.apache.spark.sql.types.StructType = > StructType(StructField(i,IntegerType,true), > StructField(vec,StructType(StructField(type,ByteType,true), > StructField(size,IntegerType,true), > StructField(indices,ArrayType(IntegerType,true),true), > StructField(values,ArrayType(DoubleType,true),true)),true)) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22320) ORC should support VectorUDT/MatrixUDT
[ https://issues.apache.org/jira/browse/SPARK-22320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-22320: -- Affects Version/s: 2.2.0 > ORC should support VectorUDT/MatrixUDT > -- > > Key: SPARK-22320 > URL: https://issues.apache.org/jira/browse/SPARK-22320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: zhengruifeng > > I save dataframe containing vectors in ORC format, when I read it back, the > format is changed. > {code} > scala> import org.apache.spark.ml.linalg._ > import org.apache.spark.ml.linalg._ > scala> val data = Seq((1,Vectors.dense(1.0,2.0)), (2,Vectors.sparse(8, > Array(4), Array(1.0 > data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((1,[1.0,2.0]), > (2,(8,[4],[1.0]))) > scala> val df = data.toDF("i", "vec") > df: org.apache.spark.sql.DataFrame = [i: int, vec: vector] > scala> df.schema > res0: org.apache.spark.sql.types.StructType = > StructType(StructField(i,IntegerType,false), > StructField(vec,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true)) > scala> df.write.orc("/tmp/123") > scala> val df2 = spark.sqlContext.read.orc("/tmp/123") > df2: org.apache.spark.sql.DataFrame = [i: int, vec: struct size: int ... 2 more fields>] > scala> df2.schema > res3: org.apache.spark.sql.types.StructType = > StructType(StructField(i,IntegerType,true), > StructField(vec,StructType(StructField(type,ByteType,true), > StructField(size,IntegerType,true), > StructField(indices,ArrayType(IntegerType,true),true), > StructField(values,ArrayType(DoubleType,true),true)),true)) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22321) Improve Logging in the mesos scheduler
Stavros Kontopoulos created SPARK-22321: --- Summary: Improve Logging in the mesos scheduler Key: SPARK-22321 URL: https://issues.apache.org/jira/browse/SPARK-22321 Project: Spark Issue Type: Improvement Components: Mesos Affects Versions: 2.2.0 Reporter: Stavros Kontopoulos There are certain places where logging could be improved, for example when offers are rescinded or when tasks are not launched due to unmet constraints in MesosCoarseGrainedSchedulerBackend. There is valuable information at the Mesos layer that we could expose in order to get a better understanding, in one place, of what is happening. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22321) Improve logging in the mesos scheduler
[ https://issues.apache.org/jira/browse/SPARK-22321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated SPARK-22321: Summary: Improve logging in the mesos scheduler (was: Improve Logging in the mesos scheduler) > Improve logging in the mesos scheduler > -- > > Key: SPARK-22321 > URL: https://issues.apache.org/jira/browse/SPARK-22321 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 2.2.0 >Reporter: Stavros Kontopoulos > > There are certain places where logging must be improved. For example when > offers are rescinded or tasks were not launched due to unmet constraints in > MesosCoarseGrainedSchedulerBackend. There is valuable info at the mesos layer > we could expose to get a better understanding of what is happening in one > place. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
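A minimal sketch of the kind of logging SPARK-22321 asks for, assuming the code lives inside Spark's Mesos module (where the private Logging trait is visible); the offerRescinded callback is part of the Mesos Scheduler API, but the helper name and exact insertion points are illustrative only, not taken from the actual patch.
{code}
// Hypothetical sketch only: extra logging around rescinded offers and declined
// offers, in the spirit of SPARK-22321. The surrounding class and the offer
// matching logic are assumed, not copied from Spark.
import org.apache.mesos.Protos.{Offer, OfferID}
import org.apache.mesos.SchedulerDriver
import org.apache.spark.internal.Logging

trait MesosOfferLoggingSketch extends Logging {
  // Mesos invokes this callback when a previously received offer is rescinded.
  def offerRescinded(driver: SchedulerDriver, offerId: OfferID): Unit = {
    logInfo(s"Offer ${offerId.getValue} was rescinded by the Mesos master")
  }

  // Called from the (assumed) offer-matching loop when constraints are not met.
  def logUnmetConstraints(offer: Offer, reason: String): Unit = {
    logInfo(s"Declining offer ${offer.getId.getValue} from agent " +
      s"${offer.getHostname}: unmet constraints ($reason)")
  }
}
{code}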
[jira] [Updated] (SPARK-18935) Use Mesos "Dynamic Reservation" resource for Spark
[ https://issues.apache.org/jira/browse/SPARK-18935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated SPARK-18935: Component/s: Mesos > Use Mesos "Dynamic Reservation" resource for Spark > -- > > Key: SPARK-18935 > URL: https://issues.apache.org/jira/browse/SPARK-18935 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.0.0, 2.0.1, 2.0.2 >Reporter: jackyoh > > I'm running spark on Apache Mesos > Please follow these steps to reproduce the issue: > 1. First, run Mesos resource reserve: > curl -i -d slaveId=c24d1cfb-79f3-4b07-9f8b-c7b19543a333-S0 -d > resources='[{"name":"cpus","type":"SCALAR","scalar":{"value":20},"role":"spark","reservation":{"principal":""}},{"name":"mem","type":"SCALAR","scalar":{"value":4096},"role":"spark","reservation":{"principal":""}}]' > -X POST http://192.168.1.118:5050/master/reserve > 2. Then run spark-submit command: > ./spark-submit --class org.apache.spark.examples.SparkPi --master > mesos://192.168.1.118:5050 --conf spark.mesos.role=spark > ../examples/jars/spark-examples_2.11-2.0.2.jar 1 > And the console will keep loging same warning message as shown below: > 16/12/19 22:33:28 WARN TaskSchedulerImpl: Initial job has not accepted any > resources; check your cluster UI to ensure that workers are registered and > have sufficient resources -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21492) Memory leak in SortMergeJoin
[ https://issues.apache.org/jira/browse/SPARK-21492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212522#comment-16212522 ] Bartosz Mścichowski commented on SPARK-21492: - Here's a script that exposes memory leak during SortMergeJoin in Spark 2.2.0, maybe it will be helpful. Memory leak happens when the following code is executed in spark-shell (a local one). {{--conf spark.sql.autoBroadcastJoinThreshold=-1}} may be needed to ensure proper join type. {noformat} import org.apache.spark.sql.types._ import org.apache.spark.sql._ import org.apache.spark.sql.functions._ val table1Key = "t1_key" val table1Value = "t1_value" val table2Key = "t2_key" val table2Value = "t2_value" val table1Schema = StructType(List( StructField(table1Key, IntegerType), StructField(table1Value, DoubleType) )); val table2Schema = StructType(List( StructField(table2Key, IntegerType), StructField(table2Value, DoubleType) )); val table1 = spark.sqlContext.createDataFrame( rowRDD = spark.sparkContext.parallelize(Seq( Row(1, 2.0) )), schema = table1Schema ); val table2 = spark.sqlContext.createDataFrame( rowRDD = spark.sparkContext.parallelize(Seq( Row(1, 4.0) )), schema = table2Schema ); val t1 = table1.repartition(col(table1Key)).groupBy(table1Key).avg() val t2 = table2.repartition(col(table2Key)).groupBy(table2Key).avg() val joinedDF = t1 join t2 where t1(table1Key) === t2(table2Key) joinedDF.explain() // == Physical Plan == // *SortMergeJoin [t1_key#2], [t2_key#9], Inner // :- *Sort [t1_key#2 ASC NULLS FIRST], false, 0 // : +- *HashAggregate(keys=[t1_key#2], functions=[avg(cast(t1_key#2 as bigint)), avg(t1_value#3)]) // : +- *HashAggregate(keys=[t1_key#2], functions=[partial_avg(cast(t1_key#2 as bigint)), partial_avg(t1_value#3)]) // :+- Exchange hashpartitioning(t1_key#2, 200) // : +- *Filter isnotnull(t1_key#2) // : +- Scan ExistingRDD[t1_key#2,t1_value#3] // +- *Sort [t2_key#9 ASC NULLS FIRST], false, 0 //+- *HashAggregate(keys=[t2_key#9], functions=[avg(cast(t2_key#9 as bigint)), avg(t2_value#10)]) // +- *HashAggregate(keys=[t2_key#9], functions=[partial_avg(cast(t2_key#9 as bigint)), partial_avg(t2_value#10)]) // +- Exchange hashpartitioning(t2_key#9, 200) // +- *Filter isnotnull(t2_key#9) //+- Scan ExistingRDD[t2_key#9,t2_value#10] joinedDF.show() // The 'show' action yields a lot of: // 17/10/19 08:17:39 WARN executor.Executor: Managed memory leak detected; size = 4194304 bytes, TID = 8 // 17/10/19 08:17:39 WARN executor.Executor: Managed memory leak detected; size = 4194304 bytes, TID = 9 // 17/10/19 08:17:39 WARN executor.Executor: Managed memory leak detected; size = 4194304 bytes, TID = 11 {noformat} > Memory leak in SortMergeJoin > > > Key: SPARK-21492 > URL: https://issues.apache.org/jira/browse/SPARK-21492 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Zhan Zhang > > In SortMergeJoin, if the iterator is not exhausted, there will be memory leak > caused by the Sort. The memory is not released until the task end, and cannot > be used by other operators causing performance drop or OOM. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
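Not part of the ticket, but one way to sidestep the leaking SortMergeJoin path while the fix is pending is to broadcast the smaller side explicitly. This is a hedged suggestion only, reusing the t1/t2 frames and key names from the reproduction script above, and it is only sensible when one side is small enough to broadcast.
{code}
// Hedged workaround sketch (not from the ticket): force a broadcast hash join so
// the sort-based merge join, and the sorter buffers it holds, are never used.
import org.apache.spark.sql.functions.broadcast

val broadcastJoinedDF = t1.join(broadcast(t2), t1(table1Key) === t2(table2Key))
broadcastJoinedDF.explain()  // should show BroadcastHashJoin instead of SortMergeJoin
broadcastJoinedDF.show()
{code}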
[jira] [Commented] (SPARK-22284) Code of class \"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection\" grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-22284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212513#comment-16212513 ] Ben commented on SPARK-22284: - [~kiszk], happy to help, and thanks for your assistance. So, if I understand correctly, there was a similar ticket and it was solved, but my case is more complicated, right? There are some cases in the files where the structures may indeed get very complex. Would this case also be solvable? > Code of class > \"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection\" > grows beyond 64 KB > -- > > Key: SPARK-22284 > URL: https://issues.apache.org/jira/browse/SPARK-22284 > Project: Spark > Issue Type: Bug > Components: Optimizer, PySpark, SQL >Affects Versions: 2.1.0 >Reporter: Ben > Attachments: 64KB Error.log > > > I am using pySpark 2.1.0 in a production environment, and trying to join two > DataFrames, one of which is very large and has complex nested structures. > Basically, I load both DataFrames and cache them. > Then, in the large DataFrame, I extract 3 nested values and save them as > direct columns. > Finally, I join on these three columns with the smaller DataFrame. > This would be a short code for this: > {code} > dataFrame.read..cache() > dataFrameSmall.read...cache() > dataFrame = dataFrame.selectExpr(['*','nested.Value1 AS > Value1','nested.Value2 AS Value2','nested.Value3 AS Value3']) > dataFrame = dataFrame.dropDuplicates().join(dataFrameSmall, > ['Value1','Value2',Value3']) > dataFrame.count() > {code} > And this is the error I get when it gets to the count(): > {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 11 in > stage 7.0 failed 4 times, most recent failure: Lost task 11.3 in stage 7.0 > (TID 11234, somehost.com, executor 10): > java.util.concurrent.ExecutionException: java.lang.Exception: failed to > compile: org.codehaus.janino.JaninoRuntimeException: Code of method > \"apply_1$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V\" > of class > \"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection\" > grows beyond 64 KB > {code} > I have seen many tickets with similar issues here, but no proper solution. > Most of the fixes are until Spark 2.1.0 so I don't know if running it on > Spark 2.2.0 would fix it. In any case I cannot change the version of Spark > since it is in production. > I have also tried setting > {code:java} > spark.sql.codegen.wholeStage=false > {code} > but still the same error. > The job worked well up to now, also with large datasets, but apparently this > batch got larger, and that is the only thing that changed. Is there any > workaround for this? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
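For reference, the whole-stage-codegen switch the reporter already tried can also be set at runtime on an existing session. The reporter's job is PySpark; the Scala form is shown here only for consistency with the other snippets in this digest, and it is a sketch, not a confirmed fix for this ticket.
{code}
// Disable whole-stage code generation for the current SparkSession (the same key
// the reporter set; in PySpark the call is spark.conf.set with the same arguments,
// or --conf spark.sql.codegen.wholeStage=false on spark-submit).
spark.conf.set("spark.sql.codegen.wholeStage", "false")
{code}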
[jira] [Commented] (SPARK-22320) ORC should support VectorUDT/MatrixUDT
[ https://issues.apache.org/jira/browse/SPARK-22320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212507#comment-16212507 ] Dongjoon Hyun commented on SPARK-22320: --- Thank you for reporting, [~podongfeng]. - Is this ORC only issue? What about Parquet? - Is this a regression in master branch? Could you check official Spark 2.2.0? > ORC should support VectorUDT/MatrixUDT > -- > > Key: SPARK-22320 > URL: https://issues.apache.org/jira/browse/SPARK-22320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: zhengruifeng > > I save dataframe containing vectors in ORC format, when I read it back, the > format is changed. > {code} > scala> import org.apache.spark.ml.linalg._ > import org.apache.spark.ml.linalg._ > scala> val data = Seq((1,Vectors.dense(1.0,2.0)), (2,Vectors.sparse(8, > Array(4), Array(1.0 > data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((1,[1.0,2.0]), > (2,(8,[4],[1.0]))) > scala> val df = data.toDF("i", "vec") > df: org.apache.spark.sql.DataFrame = [i: int, vec: vector] > scala> df.schema > res0: org.apache.spark.sql.types.StructType = > StructType(StructField(i,IntegerType,false), > StructField(vec,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true)) > scala> df.write.orc("/tmp/123") > scala> val df2 = spark.sqlContext.read.orc("/tmp/123") > df2: org.apache.spark.sql.DataFrame = [i: int, vec: struct size: int ... 2 more fields>] > scala> df2.schema > res3: org.apache.spark.sql.types.StructType = > StructType(StructField(i,IntegerType,true), > StructField(vec,StructType(StructField(type,ByteType,true), > StructField(size,IntegerType,true), > StructField(indices,ArrayType(IntegerType,true),true), > StructField(values,ArrayType(DoubleType,true),true)),true)) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
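A hedged sketch of the cross-check asked about in the comment above: round-trip the same dataframe through Parquet and compare the schema. It reuses the df built in the reproduction; the output path is arbitrary, and the expected result is stated as an expectation to verify, not a confirmed fact from this ticket.
{code}
// Sketch of the Parquet comparison suggested above, reusing df from the repro.
df.write.parquet("/tmp/vec_parquet_check")
val dfParquet = spark.read.parquet("/tmp/vec_parquet_check")
dfParquet.schema  // expectation to verify: the 'vector' UDT survives the Parquet
                  // round-trip, unlike the ORC round-trip shown in the report
{code}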
[jira] [Updated] (SPARK-22320) ORC should support VectorUDT/MatrixUDT
[ https://issues.apache.org/jira/browse/SPARK-22320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-22320: - Description: I save dataframe containing vectors in ORC format, when I read it back, the format is changed. {code} scala> import org.apache.spark.ml.linalg._ import org.apache.spark.ml.linalg._ scala> val data = Seq((1,Vectors.dense(1.0,2.0)), (2,Vectors.sparse(8, Array(4), Array(1.0 data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((1,[1.0,2.0]), (2,(8,[4],[1.0]))) scala> val df = data.toDF("i", "vec") df: org.apache.spark.sql.DataFrame = [i: int, vec: vector] scala> df.schema res0: org.apache.spark.sql.types.StructType = StructType(StructField(i,IntegerType,false), StructField(vec,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true)) scala> df.write.orc("/tmp/123") scala> val df2 = spark.sqlContext.read.orc("/tmp/123") df2: org.apache.spark.sql.DataFrame = [i: int, vec: struct] scala> df2.schema res3: org.apache.spark.sql.types.StructType = StructType(StructField(i,IntegerType,true), StructField(vec,StructType(StructField(type,ByteType,true), StructField(size,IntegerType,true), StructField(indices,ArrayType(IntegerType,true),true), StructField(values,ArrayType(DoubleType,true),true)),true)) {code} was: I save dataframe containing {{ml.Vector}}s in ORC format, when I read it back, the format is changed. {code} scala> import org.apache.spark.ml.linalg._ import org.apache.spark.ml.linalg._ scala> val data = Seq((1,Vectors.dense(1.0,2.0)), (2,Vectors.sparse(8, Array(4), Array(1.0 data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((1,[1.0,2.0]), (2,(8,[4],[1.0]))) scala> val df = data.toDF("i", "vec") df: org.apache.spark.sql.DataFrame = [i: int, vec: vector] scala> df.schema res0: org.apache.spark.sql.types.StructType = StructType(StructField(i,IntegerType,false), StructField(vec,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true)) scala> df.write.orc("/tmp/123") scala> val df2 = spark.sqlContext.read.orc("/tmp/123") df2: org.apache.spark.sql.DataFrame = [i: int, vec: struct] scala> df2.schema res3: org.apache.spark.sql.types.StructType = StructType(StructField(i,IntegerType,true), StructField(vec,StructType(StructField(type,ByteType,true), StructField(size,IntegerType,true), StructField(indices,ArrayType(IntegerType,true),true), StructField(values,ArrayType(DoubleType,true),true)),true)) {code} > ORC should support VectorUDT/MatrixUDT > -- > > Key: SPARK-22320 > URL: https://issues.apache.org/jira/browse/SPARK-22320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: zhengruifeng > > I save dataframe containing vectors in ORC format, when I read it back, the > format is changed. > {code} > scala> import org.apache.spark.ml.linalg._ > import org.apache.spark.ml.linalg._ > scala> val data = Seq((1,Vectors.dense(1.0,2.0)), (2,Vectors.sparse(8, > Array(4), Array(1.0 > data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((1,[1.0,2.0]), > (2,(8,[4],[1.0]))) > scala> val df = data.toDF("i", "vec") > df: org.apache.spark.sql.DataFrame = [i: int, vec: vector] > scala> df.schema > res0: org.apache.spark.sql.types.StructType = > StructType(StructField(i,IntegerType,false), > StructField(vec,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true)) > scala> df.write.orc("/tmp/123") > scala> val df2 = spark.sqlContext.read.orc("/tmp/123") > df2: org.apache.spark.sql.DataFrame = [i: int, vec: struct size: int ... 
2 more fields>] > scala> df2.schema > res3: org.apache.spark.sql.types.StructType = > StructType(StructField(i,IntegerType,true), > StructField(vec,StructType(StructField(type,ByteType,true), > StructField(size,IntegerType,true), > StructField(indices,ArrayType(IntegerType,true),true), > StructField(values,ArrayType(DoubleType,true),true)),true)) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22320) ORC should support VectorUDT/MatrixUDT
[ https://issues.apache.org/jira/browse/SPARK-22320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-22320: - Summary: ORC should support VectorUDT/MatrixUDT (was: ORC should support VectorUDF/MatrixUDT) > ORC should support VectorUDT/MatrixUDT > -- > > Key: SPARK-22320 > URL: https://issues.apache.org/jira/browse/SPARK-22320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: zhengruifeng > > I save dataframe containing {{ml.Vector}}s in ORC format, when I read it > back, the format is changed. > {code} > scala> import org.apache.spark.ml.linalg._ > import org.apache.spark.ml.linalg._ > scala> val data = Seq((1,Vectors.dense(1.0,2.0)), (2,Vectors.sparse(8, > Array(4), Array(1.0 > data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((1,[1.0,2.0]), > (2,(8,[4],[1.0]))) > scala> val df = data.toDF("i", "vec") > df: org.apache.spark.sql.DataFrame = [i: int, vec: vector] > scala> df.schema > res0: org.apache.spark.sql.types.StructType = > StructType(StructField(i,IntegerType,false), > StructField(vec,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true)) > scala> df.write.orc("/tmp/123") > scala> val df2 = spark.sqlContext.read.orc("/tmp/123") > df2: org.apache.spark.sql.DataFrame = [i: int, vec: struct size: int ... 2 more fields>] > scala> df2.schema > res3: org.apache.spark.sql.types.StructType = > StructType(StructField(i,IntegerType,true), > StructField(vec,StructType(StructField(type,ByteType,true), > StructField(size,IntegerType,true), > StructField(indices,ArrayType(IntegerType,true),true), > StructField(values,ArrayType(DoubleType,true),true)),true)) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22320) ORC should support VectorUDF/MatrixUDT
[ https://issues.apache.org/jira/browse/SPARK-22320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-22320: - Description: I save dataframe containing {{ml.Vector}}s in ORC format, when I read it back, the format is changed. {code} scala> import org.apache.spark.ml.linalg._ import org.apache.spark.ml.linalg._ scala> val data = Seq((1,Vectors.dense(1.0,2.0)), (2,Vectors.sparse(8, Array(4), Array(1.0 data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((1,[1.0,2.0]), (2,(8,[4],[1.0]))) scala> val df = data.toDF("i", "vec") df: org.apache.spark.sql.DataFrame = [i: int, vec: vector] scala> df.schema res0: org.apache.spark.sql.types.StructType = StructType(StructField(i,IntegerType,false), StructField(vec,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true)) scala> df.write.orc("/tmp/123") scala> val df2 = spark.sqlContext.read.orc("/tmp/123") df2: org.apache.spark.sql.DataFrame = [i: int, vec: struct] scala> df2.schema res3: org.apache.spark.sql.types.StructType = StructType(StructField(i,IntegerType,true), StructField(vec,StructType(StructField(type,ByteType,true), StructField(size,IntegerType,true), StructField(indices,ArrayType(IntegerType,true),true), StructField(values,ArrayType(DoubleType,true),true)),true)) {code} was: I save dataframe containing ((ml.Vector}}s in ORC format, when I read it back, the format is changed. {code} scala> import org.apache.spark.ml.linalg._ import org.apache.spark.ml.linalg._ scala> val data = Seq((1,Vectors.dense(1.0,2.0)), (2,Vectors.sparse(8, Array(4), Array(1.0 data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((1,[1.0,2.0]), (2,(8,[4],[1.0]))) scala> val df = data.toDF("i", "vec") df: org.apache.spark.sql.DataFrame = [i: int, vec: vector] scala> df.schema res0: org.apache.spark.sql.types.StructType = StructType(StructField(i,IntegerType,false), StructField(vec,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true)) scala> df.write.orc("/tmp/123") scala> val df2 = spark.sqlContext.read.orc("/tmp/123") df2: org.apache.spark.sql.DataFrame = [i: int, vec: struct] scala> df2.schema res3: org.apache.spark.sql.types.StructType = StructType(StructField(i,IntegerType,true), StructField(vec,StructType(StructField(type,ByteType,true), StructField(size,IntegerType,true), StructField(indices,ArrayType(IntegerType,true),true), StructField(values,ArrayType(DoubleType,true),true)),true)) {code} > ORC should support VectorUDF/MatrixUDT > -- > > Key: SPARK-22320 > URL: https://issues.apache.org/jira/browse/SPARK-22320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: zhengruifeng > > I save dataframe containing {{ml.Vector}}s in ORC format, when I read it > back, the format is changed. > {code} > scala> import org.apache.spark.ml.linalg._ > import org.apache.spark.ml.linalg._ > scala> val data = Seq((1,Vectors.dense(1.0,2.0)), (2,Vectors.sparse(8, > Array(4), Array(1.0 > data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((1,[1.0,2.0]), > (2,(8,[4],[1.0]))) > scala> val df = data.toDF("i", "vec") > df: org.apache.spark.sql.DataFrame = [i: int, vec: vector] > scala> df.schema > res0: org.apache.spark.sql.types.StructType = > StructType(StructField(i,IntegerType,false), > StructField(vec,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true)) > scala> df.write.orc("/tmp/123") > scala> val df2 = spark.sqlContext.read.orc("/tmp/123") > df2: org.apache.spark.sql.DataFrame = [i: int, vec: struct size: int ... 
2 more fields>] > scala> df2.schema > res3: org.apache.spark.sql.types.StructType = > StructType(StructField(i,IntegerType,true), > StructField(vec,StructType(StructField(type,ByteType,true), > StructField(size,IntegerType,true), > StructField(indices,ArrayType(IntegerType,true),true), > StructField(values,ArrayType(DoubleType,true),true)),true)) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22320) ORC should support VectorUDF/MatrixUDT
zhengruifeng created SPARK-22320: Summary: ORC should support VectorUDF/MatrixUDT Key: SPARK-22320 URL: https://issues.apache.org/jira/browse/SPARK-22320 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: zhengruifeng I save dataframe containing ((ml.Vector}}s in ORC format, when I read it back, the format is changed. {code} scala> import org.apache.spark.ml.linalg._ import org.apache.spark.ml.linalg._ scala> val data = Seq((1,Vectors.dense(1.0,2.0)), (2,Vectors.sparse(8, Array(4), Array(1.0 data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((1,[1.0,2.0]), (2,(8,[4],[1.0]))) scala> val df = data.toDF("i", "vec") df: org.apache.spark.sql.DataFrame = [i: int, vec: vector] scala> df.schema res0: org.apache.spark.sql.types.StructType = StructType(StructField(i,IntegerType,false), StructField(vec,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true)) scala> df.write.orc("/tmp/123") scala> val df2 = spark.sqlContext.read.orc("/tmp/123") df2: org.apache.spark.sql.DataFrame = [i: int, vec: struct] scala> df2.schema res3: org.apache.spark.sql.types.StructType = StructType(StructField(i,IntegerType,true), StructField(vec,StructType(StructField(type,ByteType,true), StructField(size,IntegerType,true), StructField(indices,ArrayType(IntegerType,true),true), StructField(values,ArrayType(DoubleType,true),true)),true)) {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22229) SPIP: RDMA Accelerated Shuffle Engine
[ https://issues.apache.org/jira/browse/SPARK-9?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212462#comment-16212462 ] Rob Vesse commented on SPARK-9: --- [~yuvaldeg], thanks for providing additional clarifications If a library is considered to be a standard part of the platform software then it should fall under the foundations standards [platform|http://www.apache.org/legal/resolved.html#platform] resolution that licensing of the platform does generally not affect the software running upon it. And if there are other Apache projects already depending on this that provides a precedent that Spark can rely on. > SPIP: RDMA Accelerated Shuffle Engine > - > > Key: SPARK-9 > URL: https://issues.apache.org/jira/browse/SPARK-9 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Yuval Degani > Attachments: > SPARK-9_SPIP_RDMA_Accelerated_Shuffle_Engine_Rev_1.0.pdf > > > An RDMA-accelerated shuffle engine can provide enormous performance benefits > to shuffle-intensive Spark jobs, as demonstrated in the “SparkRDMA” plugin > open-source project ([https://github.com/Mellanox/SparkRDMA]). > Using RDMA for shuffle improves CPU utilization significantly and reduces I/O > processing overhead by bypassing the kernel and networking stack as well as > avoiding memory copies entirely. Those valuable CPU cycles are then consumed > directly by the actual Spark workloads, and help reducing the job runtime > significantly. > This performance gain is demonstrated with both industry standard HiBench > TeraSort (shows 1.5x speedup in sorting) as well as shuffle intensive > customer applications. > SparkRDMA will be presented at Spark Summit 2017 in Dublin > ([https://spark-summit.org/eu-2017/events/accelerating-shuffle-a-tailor-made-rdma-solution-for-apache-spark/]). > Please see attached proposal document for more information. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22296) CodeGenerator - failed to compile when constructor has scala.collection.mutable.Seq vs. scala.collection.Seq
[ https://issues.apache.org/jira/browse/SPARK-22296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-22296. --- Resolution: Duplicate > CodeGenerator - failed to compile when constructor has > scala.collection.mutable.Seq vs. scala.collection.Seq > > > Key: SPARK-22296 > URL: https://issues.apache.org/jira/browse/SPARK-22296 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 2.1.0 >Reporter: Randy Tidd > > This is with Scala 2.11. > We have a case class that has a constructor with 85 args, the last two of > which are: > var chargesInst : > scala.collection.mutable.Seq[ChargeInstitutional] = > scala.collection.mutable.Seq.empty[ChargeInstitutional], > var chargesProf : > scala.collection.mutable.Seq[ChargeProfessional] = > scala.collection.mutable.Seq.empty[ChargeProfessional] > A mutable Seq in a the constructor of a case class is probably poor form but > Scala allows it. When we run this job we get this error: > build 17-Oct-2017 05:30:502017-10-17 09:30:50 [Executor task launch > worker-1] ERROR o.a.s.s.c.e.codegen.CodeGenerator - failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 8217, Column 70: No applicable constructor/method found for actual parameters > "java.lang.String, java.lang.String, long, java.lang.String, long, long, > long, java.lang.String, long, long, double, scala.Option, scala.Option, > java.lang.String, java.lang.String, long, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, int, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, long, long, long, long, long, > scala.Option, scala.Option, scala.Option, scala.Option, scala.Option, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, long, java.lang.String, int, double, > double, java.lang.String, java.lang.String, java.lang.String, long, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, long, > long, long, long, java.lang.String, com.xyz.xyz.xyz.domain.Patient, > com.xyz.xyz.xyz.domain.Physician, scala.collection.Seq, scala.collection.Seq, > java.lang.String, long, java.lang.String, int, int, boolean, boolean, > scala.collection.Seq, boolean, scala.collection.Seq, boolean, > scala.collection.Seq, scala.collection.Seq"; candidates are: > "com.xyz.xyz.xyz.domain.Account(java.lang.String, java.lang.String, long, > java.lang.String, long, long, long, java.lang.String, long, long, double, > scala.Option, scala.Option, java.lang.String, java.lang.String, long, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, int, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, long, > long, long, long, long, scala.Option, scala.Option, scala.Option, > scala.Option, scala.Option, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, long, > java.lang.String, int, double, double, java.lang.String, java.lang.String, > java.lang.String, long, java.lang.String, java.lang.String, java.lang.String, > java.lang.String, long, long, long, long, java.lang.String, > com.xyz.xyz.xyz.domain.Patient, 
com.xyz.xyz.xyz.domain.Physician, > scala.collection.Seq, scala.collection.Seq, java.lang.String, long, > java.lang.String, int, int, boolean, boolean, scala.collection.Seq, boolean, > scala.collection.Seq, boolean, scala.collection.mutable.Seq, > scala.collection.mutable.Seq)" > The relevant lines are: > build 17-Oct-2017 05:30:50/* 093 */ private scala.collection.Seq > argValue84; > build 17-Oct-2017 05:30:50/* 094 */ private scala.collection.Seq > argValue85; > and > build 17-Oct-2017 05:30:54/* 8217 */ final > com.xyz.xyz.xyz.domain.Account value1 = false ? null : new > com.xyz.xyz.xyz.domain.Account(argValue2, argValue3, argValue4, argValue5, > argValue6, argValue7, argValue8, argValue9, argValue10, argValue11, > argValue12, argValue13, argValue14, argValue15, argValue16, argValue17, > argValue18, argValue19, argValue20, argValue21, argValue22, argValue23, > argValue24, argValue25, argValue26, argValue27, argValue28, argValue29, > argValue30, argValue31, argValue32, argValue33, argValue34, argValue35, > argValue36, argValue37, argValue38, argValue39, argValue40, a
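A hypothetical, trimmed-down version of the pattern described in SPARK-22296: a case class whose last constructor parameters are scala.collection.mutable.Seq fields with defaults, used as a Dataset element type. The class and field names are illustrative, not from the reporter's domain model, and this small form is only a sketch of the shape the report describes, not a verified reproduction.
{code}
import scala.collection.mutable
import spark.implicits._   // pre-imported in spark-shell

case class Charge(amount: Double)
case class Account(
    id: String,
    name: String,
    chargesInst: mutable.Seq[Charge] = mutable.Seq.empty[Charge],
    chargesProf: mutable.Seq[Charge] = mutable.Seq.empty[Charge])

// Building a Dataset[Account] exercises the encoder/codegen path that, per the
// report, resolves the fields as scala.collection.Seq and then fails to find a
// matching constructor when compiling the generated projection.
val ds = Seq(Account("a1", "first")).toDS()
ds.show()
{code}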
[jira] [Resolved] (SPARK-22307) NOT condition working incorrectly
[ https://issues.apache.org/jira/browse/SPARK-22307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-22307. --- Resolution: Not A Problem > NOT condition working incorrectly > - > > Key: SPARK-22307 > URL: https://issues.apache.org/jira/browse/SPARK-22307 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0, 2.1.1 >Reporter: Andrey Yakovenko > Attachments: Catalog.json.gz > > > Suggest test case: table with x record filtered by expression expr returns y > records (< x), not(expr) does not returns x-y records. Work around: > when(expr, false).otherwise(true) is working fine. > {code} > val ctg = spark.sqlContext.read.json("/user/Catalog.json") > scala> ctg.printSchema > root > |-- Id: string (nullable = true) > |-- Name: string (nullable = true) > |-- Parent: struct (nullable = true) > ||-- Id: string (nullable = true) > ||-- Name: string (nullable = true) > ||-- Parent: struct (nullable = true) > |||-- Id: string (nullable = true) > |||-- Name: string (nullable = true) > |||-- Parent: struct (nullable = true) > ||||-- Id: string (nullable = true) > ||||-- Name: string (nullable = true) > ||||-- Parent: string (nullable = true) > ||||-- SKU: string (nullable = true) > |||-- SKU: string (nullable = true) > ||-- SKU: string (nullable = true) > |-- SKU: string (nullable = true) > val col1 = expr("Id IN ('13MXIIAA4', '13MXIBAA4')) OR (Parent.Id IN > ('13MXIIAA4', '13MXIBAA4'))) OR (Parent.Parent.Id IN ('13MXIIAA4', > '13MXIBAA4'))) OR (Parent.Parent.Parent.Id IN ('13MXIIAA4', '13MXIBAA4')))") > col1: org.apache.spark.sql.Column = Id IN (13MXIIAA4, 13MXIBAA4)) OR > (Parent.Id IN (13MXIIAA4, 13MXIBAA4))) OR (Parent.Parent.Id IN (13MXIIAA4, > 13MXIBAA4))) OR (Parent.Parent.Parent.Id IN (13MXIIAA4, 13MXIBAA4))) > scala> ctg.count > res5: Long = 623 > scala> ctg.filter(col1).count > res2: Long = 2 > scala> ctg.filter(not(col1)).count > res3: Long = 4 > scala> ctg.filter(when(col1, false).otherwise(true)).count > res4: Long = 621 > {code} > Table is hierarchy like structure and has a records with different number of > levels filled up. I have a suspicion that due to partly filled hierarchy > condition return null/undefined/failed/nan some times (neither true or false). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
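The "Not A Problem" resolution comes down to SQL three-valued logic: for rows where the hierarchy is only partly filled, the IN comparison on a null parent yields NULL, and both expr and not(expr) evaluate to NULL, so a filter drops the row in both cases. A minimal sketch of that behaviour follows; column names loosely mirror the report, but the data and counts are made up for illustration.
{code}
import org.apache.spark.sql.functions._
import spark.implicits._   // pre-imported in spark-shell

val rows = Seq(
  ("13MXIIAA4", Some("p1")),   // matches the IN list
  ("other-id",  Some("p2")),   // does not match
  ("leaf-id",   None)          // no parent: the comparison yields NULL
).toDF("Id", "ParentId")

val cond = col("Id").isin("13MXIIAA4", "13MXIBAA4") ||
  col("ParentId").isin("13MXIIAA4", "13MXIBAA4")

rows.filter(cond).count()                                // 1
rows.filter(not(cond)).count()                           // 1: the NULL row is dropped too
rows.filter(when(cond, false).otherwise(true)).count()   // 2: NULL treated as "not matched"
{code}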
[jira] [Updated] (SPARK-21551) pyspark's collect fails when getaddrinfo is too slow
[ https://issues.apache.org/jira/browse/SPARK-21551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-21551: -- Fix Version/s: 2.1.3 > pyspark's collect fails when getaddrinfo is too slow > > > Key: SPARK-21551 > URL: https://issues.apache.org/jira/browse/SPARK-21551 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 >Reporter: peay >Assignee: peay >Priority: Critical > Fix For: 2.1.3, 2.2.1, 2.3.0 > > > Pyspark's {{RDD.collect}}, as well as {{DataFrame.toLocalIterator}} and > {{DataFrame.collect}} all work by starting an ephemeral server in the driver, > and having Python connect to it to download the data. > All three are implemented along the lines of: > {code} > port = self._jdf.collectToPython() > return list(_load_from_socket(port, BatchedSerializer(PickleSerializer( > {code} > The server has **a hardcoded timeout of 3 seconds** > (https://github.com/apache/spark/blob/e26dac5feb02033f980b1e69c9b0ff50869b6f9e/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L695) > -- i.e., the Python process has 3 seconds to connect to it from the very > moment the driver server starts. > In general, that seems fine, but I have been encountering frequent timeouts > leading to `Exception: could not open socket`. > After investigating a bit, it turns out that {{_load_from_socket}} makes a > call to {{getaddrinfo}}: > {code} > def _load_from_socket(port, serializer): > sock = None > # Support for both IPv4 and IPv6. > # On most of IPv6-ready systems, IPv6 will take precedence. > for res in socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, > socket.SOCK_STREAM): >.. connect .. > {code} > I am not sure why, but while most such calls to {{getaddrinfo}} on my machine > only take a couple milliseconds, about 10% of them take between 2 and 10 > seconds, leading to about 10% of jobs failing. I don't think we can always > expect {{getaddrinfo}} to return instantly. More generally, Python may > sometimes pause for a couple seconds, which may not leave enough time for the > process to connect to the server. > Especially since the server timeout is hardcoded, I think it would be best to > set a rather generous value (15 seconds?) to avoid such situations. > A {{getaddrinfo}} specific fix could avoid doing it every single time, or do > it before starting up the driver server. > > cc SPARK-677 [~davies] -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22309) Remove unused param in `LDAModel.getTopicDistributionMethod`
[ https://issues.apache.org/jira/browse/SPARK-22309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-22309. --- Resolution: Duplicate This is actually just an addition to your change in SPARK-20930 > Remove unused param in `LDAModel.getTopicDistributionMethod` > > > Key: SPARK-22309 > URL: https://issues.apache.org/jira/browse/SPARK-22309 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: zhengruifeng >Priority: Trivial > > 1, Param {{sc: SparkContext}} in {{LocalLDAModel.getTopicDistributionMethod}} > is redundant. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22318) spark stream Kafka hang at JavaStreamingContext.start, no spark job create
[ https://issues.apache.org/jira/browse/SPARK-22318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-22318. --- Resolution: Invalid There's no bug reported here. It's normal that the job waits at start() forever. You need to narrow it down much further; maybe there is no data arriving. > spark stream Kafka hang at JavaStreamingContext.start, no spark job create > -- > > Key: SPARK-22318 > URL: https://issues.apache.org/jira/browse/SPARK-22318 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.1.0 > Environment: OS:Red Hat Enterprise Linux Server release 6.5 > JRE:Oracle 1.8.0.144-b01 > spark-streaming_2.11:2.1.0 > spark-streaming-kafka-0-10_2.11:2.1.0 >Reporter: iceriver322 > > spark stream Kafka jar submitted by spark-submit to standalone spark cluster, > and running well for a few days. But recently, we find that no new job > generated for the stream, we tried to restart the job, and restart the > cluster, the stream just stuck at JavaStreamingContext.start, and WAITING > (on object monitor). Thread dump : > 2017-10-19 16:44:23 > Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.144-b01 mixed mode): > "Attach Listener" #82 daemon prio=9 os_prio=0 tid=0x7f76f0002000 > nid=0x3f80 waiting on condition [0x] >java.lang.Thread.State: RUNNABLE >Locked ownable synchronizers: > - None > "SparkUI-JettyScheduler" #81 daemon prio=5 os_prio=0 tid=0x7f76ac002800 > nid=0x3d5c waiting on condition [0x7f7693bfa000] >java.lang.Thread.State: TIMED_WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0xfa19f940> (a > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) > at > java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078) > at > java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093) > at > java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809) > at > java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) >Locked ownable synchronizers: > - None > "shuffle-server-3-4" #35 daemon prio=5 os_prio=0 tid=0x7f76a0041800 > nid=0x3d34 runnable [0x7f76911e5000] >java.lang.Thread.State: RUNNABLE > at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) > at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269) > at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93) > at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86) > - locked <0xf8ea3be8> (a > io.netty.channel.nio.SelectedSelectionKeySet) > - locked <0xf8ee3600> (a > java.util.Collections$UnmodifiableSet) > - locked <0xf8ea3ae0> (a sun.nio.ch.EPollSelectorImpl) > at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97) > at io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:760) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:401) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140) > at > io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144) > at java.lang.Thread.run(Thread.java:748) >Locked ownable 
synchronizers: > - None > "shuffle-server-3-3" #34 daemon prio=5 os_prio=0 tid=0x7f76a0040800 > nid=0x3d33 runnable [0x7f76912e6000] >java.lang.Thread.State: RUNNABLE > at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) > at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269) > at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93) > at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86) > - locked <0xfc2747c0> (a > io.netty.channel.nio.SelectedSelectionKeySet) > - locked <0xfc2874c0> (a > java.util.Collections$UnmodifiableSet) > - locked <0xfc2746c8> (a sun.nio.ch.EPollSelectorImpl) > at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97) > at io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:760) > at io.netty.channel.nio.NioEventLoop.run(
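For comparison with the setup described in SPARK-22318, here is a generic sketch of the usual driver flow for the 0-10 direct Kafka stream: create the context and stream, call start(), then block in awaitTermination() for the lifetime of the application. Broker addresses, group id and topic names are placeholders; nothing here is taken from the reporter's job, and it does not by itself explain the reported hang.
{code}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

val conf = new SparkConf().setAppName("kafka-stream-sketch")
val ssc = new StreamingContext(conf, Seconds(10))

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker1:9092",          // placeholder
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "example-group",                  // placeholder
  "auto.offset.reset" -> "latest")

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("example-topic"), kafkaParams))

// A trivial per-batch action so batches actually produce output.
stream.map(record => record.value).count().print()

ssc.start()             // starts the streaming machinery
ssc.awaitTermination()  // the driver then blocks here for the lifetime of the app
{code}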