[jira] [Comment Edited] (SPARK-14520) ClasscastException thrown with spark.sql.parquet.enableVectorizedReader=true

2016-04-10 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15234496#comment-15234496
 ] 

Liang-Chi Hsieh edited comment on SPARK-14520 at 4/11/16 5:09 AM:
--

Hi [~rajesh.balamohan], I submitted a PR for this issue. Can you try it? Thanks!


was (Author: viirya):
Hi [~Rajesh Balamohan], I submitted a PR for this issue. Can you try it? Thanks!

> ClasscastException thrown with spark.sql.parquet.enableVectorizedReader=true
> 
>
> Key: SPARK-14520
> URL: https://issues.apache.org/jira/browse/SPARK-14520
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Rajesh Balamohan
>
> Build details: Spark build from master branch (Apr-10)
> TPC-DS at 200 GB scale, stored in Parquet format in Hive.
> Ran TPC-DS Query27 via Spark beeline client with 
> "spark.sql.sources.fileScan=false".
> {noformat}
>  java.lang.ClassCastException: 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader
>  cannot be cast to org.apache.parquet.hadoop.ParquetRecordReader
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetInputFormat.createRecordReader(ParquetRelation.scala:480)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetInputFormat.createRecordReader(ParquetRelation.scala:476)
> at 
> org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.(SqlNewHadoopRDD.scala:161)
> at 
> org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:121)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69)
> at org.apache.spark.scheduler.Task.run(Task.scala:82)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:231)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Creating this JIRA as a placeholder to track this issue.
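The cast failure above is a sibling-subclass problem: both readers are Hadoop RecordReaders, 
but the vectorized reader does not extend ParquetRecordReader, so the down-cast in 
createRecordReader can never succeed. A minimal, self-contained Scala sketch of that pattern 
(the class names below are stand-ins, not the actual Spark or Parquet classes):

{code}
// Stand-in classes only: two readers extend the same base, but neither extends
// the other, so casting one to the other fails at runtime, as in the trace above.
abstract class RecordReaderBase { def nextKeyValue(): Boolean }
class ParquetReaderStub extends RecordReaderBase { def nextKeyValue(): Boolean = false }
class VectorizedReaderStub extends RecordReaderBase { def nextKeyValue(): Boolean = false }

val reader: RecordReaderBase = new VectorizedReaderStub
reader.asInstanceOf[ParquetReaderStub]   // throws java.lang.ClassCastException
{code}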






[jira] [Commented] (SPARK-14520) ClasscastException thrown with spark.sql.parquet.enableVectorizedReader=true

2016-04-10 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15234496#comment-15234496
 ] 

Liang-Chi Hsieh commented on SPARK-14520:
-

Hi [~Rajesh Balamohan], I submitted a PR for this issue. Can you try it? Thanks!

> ClasscastException thrown with spark.sql.parquet.enableVectorizedReader=true
> 
>
> Key: SPARK-14520
> URL: https://issues.apache.org/jira/browse/SPARK-14520
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Rajesh Balamohan
>
> Build details: Spark build from master branch (Apr-10)
> TPC-DS at 200 GB scale, stored in Parquet format in Hive.
> Ran TPC-DS Query27 via Spark beeline client with 
> "spark.sql.sources.fileScan=false".
> {noformat}
>  java.lang.ClassCastException: 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader
>  cannot be cast to org.apache.parquet.hadoop.ParquetRecordReader
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetInputFormat.createRecordReader(ParquetRelation.scala:480)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetInputFormat.createRecordReader(ParquetRelation.scala:476)
> at 
> org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.(SqlNewHadoopRDD.scala:161)
> at 
> org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:121)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69)
> at org.apache.spark.scheduler.Task.run(Task.scala:82)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:231)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Creating this JIRA as a placeholder to track this issue.






[jira] [Assigned] (SPARK-14520) ClasscastException thrown with spark.sql.parquet.enableVectorizedReader=true

2016-04-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14520:


Assignee: Apache Spark

> ClasscastException thrown with spark.sql.parquet.enableVectorizedReader=true
> 
>
> Key: SPARK-14520
> URL: https://issues.apache.org/jira/browse/SPARK-14520
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Rajesh Balamohan
>Assignee: Apache Spark
>
> Build details: Spark build from master branch (Apr-10)
> TPC-DS at 200 GB scale, stored in Parquet format in Hive.
> Ran TPC-DS Query27 via Spark beeline client with 
> "spark.sql.sources.fileScan=false".
> {noformat}
>  java.lang.ClassCastException: 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader
>  cannot be cast to org.apache.parquet.hadoop.ParquetRecordReader
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetInputFormat.createRecordReader(ParquetRelation.scala:480)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetInputFormat.createRecordReader(ParquetRelation.scala:476)
> at 
> org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.(SqlNewHadoopRDD.scala:161)
> at 
> org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:121)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69)
> at org.apache.spark.scheduler.Task.run(Task.scala:82)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:231)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Creating this JIRA as a placeholder to track this issue.






[jira] [Commented] (SPARK-14520) ClasscastException thrown with spark.sql.parquet.enableVectorizedReader=true

2016-04-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15234495#comment-15234495
 ] 

Apache Spark commented on SPARK-14520:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/12292

> ClasscastException thrown with spark.sql.parquet.enableVectorizedReader=true
> 
>
> Key: SPARK-14520
> URL: https://issues.apache.org/jira/browse/SPARK-14520
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Rajesh Balamohan
>
> Build details: Spark build from master branch (Apr-10)
> TPC-DS at 200 GB scale, stored in Parquet format in Hive.
> Ran TPC-DS Query27 via Spark beeline client with 
> "spark.sql.sources.fileScan=false".
> {noformat}
>  java.lang.ClassCastException: 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader
>  cannot be cast to org.apache.parquet.hadoop.ParquetRecordReader
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetInputFormat.createRecordReader(ParquetRelation.scala:480)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetInputFormat.createRecordReader(ParquetRelation.scala:476)
> at 
> org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.(SqlNewHadoopRDD.scala:161)
> at 
> org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:121)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69)
> at org.apache.spark.scheduler.Task.run(Task.scala:82)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:231)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Creating this JIRA as a placeholder to track this issue.






[jira] [Assigned] (SPARK-14520) ClasscastException thrown with spark.sql.parquet.enableVectorizedReader=true

2016-04-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14520:


Assignee: (was: Apache Spark)

> ClasscastException thrown with spark.sql.parquet.enableVectorizedReader=true
> 
>
> Key: SPARK-14520
> URL: https://issues.apache.org/jira/browse/SPARK-14520
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Rajesh Balamohan
>
> Build details: Spark build from master branch (Apr-10)
> TPC-DS at 200 GB scale, stored in Parquet format in Hive.
> Ran TPC-DS Query27 via Spark beeline client with 
> "spark.sql.sources.fileScan=false".
> {noformat}
>  java.lang.ClassCastException: 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader
>  cannot be cast to org.apache.parquet.hadoop.ParquetRecordReader
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetInputFormat.createRecordReader(ParquetRelation.scala:480)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetInputFormat.createRecordReader(ParquetRelation.scala:476)
> at 
> org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.(SqlNewHadoopRDD.scala:161)
> at 
> org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:121)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69)
> at org.apache.spark.scheduler.Task.run(Task.scala:82)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:231)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Creating this JIRA as a placeholder to track this issue.






[jira] [Comment Edited] (SPARK-13352) BlockFetch does not scale well on large block

2016-04-10 Thread Zhang, Liye (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15234488#comment-15234488
 ] 

Zhang, Liye edited comment on SPARK-13352 at 4/11/16 5:02 AM:
--

Hi [~davies], I think this JIRA is related to 
[SPARK-14242|https://issues.apache.org/jira/browse/SPARK-14242] and 
[SPARK-14290|https://issues.apache.org/jira/browse/SPARK-14290]. Can you test 
with the Spark master branch again to see if this issue still exists?


was (Author: liyezhang556520):
Hi [~davies], I think this JIRA is related with 
[SPARK-14242|https://issues.apache.org/jira/browse/SPARK-142242] and 
[SPARK-14290|https://issues.apache.org/jira/browse/SPARK-14290], can you test 
with spark master again to see if this issue still exists?

> BlockFetch does not scale well on large block
> -
>
> Key: SPARK-13352
> URL: https://issues.apache.org/jira/browse/SPARK-13352
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Reporter: Davies Liu
>Priority: Critical
>
> BlockManager.getRemoteBytes() performs poorly on large blocks
> {code}
>   test("block manager") {
> val N = 500 << 20
> val bm = sc.env.blockManager
> val blockId = TaskResultBlockId(0)
> val buffer = ByteBuffer.allocate(N)
> buffer.limit(N)
> bm.putBytes(blockId, buffer, StorageLevel.MEMORY_AND_DISK_SER)
> val result = bm.getRemoteBytes(blockId)
> assert(result.isDefined)
> assert(result.get.limit() === (N))
>   }
> {code}
> Here are the runtimes for different block sizes:
> {code}
> 50M     3 seconds
> 100M    7 seconds
> 250M    33 seconds
> 500M    2 min
> {code}






[jira] [Commented] (SPARK-14253) Avoid registering temporary functions in Hive

2016-04-10 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15234492#comment-15234492
 ] 

Liang-Chi Hsieh commented on SPARK-14253:
-

This can be closed now.

> Avoid registering temporary functions in Hive
> -
>
> Key: SPARK-14253
> URL: https://issues.apache.org/jira/browse/SPARK-14253
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> Spark should handle all temporary functions itself instead of passing them to 
> Hive, which is what the HiveFunctionRegistry does at the moment. The extra 
> call to Hive is unnecessary and potentially slow, and it makes the semantics 
> of a "temporary function" confusing.
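For readers following along, the "temporary function" in question is the kind of 
session-scoped UDF registered through Spark's own function registry; a small 
sketch (sqlContext assumed in scope, as in spark-shell):

{code}
// A temporary function: registered for this session via Spark's UDF registry.
// The JIRA argues this registration should not be forwarded to Hive at all.
sqlContext.udf.register("my_upper", (s: String) => s.toUpperCase)
sqlContext.sql("SELECT my_upper('spark')").show()
{code}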






[jira] [Commented] (SPARK-13352) BlockFetch does not scale well on large block

2016-04-10 Thread Zhang, Liye (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15234488#comment-15234488
 ] 

Zhang, Liye commented on SPARK-13352:
-

Hi [~davies], I think this JIRA is related to 
[SPARK-14242|https://issues.apache.org/jira/browse/SPARK-14242] and 
[SPARK-14290|https://issues.apache.org/jira/browse/SPARK-14290]. Can you test 
with the Spark master branch again to see if this issue still exists?

> BlockFetch does not scale well on large block
> -
>
> Key: SPARK-13352
> URL: https://issues.apache.org/jira/browse/SPARK-13352
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Reporter: Davies Liu
>Priority: Critical
>
> BlockManager.getRemoteBytes() performs poorly on large blocks
> {code}
>   test("block manager") {
> val N = 500 << 20
> val bm = sc.env.blockManager
> val blockId = TaskResultBlockId(0)
> val buffer = ByteBuffer.allocate(N)
> buffer.limit(N)
> bm.putBytes(blockId, buffer, StorageLevel.MEMORY_AND_DISK_SER)
> val result = bm.getRemoteBytes(blockId)
> assert(result.isDefined)
> assert(result.get.limit() === (N))
>   }
> {code}
> Here are the runtimes for different block sizes:
> {code}
> 50M     3 seconds
> 100M    7 seconds
> 250M    33 seconds
> 500M    2 min
> {code}






[jira] [Created] (SPARK-14525) DataFrameWriter's save method should delegate to jdbc for jdbc datasource

2016-04-10 Thread Justin Pihony (JIRA)
Justin Pihony created SPARK-14525:
-

 Summary: DataFrameWriter's save method should delegate to jdbc for 
jdbc datasource
 Key: SPARK-14525
 URL: https://issues.apache.org/jira/browse/SPARK-14525
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.6.1
Reporter: Justin Pihony
Priority: Minor


If you call {code}df.write.format("jdbc")...save(){code} then you get an error  
bq. org.apache.spark.sql.execution.datasources.jdbc.DefaultSource does not 
allow create table as select

save is a more intuitive guess at the appropriate method to call, so the user 
should not be punished for not knowing about the jdbc method. 

Obviously, this will require the caller to have set up the correct parameters 
for jdbc to work :)
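A hedged sketch of the two call paths being compared, with df assumed to be an 
existing DataFrame and the URL, table name and credentials below as placeholders; 
the first call is the one that currently fails with the error quoted above:

{code}
import java.util.Properties
import org.apache.spark.sql.SaveMode

val jdbcUrl = "jdbc:postgresql://dbhost:5432/mydb"   // placeholder URL
val props = new Properties()
props.setProperty("user", "spark")
props.setProperty("password", "secret")

// What users naturally try first -- currently rejected by the jdbc DefaultSource:
df.write.format("jdbc")
  .option("url", jdbcUrl)
  .option("dbtable", "my_table")
  .save()

// The existing method that works, and that save() could delegate to:
df.write.mode(SaveMode.Append).jdbc(jdbcUrl, "my_table", props)
{code}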






[jira] [Updated] (SPARK-14486) For partition table, the dag occurs oom because of too many same rdds

2016-04-10 Thread meiyoula (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

meiyoula updated SPARK-14486:
-
Description: 
For a partitioned table, when the partition RDDs go through some map 
transformations, the number of RDDs grows multiplicatively, so the RDD count in 
the DAG can reach the thousands and the DAG view runs out of memory (OOM).

Can we make an improvement to reduce the number of RDDs shown in the DAG, i.e. 
show identical RDDs only once instead of once per partition?

As the screenshot shows, the "HiveTableScan" cluster has thousands of identical RDDs.

  was:
For partition table, when partition rdds do some maps, the rdd number will 
multiple grow. So rdd number in dag will become thounds, and occurs oom.

Can we make a improvement to reduce the rdd number in dag. show the same rdds 
just one time, not each partition.


> For partition table, the dag occurs oom because of too many same rdds
> -
>
> Key: SPARK-14486
> URL: https://issues.apache.org/jira/browse/SPARK-14486
> Project: Spark
>  Issue Type: Bug
>Reporter: meiyoula
> Attachments: screenshot-1.png
>
>
> For a partitioned table, when the partition RDDs go through some map 
> transformations, the number of RDDs grows multiplicatively, so the RDD count 
> in the DAG can reach the thousands and the DAG view runs out of memory (OOM).
> Can we make an improvement to reduce the number of RDDs shown in the DAG, i.e. 
> show identical RDDs only once instead of once per partition?
> As the screenshot shows, the "HiveTableScan" cluster has thousands of 
> identical RDDs.






[jira] [Updated] (SPARK-14486) For partition table, the dag occurs oom because of too many same rdds

2016-04-10 Thread meiyoula (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

meiyoula updated SPARK-14486:
-
Attachment: screenshot-1.png

> For partition table, the dag occurs oom because of too many same rdds
> -
>
> Key: SPARK-14486
> URL: https://issues.apache.org/jira/browse/SPARK-14486
> Project: Spark
>  Issue Type: Bug
>Reporter: meiyoula
> Attachments: screenshot-1.png
>
>
> For a partitioned table, when the partition RDDs go through some map 
> transformations, the number of RDDs grows multiplicatively, so the RDD count 
> in the DAG can reach the thousands and the DAG view runs out of memory (OOM).
> Can we make an improvement to reduce the number of RDDs shown in the DAG, i.e. 
> show identical RDDs only once instead of once per partition?






[jira] [Created] (SPARK-14524) In SparkSQL, columns of String type can't be selected correctly because of UTF8String when more than 32G is set for executors.

2016-04-10 Thread Deng Changchun (JIRA)
Deng Changchun created SPARK-14524:
--

 Summary: In SparkSQL, columns of String type can't be selected correctly 
because of UTF8String when more than 32G is set for executors.
 Key: SPARK-14524
 URL: https://issues.apache.org/jira/browse/SPARK-14524
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.2
 Environment: Centos
Reporter: Deng Changchun
Priority: Critical


(Related issue: https://github.com/apache/spark/pull/8210/files)

When we set 32G (or more) for the executors and select a column of String type, 
it shows the wrong result, for example:
'abcde' (fewer than 8 chars) => '' (nothing is shown)
'abcdefghijklmn' (more than 8 chars) => 'ijklmn' (the first 8 chars are cut off)

However, when we set 31G (or less) for the executors, everything is fine.

We have also debugged this problem and found that SparkSQL uses UTF8String 
internally, which depends on properties of the local JVM memory allocation (see 
the class 'org.apache.spark.unsafe.Platform').
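One plausible reading of the symptom, offered as an assumption rather than a 
confirmed diagnosis: heaps of 32G and above disable compressed oops, which 
changes the array base offset the JVM reports (typically 16 vs. 24 bytes on 
HotSpot), and that 8-byte difference matches the 8-character shift described 
above. A small Scala probe for the offset, using only the JDK:

{code}
// Prints the byte[] base offset; compare the value under -Xmx31g vs. -Xmx32g.
// With compressed oops it is typically 16, without them 24 -- an 8-byte difference.
val unsafeField = classOf[sun.misc.Unsafe].getDeclaredField("theUnsafe")
unsafeField.setAccessible(true)
val unsafe = unsafeField.get(null).asInstanceOf[sun.misc.Unsafe]
println(s"byte[] base offset: ${unsafe.arrayBaseOffset(classOf[Array[Byte]])}")
{code}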






[jira] [Created] (SPARK-14523) Feature parity for Statistics ML with MLlib

2016-04-10 Thread yuhao yang (JIRA)
yuhao yang created SPARK-14523:
--

 Summary: Feature parity for Statistics ML with MLlib
 Key: SPARK-14523
 URL: https://issues.apache.org/jira/browse/SPARK-14523
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: yuhao yang


Some statistics functions are already supported directly on DataFrames. Use this 
JIRA to discuss/design the statistics package in Spark.ML and its function 
scope. Hypothesis testing and correlation computation may still need to expose 
independent interfaces.
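As a concrete illustration of the current split (a sketch only; df is assumed to 
be an existing DataFrame and the column names are placeholders): correlation is 
reachable from the DataFrame API, while hypothesis tests still live in the 
RDD-based spark.mllib package.

{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

// DataFrame side: basic statistics already exposed through df.stat
val corr = df.stat.corr("price", "quantity")

// spark.mllib side: hypothesis testing has no DataFrame-based equivalent yet
val observed = Vectors.dense(4.0, 6.0, 5.0)
println(Statistics.chiSqTest(observed))   // goodness-of-fit chi-square test
{code}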






[jira] [Created] (SPARK-14522) Getting an error of BoneCP specified but not present in CLASSPATH

2016-04-10 Thread Niranjan Molkeri` (JIRA)
Niranjan Molkeri` created SPARK-14522:
-

 Summary: Getting an error of BoneCP specified but not present in 
CLASSPATH
 Key: SPARK-14522
 URL: https://issues.apache.org/jira/browse/SPARK-14522
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 1.6.0
 Environment: Mac OsX
Reporter: Niranjan Molkeri`
Priority: Minor


When I type the spark-shell command to start up the Spark shell, I get the 
following warnings:

16/04/08 19:05:16 WARN Connection: BoneCP specified but not present in 
CLASSPATH (or one of dependencies)
16/04/08 19:05:16 WARN Connection: BoneCP specified but not present in 
CLASSPATH (or one of dependencies)







[jira] [Commented] (SPARK-14521) StackOverflowError in Kryo when executing TPC-DS Query27

2016-04-10 Thread Rajesh Balamohan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15234380#comment-15234380
 ] 

Rajesh Balamohan commented on SPARK-14521:
--

A build with commit f8c9beca38f1f396eb3220b23db6d77112a50293 does not have this 
issue. I suspect it is related to the Kryo 3.0.3 upgrade.
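Not the root-cause fix this JIRA is after, but a common stopgap worth noting as 
an assumption: Kryo serializes nested collections recursively, so a deep object 
graph can exhaust the default thread stack, and a larger -Xss sometimes 
sidesteps the StackOverflowError while the upgrade suspicion is investigated. 
A sketch:

{code}
// Hedged workaround sketch only: enlarge the thread stack for driver and executors.
// Note: the driver option in particular only takes effect when supplied at launch
// (e.g. in spark-defaults.conf or via spark-submit --conf), not on a running driver.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.driver.extraJavaOptions", "-Xss16m")
  .set("spark.executor.extraJavaOptions", "-Xss16m")
{code}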

> StackOverflowError in Kryo when executing TPC-DS Query27
> 
>
> Key: SPARK-14521
> URL: https://issues.apache.org/jira/browse/SPARK-14521
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Rajesh Balamohan
>
> Build details:  Spark build from master branch (Apr-10)
> DataSet: TPC-DS at 200 GB scale in Parquet format stored in Hive.
> Client: $SPARK_HOME/bin/beeline 
> Query:  TPC-DS Query27
> spark.sql.sources.fileScan=true (this is the default value anyways)
> Exception:
> {noformat}
> Exception in thread "broadcast-exchange-0" java.lang.StackOverflowError
> at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.writeName(DefaultClassResolver.java:108)
> at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:99)
> at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:517)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:622)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> {noformat}






[jira] [Updated] (SPARK-14521) StackOverflowError in Kryo when executing TPC-DS Query27

2016-04-10 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated SPARK-14521:
-
Summary: StackOverflowError in Kryo when executing TPC-DS Query27  (was: 
StackOverflowError when executing TPC-DS Query27)

> StackOverflowError in Kryo when executing TPC-DS Query27
> 
>
> Key: SPARK-14521
> URL: https://issues.apache.org/jira/browse/SPARK-14521
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Rajesh Balamohan
>
> Build details:  Spark build from master branch (Apr-10)
> DataSet: TPC-DS at 200 GB scale in Parquet format stored in Hive.
> Client: $SPARK_HOME/bin/beeline 
> Query:  TPC-DS Query27
> spark.sql.sources.fileScan=true (this is the default value anyways)
> Exception:
> {noformat}
> Exception in thread "broadcast-exchange-0" java.lang.StackOverflowError
> at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.writeName(DefaultClassResolver.java:108)
> at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:99)
> at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:517)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:622)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> {noformat}






[jira] [Created] (SPARK-14521) StackOverflowError when executing TPC-DS Query27

2016-04-10 Thread Rajesh Balamohan (JIRA)
Rajesh Balamohan created SPARK-14521:


 Summary: StackOverflowError when executing TPC-DS Query27
 Key: SPARK-14521
 URL: https://issues.apache.org/jira/browse/SPARK-14521
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Rajesh Balamohan


Build details:  Spark build from master branch (Apr-10)

DataSet: TPC-DS at 200 GB scale in Parquet format stored in Hive.

Client: $SPARK_HOME/bin/beeline 

Query:  TPC-DS Query27

spark.sql.sources.fileScan=true (this is the default value anyways)

Exception:
{noformat}
Exception in thread "broadcast-exchange-0" java.lang.StackOverflowError
at 
com.esotericsoftware.kryo.util.DefaultClassResolver.writeName(DefaultClassResolver.java:108)
at 
com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:99)
at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:517)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:622)
at 
com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
at 
com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
at 
com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
at 
com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
at 
com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
at 
com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
at 
com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
at 
com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
at 
com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
at 
com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
at 
com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
at 
com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
at 
com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
at 
com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
{noformat}






[jira] [Updated] (SPARK-14520) ClasscastException thrown with spark.sql.parquet.enableVectorizedReader=true

2016-04-10 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated SPARK-14520:
-
Description: 
Build details: Spark build from master branch (Apr-10)

TPC-DS at 200 GB scale, stored in Parquet format in Hive.

Ran TPC-DS Query27 via Spark beeline client with 
"spark.sql.sources.fileScan=false".

{noformat}
 java.lang.ClassCastException: 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader
 cannot be cast to org.apache.parquet.hadoop.ParquetRecordReader
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetInputFormat.createRecordReader(ParquetRelation.scala:480)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetInputFormat.createRecordReader(ParquetRelation.scala:476)
at 
org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.(SqlNewHadoopRDD.scala:161)
at 
org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:121)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69)
at org.apache.spark.scheduler.Task.run(Task.scala:82)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:231)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{noformat}



Creating this JIRA as a placeholder to track this issue.


  was:
Build details: Spark build from master branch (Apr-10)

TPC-DS at 200 GB scale stored in Parq format stored in hive.

Ran TPC-DS Query27 via Spark beeline client.

{noformat}
 java.lang.ClassCastException: 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader
 cannot be cast to org.apache.parquet.hadoop.ParquetRecordReader
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetInputFormat.createRecordReader(ParquetRelation.scala:480)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetInputFormat.createRecordReader(ParquetRelation.scala:476)
at 
org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.(SqlNewHadoopRDD.scala:161)
at 
org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:121)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69)
at org.apache.spark.scheduler.Task.run(Task.scala:82)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:231)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{noformat}


Creating this JIRA as a placeholder to track this issue.



> ClasscastException thrown with spark.sql.parquet.enableVectorizedReader=true
> 
>
> Key: SPARK-14520
> URL: https://issues.apache.org/jira/browse/SPARK-14520
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Rajesh Balamohan
>
> Build details: Spark build from master branch (Apr-10)
> TPC-DS at 200 GB scale, stored in Parquet format in Hive.
> Ran TPC-DS Query27 via Spark beeline client with 
> "spark.sql.sources.fileScan=false".
> {noformat}
>  java.lang.ClassCastException: 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader
>  cannot be cast to org.apache.parquet.hadoop.ParquetRecordReader
> at 
> 

[jira] [Updated] (SPARK-14520) ClasscastException thrown with spark.sql.parquet.enableVectorizedReader=true

2016-04-10 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated SPARK-14520:
-
Description: 
Build details: Spark build from master branch (Apr-10)

TPC-DS at 200 GB scale, stored in Parquet format in Hive.

Ran TPC-DS Query27 via Spark beeline client.

{noformat}
 java.lang.ClassCastException: 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader
 cannot be cast to org.apache.parquet.hadoop.ParquetRecordReader
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetInputFormat.createRecordReader(ParquetRelation.scala:480)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetInputFormat.createRecordReader(ParquetRelation.scala:476)
at 
org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.(SqlNewHadoopRDD.scala:161)
at 
org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:121)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69)
at org.apache.spark.scheduler.Task.run(Task.scala:82)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:231)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{noformat}


Creating this JIRA as a placeholder to track this issue.


  was:
Build details: Spark build from master branch (Apr-10)

TPC-DS at 200 GB scale stored in Parq format stored in hive.

Ran TPC-DS Query27 via Spark beeline client.

{noformat}
 java.lang.ClassCastException: 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader
 cannot be cast to org.apache.parquet.hadoop.ParquetRecordReader
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetInputFormat.createRecordReader(ParquetRelation.scala:480)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetInputFormat.createRecordReader(ParquetRelation.scala:476)
at 
org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.(SqlNewHadoopRDD.scala:161)
at 
org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:121)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69)
at org.apache.spark.scheduler.Task.run(Task.scala:82)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:231)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{noformat}


Creating this JIRA as a placeholder to fix this issue.



> ClasscastException thrown with spark.sql.parquet.enableVectorizedReader=true
> 
>
> Key: SPARK-14520
> URL: https://issues.apache.org/jira/browse/SPARK-14520
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Rajesh Balamohan
>
> Build details: Spark build from master branch (Apr-10)
> TPC-DS at 200 GB scale, stored in Parquet format in Hive.
> Ran TPC-DS Query27 via Spark beeline client.
> {noformat}
>  java.lang.ClassCastException: 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader
>  cannot be cast to org.apache.parquet.hadoop.ParquetRecordReader
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetInputFormat.createRecordReader(ParquetRelation.scala:480)
> at 
> 

[jira] [Created] (SPARK-14520) ClasscastException thrown with spark.sql.parquet.enableVectorizedReader=true

2016-04-10 Thread Rajesh Balamohan (JIRA)
Rajesh Balamohan created SPARK-14520:


 Summary: ClasscastException thrown with 
spark.sql.parquet.enableVectorizedReader=true
 Key: SPARK-14520
 URL: https://issues.apache.org/jira/browse/SPARK-14520
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Rajesh Balamohan


Build details: Spark build from master branch (Apr-10)

TPC-DS at 200 GB scale, stored in Parquet format in Hive.

Ran TPC-DS Query27 via Spark beeline client.

{noformat}
 java.lang.ClassCastException: 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader
 cannot be cast to org.apache.parquet.hadoop.ParquetRecordReader
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetInputFormat.createRecordReader(ParquetRelation.scala:480)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetInputFormat.createRecordReader(ParquetRelation.scala:476)
at 
org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.(SqlNewHadoopRDD.scala:161)
at 
org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:121)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69)
at org.apache.spark.scheduler.Task.run(Task.scala:82)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:231)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{noformat}


Creating this JIRA as a placeholder to fix this issue.







[jira] [Commented] (SPARK-14419) Improve the HashedRelation for key fit within Long

2016-04-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15234353#comment-15234353
 ] 

Apache Spark commented on SPARK-14419:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/12289

> Improve the HashedRelation for key fit within Long
> --
>
> Key: SPARK-14419
> URL: https://issues.apache.org/jira/browse/SPARK-14419
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>
> 1. Manage the memory by MemoryManager
> 2. Improve the memory efficiency
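A rough sketch of what "key fit within Long" buys, using a stand-in collection 
rather than Spark's internal HashedRelation API: when the join key is a single 
integral column, the build side can be keyed by a primitive Long, avoiding 
per-key objects, and the flat backing arrays are easier to account for through 
the memory manager.

{code}
// Stand-in illustration only (scala.collection.mutable.LongMap, not Spark's internals):
// primitive-long keys avoid boxing and per-key objects on the hash-join build side.
import scala.collection.mutable

val buildSide = mutable.LongMap.empty[Array[AnyRef]]
buildSide.update(42L, Array[AnyRef]("row payload for key 42"))
println(buildSide.get(42L))
{code}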






[jira] [Commented] (SPARK-14289) Support multiple eviction strategies for cached RDD partitions

2016-04-10 Thread Ben Manes (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15234267#comment-15234267
 ] 

Ben Manes commented on SPARK-14289:
---

How about [TinyLFU|https://github.com/ben-manes/caffeine/wiki/Efficiency]? That 
technique can augment any eviction policy (LRU, LCS, etc) and filters based on 
frequency to provide near optimal hit rates. The [CountMin 
sketch|https://github.com/ben-manes/caffeine/blob/master/caffeine/src/main/java/com/github/benmanes/caffeine/cache/FrequencySketch.java]
 could be ported from [Caffeine|https://github.com/ben-manes/caffeine], or that 
library used directly if applicable.
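For context, a minimal Scala sketch of the library being suggested; Caffeine's 
default policy is Window TinyLFU, and the keys, values and sizes below are 
simple stand-ins rather than a proposed Spark patch:

{code}
import com.github.benmanes.caffeine.cache.{Cache, Caffeine}

// Entry-count-bounded cache; eviction/admission is frequency-aware (TinyLFU),
// not plain LRU. A real block cache would bound by weight (bytes) instead,
// via maximumWeight plus a weigher.
val cache: Cache[String, Array[Byte]] = Caffeine.newBuilder()
  .maximumSize(10000)
  .build[String, Array[Byte]]()

cache.put("rdd_0_0", new Array[Byte](1024))
println(Option(cache.getIfPresent("rdd_0_0")).map(_.length))
{code}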

> Support multiple eviction strategies for cached RDD partitions
> --
>
> Key: SPARK-14289
> URL: https://issues.apache.org/jira/browse/SPARK-14289
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager, Spark Core
> Environment: Spark 2.0-SNAPSHOT
> Single Rack
> Standalone mode scheduling
> 8 node cluster
> 16 cores & 64G RAM / node
> Data Replication factor of 3
> Each Node has 1 Spark executors configured with 16 cores each and 40GB of RAM.
>Reporter: Yuanzhen Geng
>Priority: Minor
>
> Currently, there is only one eviction strategy for cached RDD partitions in 
> Spark. The default strategy is LRU (with the additional rule that it does not 
> evict a block belonging to the same RDD as the partition currently being 
> created).
> When memory is not sufficient for RDD caching, several partitions are evicted; 
> if these partitions are used again later, they are recomputed from the lineage 
> information and cached in memory again. This recomputation brings additional 
> cost, and LRU gives no guarantee of the lowest recomputation cost.
> The first RDD that needs to be cached is usually generated by reading from 
> HDFS and applying several transformations, and the read usually costs more 
> time than the other Spark transformations.
> For example, suppose one stage has the following DAG structure: hdfs -> 
> \[A\] -> B -> \[C\] -> D -> \[E\] -> \[F\], where RDDs A, C, E, F need to be 
> cached in memory, F is created during this stage, and A, B and E had already 
> been created previously. Under the LRU eviction strategy, a partition of A 
> will be evicted first. However, the time cost of \[A\] -> B -> \[C\] may be 
> much less than that of hdfs -> \[A\], so evicting \[C\] may be better than 
> evicting \[A\].
> An eviction strategy based on creation cost may therefore be better than LRU: 
> record each transformation's time during the creation of a cached RDD 
> partition (e.g. \[E\] only needs to record the time cost of \[C\] -> D and 
> D -> \[E\]) plus the time cost of any required shuffle reads. When memory for 
> RDD storage is not sufficient, the partition with the least creation cost is 
> evicted first, so this strategy can be called LCS. My current demo shows a 
> better performance gain than the default LRU.
> This strategy needs to consider the following situations:
> 1. Unified Memory Management (UMM) is available since Spark 1.6, and the 
> memory available for execution while recomputing a partition may be quite 
> different from when the partition was first created. So, until this is thought 
> through, LCS may not be allowed in UMM mode (though my demo also shows LCS 
> improving on LRU in UMM mode).
> 2. MEMORY_AND_DISK_SER and similar storage levels may serialize RDD 
> partitions. By estimating the ser/deserialize cost and comparing it to the 
> creation cost, a partition whose ser/deserialize cost is even larger than its 
> recreation cost should not be serialized but simply removed from memory. Since 
> existing storage levels apply only to a whole RDD, a new storage level may be 
> needed at the partition level to decide whether to serialize or just remove 
> from memory.
> Besides LCS, FIFO or LFU would be easy to implement.






[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-04-10 Thread John Berryman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15234264#comment-15234264
 ] 

John Berryman commented on SPARK-13587:
---

At my work we're using devpi and a homecooked proxy server to "fake" pip and 
serve up cached wheels. I'm not sure such a solution should be built into 
Spark, but it is at least a POC for being able to make the virtualenv idea fast 
(after the first build at least). Really it wouldn't be hard to set this up 
outside of Spark, you would just need to make {{pip}} actually point to 
{{my_own_specially_wheel_caching_pip}}.

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for users to add third-party Python packages in 
> PySpark.
> * One way is to use --py-files (suitable for simple dependencies, but not for 
> complicated ones, especially those with transitive dependencies)
> * Another way is to install packages manually on each node (time consuming, 
> and not easy to switch between environments)
> Python now has two different virtualenv implementations: the native virtualenv 
> and conda. This JIRA is about bringing these two tools to a distributed 
> environment.






[jira] [Created] (SPARK-14519) Cross-publish Kafka for Scala 2.12.0-M4

2016-04-10 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-14519:
--

 Summary: Cross-publish Kafka for Scala 2.12.0-M4
 Key: SPARK-14519
 URL: https://issues.apache.org/jira/browse/SPARK-14519
 Project: Spark
  Issue Type: Sub-task
Reporter: Josh Rosen


In order to build the streaming Kafka connector, we need to publish Kafka for 
Scala 2.12.0-M4. Someone should file an issue against the Kafka project and 
work with their developers to figure out what will block their upgrade / 
release.






[jira] [Updated] (SPARK-14415) All functions should show usages by command `DESC FUNCTION`

2016-04-10 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-14415:
-
Assignee: Dongjoon Hyun

> All functions should show usages by command `DESC FUNCTION`
> ---
>
> Key: SPARK-14415
> URL: https://issues.apache.org/jira/browse/SPARK-14415
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
> Fix For: 2.0.0
>
>
> Currently, many functions do not show usage information, like the following.
> {code}
> scala> sql("desc function extended `sin`").collect().foreach(println)
> [Function: sin]
> [Class: org.apache.spark.sql.catalyst.expressions.Sin]
> [Usage: To be added.]
> [Extended Usage:
> To be added.]
> {code}
> This PR adds descriptions for functions and adds a test case to prevent adding 
> functions without usage.
> {code}
> scala>  sql("desc function extended `sin`").collect().foreach(println);
> [Function: sin]
> [Class: org.apache.spark.sql.catalyst.expressions.Sin]
> [Usage: sin(x) - Returns the sine of x.]
> [Extended Usage:
> > SELECT sin(0);
>  0.0]
> {code}
> The only exceptions are `cube`, `grouping`, `grouping_id`, `rollup`, `window`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14415) All functions should show usages by command `DESC FUNCTION`

2016-04-10 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-14415.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12185
[https://github.com/apache/spark/pull/12185]

> All functions should show usages by command `DESC FUNCTION`
> ---
>
> Key: SPARK-14415
> URL: https://issues.apache.org/jira/browse/SPARK-14415
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Dongjoon Hyun
> Fix For: 2.0.0
>
>
> Currently, many functions do not show usages, like the following.
> {code}
> scala> sql("desc function extended `sin`").collect().foreach(println)
> [Function: sin]
> [Class: org.apache.spark.sql.catalyst.expressions.Sin]
> [Usage: To be added.]
> [Extended Usage:
> To be added.]
> {code}
> This PR adds descriptions for functions and adds a test case to prevent adding 
> functions without usage.
> {code}
> scala>  sql("desc function extended `sin`").collect().foreach(println);
> [Function: sin]
> [Class: org.apache.spark.sql.catalyst.expressions.Sin]
> [Usage: sin(x) - Returns the sine of x.]
> [Extended Usage:
> > SELECT sin(0);
>  0.0]
> {code}
> The only exceptions are `cube`, `grouping`, `grouping_id`, `rollup`, `window`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14518) Support Comment in CREATE VIEW

2016-04-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14518:


Assignee: Apache Spark

> Support Comment in CREATE VIEW
> --
>
> Key: SPARK-14518
> URL: https://issues.apache.org/jira/browse/SPARK-14518
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> {noformat}
> CREATE VIEW [IF NOT EXISTS] [db_name.]view_name [(column_name [COMMENT 
> column_comment], ...) ]
>   [COMMENT view_comment]
>   [TBLPROPERTIES (property_name = property_value, ...)]
>   AS SELECT ...;
> {noformat}
> Now, the COMMENT clause is not supported.
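For illustration, a minimal sketch of what the requested COMMENT support would allow 
once implemented; the view name, column names, and comments below are hypothetical, 
and {{sql(...)}} is the same helper used in the other snippets in this digest:

{code}
// Hypothetical example of the syntax above once COMMENT is supported.
sql("""
  CREATE VIEW IF NOT EXISTS sales_summary (region COMMENT 'sales region')
  COMMENT 'Aggregated sales per region'
  TBLPROPERTIES ('created.by' = 'etl')
  AS SELECT region FROM sales
""")
{code}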



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14518) Support Comment in CREATE VIEW

2016-04-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15234160#comment-15234160
 ] 

Apache Spark commented on SPARK-14518:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/12288

> Support Comment in CREATE VIEW
> --
>
> Key: SPARK-14518
> URL: https://issues.apache.org/jira/browse/SPARK-14518
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> {noformat}
> CREATE VIEW [IF NOT EXISTS] [db_name.]view_name [(column_name [COMMENT 
> column_comment], ...) ]
>   [COMMENT view_comment]
>   [TBLPROPERTIES (property_name = property_value, ...)]
>   AS SELECT ...;
> {noformat}
> Now, the COMMENT clause is not supported.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14518) Support Comment in CREATE VIEW

2016-04-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14518:


Assignee: (was: Apache Spark)

> Support Comment in CREATE VIEW
> --
>
> Key: SPARK-14518
> URL: https://issues.apache.org/jira/browse/SPARK-14518
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> {noformat}
> CREATE VIEW [IF NOT EXISTS] [db_name.]view_name [(column_name [COMMENT 
> column_comment], ...) ]
>   [COMMENT view_comment]
>   [TBLPROPERTIES (property_name = property_value, ...)]
>   AS SELECT ...;
> {noformat}
> Now, the COMMENT clause is not supported.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14518) Support Comment in CREATE VIEW

2016-04-10 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-14518:

Description: 
{noformat}
CREATE VIEW [IF NOT EXISTS] [db_name.]view_name [(column_name [COMMENT 
column_comment], ...) ]
  [COMMENT view_comment]
  [TBLPROPERTIES (property_name = property_value, ...)]
  AS SELECT ...;
{noformat}

Now, the COMMENT clause is not supported.

  was:
{noformat}
CREATE VIEW [IF NOT EXISTS] [db_name.]view_name [(column_name [COMMENT 
column_comment], ...) ]
  [COMMENT view_comment]
  [TBLPROPERTIES (property_name = property_value, ...)]
  AS SELECT ...;
{noformat}

Now, COMMENT is not supported.


> Support Comment in CREATE VIEW
> --
>
> Key: SPARK-14518
> URL: https://issues.apache.org/jira/browse/SPARK-14518
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> {noformat}
> CREATE VIEW [IF NOT EXISTS] [db_name.]view_name [(column_name [COMMENT 
> column_comment], ...) ]
>   [COMMENT view_comment]
>   [TBLPROPERTIES (property_name = property_value, ...)]
>   AS SELECT ...;
> {noformat}
> Now, the COMMENT clause is not supported.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14518) Support Comment in CREATE VIEW

2016-04-10 Thread Xiao Li (JIRA)
Xiao Li created SPARK-14518:
---

 Summary: Support Comment in CREATE VIEW
 Key: SPARK-14518
 URL: https://issues.apache.org/jira/browse/SPARK-14518
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0
Reporter: Xiao Li


{noformat}
CREATE VIEW [IF NOT EXISTS] [db_name.]view_name [(column_name [COMMENT 
column_comment], ...) ]
  [COMMENT view_comment]
  [TBLPROPERTIES (property_name = property_value, ...)]
  AS SELECT ...;
{noformat}

Now, COMMENT is not supported.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14505) Creating two SparkContext objects in the same JVM, the first one cannot run any tasks!

2016-04-10 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15234155#comment-15234155
 ] 

Sean Owen commented on SPARK-14505:
---

I'd say the solution is clearly to not make a second context, but if the 
behavior can be made more robust even in this unsupported situation, without any 
extra complexity or downside, that seems OK.
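On the user side, the usual way to avoid constructing a second context is to reuse 
the one that already exists; a minimal sketch (this is standard 
SparkContext.getOrCreate usage, not a change to Spark itself):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Reuse the already-running context (e.g. the one created by spark-shell)
// instead of calling `new SparkContext(...)` a second time.
val conf = new SparkConf().setAppName("app")
val sc = SparkContext.getOrCreate(conf)
println(sc.range(1, 10).reduce(_ + _))
{code}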

> Creating two SparkContext objects in the same JVM, the first one cannot run 
> any tasks!
> 
>
> Key: SPARK-14505
> URL: https://issues.apache.org/jira/browse/SPARK-14505
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: The sea
>Priority: Minor
>
> Execute code below in spark shell:
> {code}
> import org.apache.spark.SparkContext
> val sc = new SparkContext("local", "app")
> sc.range(1, 10).reduce(_ + _)
> {code}
> The exception is :
> 16/04/09 15:40:01 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 1.0 
> (TID 3, 192.168.172.131): java.io.IOException: 
> org.apache.spark.SparkException: Failed to get broadcast_1_piece0 of 
> broadcast_1
>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1222)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:165)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:88)
>   at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: Failed to get broadcast_1_piece0 
> of broadcast_1
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:138)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:138)
>   at scala.Option.getOrElse(Option.scala:120)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:137)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:120)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:120)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:120)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:175)
>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1219)
>   ... 11 more



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14022) What about adding RandomProjection to ML/MLLIB as a new dimensionality reduction algorithm?

2016-04-10 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15234153#comment-15234153
 ] 

Sean Owen commented on SPARK-14022:
---

It's pretty simple to implement even without library support. It is a valid 
quick-n-dirty dimension reduction technique. If it can be added simply it seems 
reasonable to me.
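For reference, a minimal, library-free sketch of Gaussian random projection; this is 
only an illustration of the technique, not proposed MLlib code:

{code}
import scala.util.Random

object RandomProjectionSketch {
  // Random matrix with entries ~ N(0, 1/k), so pairwise distances are
  // approximately preserved when projecting from d down to k dimensions.
  def projectionMatrix(d: Int, k: Int, seed: Long = 42L): Array[Array[Double]] = {
    val rng = new Random(seed)
    Array.fill(k, d)(rng.nextGaussian() / math.sqrt(k))
  }

  // y = R * x
  def project(matrix: Array[Array[Double]], x: Array[Double]): Array[Double] =
    matrix.map(row => row.zip(x).map { case (a, b) => a * b }.sum)

  def main(args: Array[String]): Unit = {
    val r = projectionMatrix(d = 1000, k = 20)
    val x = Array.fill(1000)(Random.nextDouble())
    println(project(r, x).length) // 20
  }
}
{code}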

> What about adding RandomProjection to ML/MLLIB as a new dimensionality 
> reduction algorithm?
> ---
>
> Key: SPARK-14022
> URL: https://issues.apache.org/jira/browse/SPARK-14022
> Project: Spark
>  Issue Type: Brainstorming
>Reporter: zhengruifeng
>Priority: Minor
>
> What about adding RandomProjection to ML/MLLIB as a new dimensionality 
> reduction algorithm?
> RandomProjection (https://en.wikipedia.org/wiki/Random_projection) reduces 
> the dimensionality by projecting the original input space on a randomly 
> generated matrix. 
> It is fully scalable, and runs fast (maybe fastest).
> It was implemented in sklearn 
> (http://scikit-learn.org/stable/modules/random_projection.html)
> I am willing to do this, if needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14516) What about adding general clustering metrics?

2016-04-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-14516:
--
Priority: Minor  (was: Major)

I personally think silhouette could be worth adding. The supervised metrics 
require a label, and I'm not sure that's generally available in clustering problems.
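For reference, a small self-contained sketch of the silhouette coefficient for a 
single point (it assumes at least two clusters; illustration only, not an MLlib API):

{code}
object SilhouetteSketch {
  def dist(x: Array[Double], y: Array[Double]): Double =
    math.sqrt(x.zip(y).map { case (a, b) => (a - b) * (a - b) }.sum)

  // s(i) = (b - a) / max(a, b), where a is the mean distance to the point's own
  // cluster and b is the lowest mean distance to any other cluster.
  def silhouette(i: Int, points: Array[Array[Double]], labels: Array[Int]): Double = {
    val own = points.indices.filter(j => j != i && labels(j) == labels(i))
    val a = own.map(j => dist(points(i), points(j))).sum / math.max(own.size, 1)
    val b = labels.distinct.filter(_ != labels(i)).map { c =>
      val members = points.indices.filter(labels(_) == c)
      members.map(j => dist(points(i), points(j))).sum / members.size
    }.min
    (b - a) / math.max(a, b)
  }
}
{code}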

> What about adding general clustering metrics?
> -
>
> Key: SPARK-14516
> URL: https://issues.apache.org/jira/browse/SPARK-14516
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML, MLlib
>Reporter: zhengruifeng
>Priority: Minor
>
> ML/MLlib don't have any general-purpose clustering metrics with a ground 
> truth.
> In 
> [Scikit-Learn](http://scikit-learn.org/stable/modules/classes.html#clustering-metrics),
>  there are several kinds of metrics for this.
> It may be meaningful to add some clustering metrics into ML/MLLIB.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14479) GLM supports output link prediction

2016-04-10 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15234124#comment-15234124
 ] 

Yanbo Liang edited comment on SPARK-14479 at 4/10/16 1:59 PM:
--

Had an offline discussion with [~mengxr]; we all agreed with [~josephkb]'s proposal. We 
can output 2 prediction columns: "predictionCol" (default) and 
"linkPredictionCol", which correspond to the "response" and "link" types of R's 
predict.glm output. 
I converted this task to "Improvement" and sent a PR for supporting link 
prediction.


was (Author: yanboliang):
Had offline discussion with [~mengxr], we can output 2 prediction columns: 
"predictionCol"(default) and "linkPredictionCol" which corresponding to 
"response" and "link" type of R predict.glm output. 
I converted this task to "Improvement" and sent a PR for supporting link 
prediction.

> GLM supports output link prediction
> ---
>
> Key: SPARK-14479
> URL: https://issues.apache.org/jira/browse/SPARK-14479
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Reporter: Yanbo Liang
>
> In R glm and glmnet, the default type of predict is "link" which is the 
> linear predictor, users can specify "type = response" to output response 
> prediction. Currently the ML glm predict will output "response" prediction by 
> default, I think it's more reasonable. Should we change the default type of 
> ML glm predict output? 
> R glm: 
> https://stat.ethz.ch/R-manual/R-devel/library/stats/html/predict.glm.html
> R glmnet: http://www.inside-r.org/packages/cran/glmnet/docs/predict.glmnet
> Meanwhile, we should decide the default type of glm predict output in SparkR.
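In user code, the proposal in the comment above would look roughly like the sketch 
below. The {{setLinkPredictionCol}} setter is the one being added by the linked pull 
request, so treat it as pending API rather than something already released:

{code}
import org.apache.spark.ml.regression.GeneralizedLinearRegression

// "prediction" stays on the response scale; "linkPrediction" exposes the
// linear predictor (the "link" type of R's predict.glm).
val glr = new GeneralizedLinearRegression()
  .setFamily("poisson")
  .setLink("log")
  .setLinkPredictionCol("linkPrediction")

// val model = glr.fit(trainingDF)  // trainingDF is a hypothetical DataFrame
// model.transform(testDF).select("prediction", "linkPrediction")
{code}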



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14479) GLM supports output link prediction

2016-04-10 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-14479:

Summary: GLM supports output link prediction  (was: GLM predict type should 
be link or response?)

> GLM supports output link prediction
> ---
>
> Key: SPARK-14479
> URL: https://issues.apache.org/jira/browse/SPARK-14479
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Reporter: Yanbo Liang
>
> In R glm and glmnet, the default type of predict is "link" which is the 
> linear predictor, users can specify "type = response" to output response 
> prediction. Currently the ML glm predict will output "response" prediction by 
> default, I think it's more reasonable. Should we change the default type of 
> ML glm predict output? 
> R glm: 
> https://stat.ethz.ch/R-manual/R-devel/library/stats/html/predict.glm.html
> R glmnet: http://www.inside-r.org/packages/cran/glmnet/docs/predict.glmnet
> Meanwhile, we should decide the default type of glm predict output in SparkR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14479) GLM predict type should be link or response?

2016-04-10 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-14479:

Issue Type: Improvement  (was: Question)

> GLM predict type should be link or response?
> 
>
> Key: SPARK-14479
> URL: https://issues.apache.org/jira/browse/SPARK-14479
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Reporter: Yanbo Liang
>
> In R glm and glmnet, the default type of predict is "link" which is the 
> linear predictor, users can specify "type = response" to output response 
> prediction. Currently the ML glm predict will output "response" prediction by 
> default, I think it's more reasonable. Should we change the default type of 
> ML glm predict output? 
> R glm: 
> https://stat.ethz.ch/R-manual/R-devel/library/stats/html/predict.glm.html
> R glmnet: http://www.inside-r.org/packages/cran/glmnet/docs/predict.glmnet
> Meanwhile, we should decide the default type of glm predict output in SparkR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14479) GLM predict type should be link or response?

2016-04-10 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15234124#comment-15234124
 ] 

Yanbo Liang commented on SPARK-14479:
-

Had an offline discussion with [~mengxr]; we can output 2 prediction columns: 
"predictionCol" (default) and "linkPredictionCol", which correspond to the 
"response" and "link" types of R's predict.glm output. 
I converted this task to "Improvement" and sent a PR for supporting link 
prediction.

> GLM predict type should be link or response?
> 
>
> Key: SPARK-14479
> URL: https://issues.apache.org/jira/browse/SPARK-14479
> Project: Spark
>  Issue Type: Question
>  Components: ML, SparkR
>Reporter: Yanbo Liang
>
> In R glm and glmnet, the default type of predict is "link" which is the 
> linear predictor, users can specify "type = response" to output response 
> prediction. Currently the ML glm predict will output "response" prediction by 
> default, I think it's more reasonable. Should we change the default type of 
> ML glm predict output? 
> R glm: 
> https://stat.ethz.ch/R-manual/R-devel/library/stats/html/predict.glm.html
> R glmnet: http://www.inside-r.org/packages/cran/glmnet/docs/predict.glmnet
> Meanwhile, we should decide the default type of glm predict output in SparkR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14479) GLM predict type should be link or response?

2016-04-10 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15234124#comment-15234124
 ] 

Yanbo Liang edited comment on SPARK-14479 at 4/10/16 1:52 PM:
--

Had an offline discussion with [~mengxr]; we can output 2 prediction columns: 
"predictionCol" (default) and "linkPredictionCol", which correspond to the 
"response" and "link" types of R's predict.glm output. 
I converted this task to "Improvement" and sent a PR for supporting link 
prediction.


was (Author: yanboliang):
Had offline discussion with [~mengxr], we can output 2 prediction columns: 
"predictionCol"(default) and "linkPredictionCol" which corresponding to 
"response" and "link" type of R predict.glm output. 
I converted this task to "Improvement" and send a PR for supporting link 
prediction.

> GLM predict type should be link or response?
> 
>
> Key: SPARK-14479
> URL: https://issues.apache.org/jira/browse/SPARK-14479
> Project: Spark
>  Issue Type: Question
>  Components: ML, SparkR
>Reporter: Yanbo Liang
>
> In R glm and glmnet, the default type of predict is "link" which is the 
> linear predictor, users can specify "type = response" to output response 
> prediction. Currently the ML glm predict will output "response" prediction by 
> default, I think it's more reasonable. Should we change the default type of 
> ML glm predict output? 
> R glm: 
> https://stat.ethz.ch/R-manual/R-devel/library/stats/html/predict.glm.html
> R glmnet: http://www.inside-r.org/packages/cran/glmnet/docs/predict.glmnet
> Meanwhile, we should decide the default type of glm predict output in SparkR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14479) GLM predict type should be link or response?

2016-04-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14479:


Assignee: Apache Spark

> GLM predict type should be link or response?
> 
>
> Key: SPARK-14479
> URL: https://issues.apache.org/jira/browse/SPARK-14479
> Project: Spark
>  Issue Type: Question
>  Components: ML, SparkR
>Reporter: Yanbo Liang
>Assignee: Apache Spark
>
> In R glm and glmnet, the default type of predict is "link" which is the 
> linear predictor, users can specify "type = response" to output response 
> prediction. Currently the ML glm predict will output "response" prediction by 
> default, I think it's more reasonable. Should we change the default type of 
> ML glm predict output? 
> R glm: 
> https://stat.ethz.ch/R-manual/R-devel/library/stats/html/predict.glm.html
> R glmnet: http://www.inside-r.org/packages/cran/glmnet/docs/predict.glmnet
> Meanwhile, we should decide the default type of glm predict output in SparkR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14479) GLM predict type should be link or response?

2016-04-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15234121#comment-15234121
 ] 

Apache Spark commented on SPARK-14479:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/12287

> GLM predict type should be link or response?
> 
>
> Key: SPARK-14479
> URL: https://issues.apache.org/jira/browse/SPARK-14479
> Project: Spark
>  Issue Type: Question
>  Components: ML, SparkR
>Reporter: Yanbo Liang
>
> In R glm and glmnet, the default type of predict is "link" which is the 
> linear predictor, users can specify "type = response" to output response 
> prediction. Currently the ML glm predict will output "response" prediction by 
> default, I think it's more reasonable. Should we change the default type of 
> ML glm predict output? 
> R glm: 
> https://stat.ethz.ch/R-manual/R-devel/library/stats/html/predict.glm.html
> R glmnet: http://www.inside-r.org/packages/cran/glmnet/docs/predict.glmnet
> Meanwhile, we should decide the default type of glm predict output in SparkR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14479) GLM predict type should be link or response?

2016-04-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14479:


Assignee: (was: Apache Spark)

> GLM predict type should be link or response?
> 
>
> Key: SPARK-14479
> URL: https://issues.apache.org/jira/browse/SPARK-14479
> Project: Spark
>  Issue Type: Question
>  Components: ML, SparkR
>Reporter: Yanbo Liang
>
> In R glm and glmnet, the default type of predict is "link" which is the 
> linear predictor, users can specify "type = response" to output response 
> prediction. Currently the ML glm predict will output "response" prediction by 
> default, I think it's more reasonable. Should we change the default type of 
> ML glm predict output? 
> R glm: 
> https://stat.ethz.ch/R-manual/R-devel/library/stats/html/predict.glm.html
> R glmnet: http://www.inside-r.org/packages/cran/glmnet/docs/predict.glmnet
> Meanwhile, we should decide the default type of glm predict output in SparkR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-14517) GLM should support predict link

2016-04-10 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang closed SPARK-14517.
---
Resolution: Duplicate

> GLM should support predict link
> ---
>
> Key: SPARK-14517
> URL: https://issues.apache.org/jira/browse/SPARK-14517
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Reporter: Yanbo Liang
>
> In R glm and glmnet, the default type of predict is "link" which is the 
> linear predictor, users can specify "type = response" to output response 
> prediction. Currently the ML glm predict will output "response" prediction by 
> default, I think it's more reasonable. Should we change the default type of 
> ML glm predict output? 
> R glm: 
> https://stat.ethz.ch/R-manual/R-devel/library/stats/html/predict.glm.html
> R glmnet: http://www.inside-r.org/packages/cran/glmnet/docs/predict.glmnet
> Meanwhile, we should decide the default type of glm predict output in SparkR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13944) Separate out local linear algebra as a standalone module without Spark dependency

2016-04-10 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15234078#comment-15234078
 ] 

Nick Pentreath edited comment on SPARK-13944 at 4/10/16 12:15 PM:
--

What about the case of
{code}
dataFrame.rdd.map { case Row(v: Vector) => ... }
{code}
This returns the new {{ml.linalg}} vector, which would need to be converted to 
a {{mllib.linalg}} vector in order to be used in the {{mllib}} APIs.

And is the other way around handled with the automatic conversion in 
{{VectorUDT}}?


was (Author: mlnick):
What about the case of {{dataFrame.rdd.map { case Row(v: Vector) => ... } }}? 
This returns the new {{ml.linalg}} vector, which would need to be converted to 
a {{mllib.linalg}} vector in order to be used in the {{mllib}} APIs.

And is the other way around is handled with the automatic conversion in 
{{VectorUDT}}?

> Separate out local linear algebra as a standalone module without Spark 
> dependency
> -
>
> Key: SPARK-13944
> URL: https://issues.apache.org/jira/browse/SPARK-13944
> Project: Spark
>  Issue Type: New Feature
>  Components: Build, ML
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: DB Tsai
>Priority: Blocker
>
> Separate out linear algebra as a standalone module without Spark dependency 
> to simplify production deployment. We can call the new module 
> spark-mllib-local, which might contain local models in the future.
> The major issue is to remove dependencies on user-defined types.
> The package name will be changed from mllib to ml. For example, Vector will 
> be changed from `org.apache.spark.mllib.linalg.Vector` to 
> `org.apache.spark.ml.linalg.Vector`. The return vector type in the new ML 
> pipeline will be the one in ML package; however, the existing mllib code will 
> not be touched. As a result, this will potentially break the API. Also, when 
> the vector is loaded from an mllib vector by Spark SQL, the vector will 
> automatically be converted into the one in the ml package.
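For the explicit conversion asked about above, a sketch of what user code could look 
like once the split is in place; the converter names ({{Vectors.fromML}}, {{asML}}) 
are the ones discussed for this work and should be read as assumptions here:

{code}
import org.apache.spark.ml.{linalg => newlinalg}
import org.apache.spark.mllib.{linalg => oldlinalg}

// New ml.linalg vector coming out of a DataFrame-based pipeline...
val mlVec: newlinalg.Vector = newlinalg.Vectors.dense(1.0, 2.0, 3.0)
// ...converted back to an mllib.linalg vector for the RDD-based APIs, and back again.
val mllibVec: oldlinalg.Vector = oldlinalg.Vectors.fromML(mlVec)
val backAgain: newlinalg.Vector = mllibVec.asML
{code}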



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13944) Separate out local linear algebra as a standalone module without Spark dependency

2016-04-10 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15234078#comment-15234078
 ] 

Nick Pentreath commented on SPARK-13944:


What about the case of {{dataFrame.rdd.map { case Row(v: Vector) => ... } }}? 
This returns the new {{ml.linalg}} vector, which would need to be converted to 
a {{mllib.linalg}} vector in order to be used in the {{mllib}} APIs.

And is the other way around handled with the automatic conversion in 
{{VectorUDT}}?

> Separate out local linear algebra as a standalone module without Spark 
> dependency
> -
>
> Key: SPARK-13944
> URL: https://issues.apache.org/jira/browse/SPARK-13944
> Project: Spark
>  Issue Type: New Feature
>  Components: Build, ML
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: DB Tsai
>Priority: Blocker
>
> Separate out linear algebra as a standalone module without Spark dependency 
> to simplify production deployment. We can call the new module 
> spark-mllib-local, which might contain local models in the future.
> The major issue is to remove dependencies on user-defined types.
> The package name will be changed from mllib to ml. For example, Vector will 
> be changed from `org.apache.spark.mllib.linalg.Vector` to 
> `org.apache.spark.ml.linalg.Vector`. The return vector type in the new ML 
> pipeline will be the one in ML package; however, the existing mllib code will 
> not be touched. As a result, this will potentially break the API. Also, when 
> the vector is loaded from an mllib vector by Spark SQL, the vector will 
> automatically be converted into the one in the ml package.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13944) Separate out local linear algebra as a standalone module without Spark dependency

2016-04-10 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15234077#comment-15234077
 ] 

Nick Pentreath commented on SPARK-13944:


A type alias is a better solution if we aim to break compat across the project - 
it means (as far as I can see from my work on it so far) that we can avoid any 
code changes for Scala users. However, it breaks everything for Java, so it isn't a 
good solution if we want to keep backward compat for {{mllib}} (as you 
mentioned above already).

> Separate out local linear algebra as a standalone module without Spark 
> dependency
> -
>
> Key: SPARK-13944
> URL: https://issues.apache.org/jira/browse/SPARK-13944
> Project: Spark
>  Issue Type: New Feature
>  Components: Build, ML
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: DB Tsai
>Priority: Blocker
>
> Separate out linear algebra as a standalone module without Spark dependency 
> to simplify production deployment. We can call the new module 
> spark-mllib-local, which might contain local models in the future.
> The major issue is to remove dependencies on user-defined types.
> The package name will be changed from mllib to ml. For example, Vector will 
> be changed from `org.apache.spark.mllib.linalg.Vector` to 
> `org.apache.spark.ml.linalg.Vector`. The return vector type in the new ML 
> pipeline will be the one in ML package; however, the existing mllib code will 
> not be touched. As a result, this will potentially break the API. Also, when 
> the vector is loaded from an mllib vector by Spark SQL, the vector will 
> automatically be converted into the one in the ml package.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14517) GLM should support predict link

2016-04-10 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-14517:

Issue Type: Improvement  (was: Question)

> GLM should support predict link
> ---
>
> Key: SPARK-14517
> URL: https://issues.apache.org/jira/browse/SPARK-14517
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Reporter: Yanbo Liang
>
> In R glm and glmnet, the default type of predict is "link" which is the 
> linear predictor, users can specify "type = response" to output response 
> prediction. Currently the ML glm predict will output "response" prediction by 
> default, I think it's more reasonable. Should we change the default type of 
> ML glm predict output? 
> R glm: 
> https://stat.ethz.ch/R-manual/R-devel/library/stats/html/predict.glm.html
> R glmnet: http://www.inside-r.org/packages/cran/glmnet/docs/predict.glmnet
> Meanwhile, we should decide the default type of glm predict output in SparkR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14517) GLM should support predict link

2016-04-10 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-14517:

Summary: GLM should support predict link  (was: CLONE - GLM predict type 
should be link or response?)

> GLM should support predict link
> ---
>
> Key: SPARK-14517
> URL: https://issues.apache.org/jira/browse/SPARK-14517
> Project: Spark
>  Issue Type: Question
>  Components: ML, SparkR
>Reporter: Yanbo Liang
>
> In R glm and glmnet, the default type of predict is "link" which is the 
> linear predictor, users can specify "type = response" to output response 
> prediction. Currently the ML glm predict will output "response" prediction by 
> default, I think it's more reasonable. Should we change the default type of 
> ML glm predict output? 
> R glm: 
> https://stat.ethz.ch/R-manual/R-devel/library/stats/html/predict.glm.html
> R glmnet: http://www.inside-r.org/packages/cran/glmnet/docs/predict.glmnet
> Meanwhile, we should decide the default type of glm predict output in SparkR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14517) CLONE - GLM predict type should be link or response?

2016-04-10 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-14517:
---

 Summary: CLONE - GLM predict type should be link or response?
 Key: SPARK-14517
 URL: https://issues.apache.org/jira/browse/SPARK-14517
 Project: Spark
  Issue Type: Question
  Components: ML, SparkR
Reporter: Yanbo Liang


In R glm and glmnet, the default type of predict is "link", which is the linear 
predictor; users can specify "type = response" to output response predictions. 
Currently the ML glm predict outputs "response" predictions by default, which I 
think is more reasonable. Should we change the default type of the ML glm predict 
output? 
R glm: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/predict.glm.html
R glmnet: http://www.inside-r.org/packages/cran/glmnet/docs/predict.glmnet

Meanwhile, we should decide the default type of glm predict output in SparkR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-9882) Priority-based scheduling for Spark applications

2016-04-10 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh closed SPARK-9882.
--
Resolution: Won't Fix

> Priority-based scheduling for Spark applications
> 
>
> Key: SPARK-9882
> URL: https://issues.apache.org/jira/browse/SPARK-9882
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Liang-Chi Hsieh
>
> We implement this patch because in our daily usage of Spark we found that 
> applications scheduling is an important issue for utilizing cluster 
> resources. Currently in standalone mode, we don't have efficient way to 
> schedule multiple Spark applications on a single cluster. We need an 
> efficient way to manage different Spark applications with different 
> priorities.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9882) Priority-based scheduling for Spark applications

2016-04-10 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15234028#comment-15234028
 ] 

Liang-Chi Hsieh edited comment on SPARK-9882 at 4/10/16 9:54 AM:
-

This PR has been open for a while. As it hasn't drawn much interest from the 
community, I will close this ticket.


was (Author: viirya):
This PR stays for a while. As the PR doesn't get the interesting form 
community. I will close this ticket.

> Priority-based scheduling for Spark applications
> 
>
> Key: SPARK-9882
> URL: https://issues.apache.org/jira/browse/SPARK-9882
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Liang-Chi Hsieh
>
> We implement this patch because in our daily usage of Spark we found that 
> applications scheduling is an important issue for utilizing cluster 
> resources. Currently in standalone mode, we don't have efficient way to 
> schedule multiple Spark applications on a single cluster. We need an 
> efficient way to manage different Spark applications with different 
> priorities.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9882) Priority-based scheduling for Spark applications

2016-04-10 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15234028#comment-15234028
 ] 

Liang-Chi Hsieh commented on SPARK-9882:


This PR has been open for a while. As it hasn't drawn much interest from the 
community, I will close this ticket.

> Priority-based scheduling for Spark applications
> 
>
> Key: SPARK-9882
> URL: https://issues.apache.org/jira/browse/SPARK-9882
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Liang-Chi Hsieh
>
> We implement this patch because in our daily usage of Spark we found that 
> applications scheduling is an important issue for utilizing cluster 
> resources. Currently in standalone mode, we don't have efficient way to 
> schedule multiple Spark applications on a single cluster. We need an 
> efficient way to manage different Spark applications with different 
> priorities.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9882) Priority-based scheduling for Spark applications

2016-04-10 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15234027#comment-15234027
 ] 

Liang-Chi Hsieh commented on SPARK-9882:


I've updated the description. Thanks!

> Priority-based scheduling for Spark applications
> 
>
> Key: SPARK-9882
> URL: https://issues.apache.org/jira/browse/SPARK-9882
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Liang-Chi Hsieh
>
> We implement this patch because in our daily usage of Spark we found that 
> applications scheduling is an important issue for utilizing cluster 
> resources. Currently in standalone mode, we don't have efficient way to 
> schedule multiple Spark applications on a single cluster. We need an 
> efficient way to manage different Spark applications with different 
> priorities.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9882) Priority-based scheduling for Spark applications

2016-04-10 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-9882:
---
Description: 
We implement this patch because, in our daily usage of Spark, we found that 
application scheduling is an important issue for utilizing cluster resources. 
Currently, in standalone mode, we don't have an efficient way to schedule multiple 
Spark applications on a single cluster. We need an efficient way to manage 
different Spark applications with different priorities.


  was:
This is a priority-based scheduling based on the interface proposed in #7958. 
It uses a XML configuration file to specify the submission pools for 
applications. This patch adds a new parameter `--pool` to SparkSubmit under 
standalone mode to specify which pool the submitted application should be 
assigned to. The priority of the submitted application is defined by the 
assigned pool. It also defines the cores it can acquire.



> Priority-based scheduling for Spark applications
> 
>
> Key: SPARK-9882
> URL: https://issues.apache.org/jira/browse/SPARK-9882
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Liang-Chi Hsieh
>
> We implement this patch because in our daily usage of Spark we found that 
> applications scheduling is an important issue for utilizing cluster 
> resources. Currently in standalone mode, we don't have efficient way to 
> schedule multiple Spark applications on a single cluster. We need an 
> efficient way to manage different Spark applications with different 
> priorities.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9882) Priority-based scheduling for Spark applications

2016-04-10 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15234024#comment-15234024
 ] 

Mark Hamstra edited comment on SPARK-9882 at 4/10/16 9:49 AM:
--

This isn't a very well written JIRA.  You are just duplicating some of the 
description you gave in the PR.  That description is good for a PR -- it 
explains what is being done.  That is not good for a JIRA ticket, which should 
explain why something should be changed or added.  In other words, the 
"Motivation" section of your PR description really belongs here in the JIRA 
description.


was (Author: markhamstra):
This isn't a very well written JIRA.  You are just duplicating the description 
you gave in the PR.  That description is good for a PR -- it explains what is 
being done.  That is not good for a JIRA ticket, which should explain why 
something should be changed or added.

> Priority-based scheduling for Spark applications
> 
>
> Key: SPARK-9882
> URL: https://issues.apache.org/jira/browse/SPARK-9882
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Liang-Chi Hsieh
>
> This is a priority-based scheduling based on the interface proposed in #7958. 
> It uses a XML configuration file to specify the submission pools for 
> applications. This patch adds a new parameter `--pool` to SparkSubmit under 
> standalone mode to specify which pool the submitted application should be 
> assigned to. The priority of the submitted application is defined by the 
> assigned pool. It also defines the cores it can acquire.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9882) Priority-based scheduling for Spark applications

2016-04-10 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15234024#comment-15234024
 ] 

Mark Hamstra commented on SPARK-9882:
-

This isn't a very well written JIRA.  You are just duplicating the description 
you gave in the PR.  That description is good for a PR -- it explains what is 
being done.  That is not good for a JIRA ticket, which should explain why 
something should be changed or added.

> Priority-based scheduling for Spark applications
> 
>
> Key: SPARK-9882
> URL: https://issues.apache.org/jira/browse/SPARK-9882
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Liang-Chi Hsieh
>
> This is a priority-based scheduling based on the interface proposed in #7958. 
> It uses a XML configuration file to specify the submission pools for 
> applications. This patch adds a new parameter `--pool` to SparkSubmit under 
> standalone mode to specify which pool the submitted application should be 
> assigned to. The priority of the submitted application is defined by the 
> assigned pool. It also defines the cores it can acquire.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14298) LDA should support disable checkpoint

2016-04-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15233996#comment-15233996
 ] 

Apache Spark commented on SPARK-14298:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/12286

> LDA should support disable checkpoint
> -
>
> Key: SPARK-14298
> URL: https://issues.apache.org/jira/browse/SPARK-14298
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 1.5.2, 1.6.1, 2.0.0
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Minor
> Fix For: 2.0.0
>
>
> LDA should support disable checkpoint by setting checkpointInterval = -1
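A minimal sketch of the requested behavior, assuming the convention from the ticket 
that -1 disables checkpointing (support for -1 is what the linked pull request adds):

{code}
import org.apache.spark.ml.clustering.LDA

val lda = new LDA()
  .setK(10)
  .setMaxIter(50)
  .setCheckpointInterval(-1)  // -1: never checkpoint intermediate state

// val model = lda.fit(dataset)  // dataset: hypothetical DataFrame with a "features" column
{code}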



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14497) Use top instead of sortBy() to get top N frequent words as dict in CountVectorizer

2016-04-10 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14497:
--
Assignee: Feng Wang

> Use top instead of sortBy() to get top N frequent words as dict in 
> CountVectorizer
> --
>
> Key: SPARK-14497
> URL: https://issues.apache.org/jira/browse/SPARK-14497
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Feng Wang
>Assignee: Feng Wang
> Fix For: 2.0.0
>
>
> It's not necessary to sort the whole rdd to get top n frequent words.
> // Sort terms to select vocab
> wordCounts.sortBy(_._2, ascending = false).take(vocSize)
>   
> we could use top() instead since: 
> top - O(n)
> sortBy - O(n log n)
> A minor side effect introduced by top() using the default implicit Ordering on 
> Tuple2: 
> terms with the same TF in the dictionary will be sorted in descending order.
> (a:1), (b:1), (c:1)  => dict: [c, b, a]
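For reference, a minimal sketch of the substitution described above; {{wordCounts}} 
and {{vocSize}} are assumed inputs (an RDD of (term, frequency) pairs and the 
vocabulary size):

{code}
import org.apache.spark.rdd.RDD

def selectVocab(wordCounts: RDD[(String, Long)], vocSize: Int): Array[String] = {
  // Before: wordCounts.sortBy(_._2, ascending = false).take(vocSize)  -- sorts the whole RDD.
  // After: top() only keeps the vocSize entries with the largest counts.
  wordCounts.top(vocSize)(Ordering.by(_._2)).map(_._1)
}
{code}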



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14497) Use top instead of sortBy() to get top N frequent words as dict in CountVectorizer

2016-04-10 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-14497.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12265
[https://github.com/apache/spark/pull/12265]

> Use top instead of sortBy() to get top N frequent words as dict in 
> CountVectorizer
> --
>
> Key: SPARK-14497
> URL: https://issues.apache.org/jira/browse/SPARK-14497
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Feng Wang
> Fix For: 2.0.0
>
>
> It's not necessary to sort the whole rdd to get top n frequent words.
> // Sort terms to select vocab
> wordCounts.sortBy(_._2, ascending = false).take(vocSize)
>   
> we could use top() instead since: 
> top - O(n)
> sortBy - O(n log n)
> A minor side effect introduced by top() using the default implicit Ordering on 
> Tuple2: 
> terms with the same TF in the dictionary will be sorted in descending order.
> (a:1), (b:1), (c:1)  => dict: [c, b, a]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14363) Executor OOM due to a memory leak in Sorter

2016-04-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15233978#comment-15233978
 ] 

Apache Spark commented on SPARK-14363:
--

User 'sitalkedia' has created a pull request for this issue:
https://github.com/apache/spark/pull/12285

> Executor OOM due to a memory leak in Sorter
> ---
>
> Key: SPARK-14363
> URL: https://issues.apache.org/jira/browse/SPARK-14363
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 1.6.1
>Reporter: Sital Kedia
>
> While running a Spark job, we see that the job fails because of an executor OOM 
> with the following stack trace - 
> {code}
> java.lang.OutOfMemoryError: Unable to acquire 76 bytes of memory, got 0
>   at 
> org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:120)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:326)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:341)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:91)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:168)
>   at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:90)
>   at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:64)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> The issue is that there is a memory leak in the Sorter.  When the 
> UnsafeExternalSorter spills the data to disk, it does not free up the 
> underlying pointer array. As a result, we see a lot of executor OOMs and also 
> memory under-utilization.
> This is a regression partially introduced in PR 
> https://github.com/apache/spark/pull/9241



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14363) Executor OOM due to a memory leak in Sorter

2016-04-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14363:


Assignee: (was: Apache Spark)

> Executor OOM due to a memory leak in Sorter
> ---
>
> Key: SPARK-14363
> URL: https://issues.apache.org/jira/browse/SPARK-14363
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 1.6.1
>Reporter: Sital Kedia
>
> While running a Spark job, we see that the job fails because of an executor OOM 
> with the following stack trace - 
> {code}
> java.lang.OutOfMemoryError: Unable to acquire 76 bytes of memory, got 0
>   at 
> org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:120)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:326)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:341)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:91)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:168)
>   at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:90)
>   at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:64)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> The issue is that there is a memory leak in the Sorter. When the 
> UnsafeExternalSorter spills the data to disk, it does not free up the 
> underlying pointer array. As a result, we see a lot of executor OOMs and also 
> memory underutilization.
> This is a regression partially introduced in PR 
> https://github.com/apache/spark/pull/9241



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14363) Executor OOM due to a memory leak in Sorter

2016-04-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14363:


Assignee: Apache Spark

> Executor OOM due to a memory leak in Sorter
> ---
>
> Key: SPARK-14363
> URL: https://issues.apache.org/jira/browse/SPARK-14363
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 1.6.1
>Reporter: Sital Kedia
>Assignee: Apache Spark
>
> While running a Spark job, we see that the job fails because of an executor 
> OOM with the following stack trace:
> {code}
> java.lang.OutOfMemoryError: Unable to acquire 76 bytes of memory, got 0
>   at 
> org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:120)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:326)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:341)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:91)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:168)
>   at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:90)
>   at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:64)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> The issue is that there is a memory leak in the Sorter. When the 
> UnsafeExternalSorter spills the data to disk, it does not free up the 
> underlying pointer array. As a result, we see a lot of executor OOMs and also 
> memory underutilization.
> This is a regression partially introduced in PR 
> https://github.com/apache/spark/pull/9241



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14363) Executor OOM due to a memory leak in Sorter

2016-04-10 Thread Sital Kedia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sital Kedia updated SPARK-14363:

Description: 
While running a Spark job, we see that the job fails because of an executor 
OOM with the following stack trace:
{code}
java.lang.OutOfMemoryError: Unable to acquire 76 bytes of memory, got 0
at 
org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:120)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:326)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:341)
at 
org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:91)
at 
org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:168)
at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:90)
at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:64)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

{code}

The issue is that there is a memory leak in the Sorter. When the 
UnsafeExternalSorter spills the data to disk, it does not free up the 
underlying pointer array. As a result, we see a lot of executor OOMs and also 
memory underutilization.

This is a regression partially introduced in PR 
https://github.com/apache/spark/pull/9241

  was:
While running a Spark job, we see that the job fails because of executor OOM 
with following stack trace - 
{code}
java.lang.OutOfMemoryError: Unable to acquire 76 bytes of memory, got 0
at 
org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:120)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:326)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:341)
at 
org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:91)
at 
org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:168)
at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:90)
at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:64)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at 

[jira] [Updated] (SPARK-14363) Executor OOM due to a memory leak in Sorter

2016-04-10 Thread Sital Kedia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sital Kedia updated SPARK-14363:

Description: 
While running a Spark job, we see that the job fails because of an executor 
OOM with the following stack trace:
{code}
java.lang.OutOfMemoryError: Unable to acquire 76 bytes of memory, got 0
at 
org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:120)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:326)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:341)
at 
org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:91)
at 
org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:168)
at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:90)
at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:64)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

{code}

The issue is that there is a memory leak in the Sorter. When the 
UnsafeExternalSorter spills the data to disk, it does not free up the 
underlying pointer array. As a result, we see a lot of executor OOMs and also 
memory underutilization.

  was:
While running a Spark job, we see that the job fails because of executor OOM 
with following stack trace - 
{code}
java.lang.OutOfMemoryError: Unable to acquire 76 bytes of memory, got 0
at 
org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:120)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:326)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:341)
at 
org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:91)
at 
org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:168)
at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:90)
at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:64)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 

[jira] [Updated] (SPARK-14363) Executor OOM due to a memory leak in Sorter

2016-04-10 Thread Sital Kedia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sital Kedia updated SPARK-14363:

Summary: Executor OOM due to a memory leak in Sorter  (was: Executor OOM 
while trying to acquire new page from the memory manager)

> Executor OOM due to a memory leak in Sorter
> ---
>
> Key: SPARK-14363
> URL: https://issues.apache.org/jira/browse/SPARK-14363
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 1.6.1
>Reporter: Sital Kedia
>
> While running a Spark job, we see that the job fails because of an executor 
> OOM with the following stack trace:
> {code}
> java.lang.OutOfMemoryError: Unable to acquire 76 bytes of memory, got 0
>   at 
> org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:120)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:326)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:341)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:91)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:168)
>   at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:90)
>   at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:64)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12922) Implement gapply() on DataFrame in SparkR

2016-04-10 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15233886#comment-15233886
 ] 

Narine Kokhlikyan edited comment on SPARK-12922 at 4/10/16 7:23 AM:


Hi [~sunrui],

I have a question regarding your suggestion about adding a new 
"GroupedData.flatMapRGroups" function according to the following document:
https://docs.google.com/presentation/d/1oj17N5JaE8JDjT2as_DUI6LKutLcEHNZB29HsRGL_dM/edit#slide=id.p9

It seems that some changes have happened in SparkSQL. In 1.6.1 there was a 
Scala class:
https://github.com/apache/spark/blob/v1.6.1/sql/core/src/main/scala/org/apache/spark/sql/GroupedData.scala

This class doesn't seem to exist in 2.0.0.

I was thinking of adding the flatMapRGroups helper function to 
org.apache.spark.sql.KeyValueGroupedDataset or 
org.apache.spark.sql.RelationalGroupedDataset. What do you think?

Thank you,
Narine



was (Author: narine):
Hi [~sunrui],

I have a question regarding your suggestion about adding a new 
"GroupedData.flatMapRGroups" function according to the following document:
https://docs.google.com/presentation/d/1oj17N5JaE8JDjT2as_DUI6LKutLcEHNZB29HsRGL_dM/edit#slide=id.p9

It seems that some changes has happened in SparkSQL. According to 1.6.1 there 
was a scala class called:
https://github.com/apache/spark/blob/v1.6.1/sql/core/src/main/scala/org/apache/spark/sql/GroupedData.scala

This doesn't seem to exist in 2.0.0

I was thinking to add the flatMapRGroups helper function to 
org.apache.spark.sql.KeyValueGroupedDataset or 
org.apache.spark.sql.RelationalGroupedDataset. What do you think ?

Thank you,
Narine


> Implement gapply() on DataFrame in SparkR
> -
>
> Key: SPARK-12922
> URL: https://issues.apache.org/jira/browse/SPARK-12922
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 1.6.0
>Reporter: Sun Rui
>
> gapply() applies an R function on groups grouped by one or more columns of a 
> DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() 
> in the Dataset API.
> Two API styles are supported:
> 1.
> {code}
> gd <- groupBy(df, col1, ...)
> gapply(gd, function(grouping_key, group) {}, schema)
> {code}
> 2.
> {code}
> gapply(df, grouping_columns, function(grouping_key, group) {}, schema) 
> {code}
> R function input: the grouping key values and a local data.frame of the 
> grouped data.
> R function output: a local data.frame.
> Schema specifies the Row format of the output of the R function. It must 
> match the R function's output.
> Note that map-side combination (partial aggregation) is not supported; users 
> can do map-side combination via dapply().
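The contract described above can be illustrated with a plain-Scala analogue (names and types are made up for illustration; no SparkR or Spark APIs involved): group rows by a key, hand each (key, local group) pair to a user function, and flatten the per-group results, much like flatMapGroups.

{code}
case class Row(dept: String, salary: Double)

// gapply-like helper: the user function sees the grouping key and the whole
// local group, and returns any number of output records.
def gapplyLike[K, Out](rows: Seq[Row])(key: Row => K)(f: (K, Seq[Row]) => Seq[Out]): Seq[Out] =
  rows.groupBy(key).toSeq.flatMap { case (k, group) => f(k, group) }

val rows = Seq(Row("eng", 100.0), Row("eng", 120.0), Row("sales", 90.0))

val avgByDept = gapplyLike(rows)(_.dept) { (dept, group) =>
  Seq((dept, group.map(_.salary).sum / group.size))
}
// avgByDept contains ("eng", 110.0) and ("sales", 90.0) (order not guaranteed)
{code}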



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14516) What about adding general clustering metrics?

2016-04-10 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15233946#comment-15233946
 ] 

zhengruifeng commented on SPARK-14516:
--

cc [~mengxr] [~josephkb] [~yanboliang]

> What about adding general clustering metrics?
> -
>
> Key: SPARK-14516
> URL: https://issues.apache.org/jira/browse/SPARK-14516
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML, MLlib
>Reporter: zhengruifeng
>
> ML/MLlib don't have any general-purpose clustering metrics that use a ground 
> truth.
> In 
> [Scikit-Learn](http://scikit-learn.org/stable/modules/classes.html#clustering-metrics),
>  there are several kinds of metrics for this.
> It may be meaningful to add some clustering metrics to ML/MLlib.
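As a concrete example of the kind of ground-truth metric meant here, a purity score can be computed in a few lines (a plain-Scala sketch with invented names, not an existing ML/MLlib API): for each predicted cluster take the count of its most common true label, and divide the sum of those counts by the number of points.

{code}
// assignments: (predictedCluster, trueLabel) pairs
def purity(assignments: Seq[(Int, Int)]): Double = {
  require(assignments.nonEmpty, "need at least one point")
  val majorityHits = assignments
    .groupBy(_._1)                 // points grouped by predicted cluster
    .values
    .map(cluster => cluster.groupBy(_._2).values.map(_.size).max)  // majority-label count
    .sum
  majorityHits.toDouble / assignments.size
}

// Two clusters, six points; cluster 0 is pure, cluster 1 has one mislabelled point.
val score = purity(Seq((0, 1), (0, 1), (0, 1), (1, 2), (1, 2), (1, 1)))
// score = (3 + 2) / 6, about 0.83
{code}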



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14516) What about adding general clustering metrics?

2016-04-10 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-14516:


 Summary: What about adding general clustering metrics?
 Key: SPARK-14516
 URL: https://issues.apache.org/jira/browse/SPARK-14516
 Project: Spark
  Issue Type: Brainstorming
  Components: ML, MLlib
Reporter: zhengruifeng


ML/MLlib don't have any general-purpose clustering metrics that use a ground truth.
In 
[Scikit-Learn](http://scikit-learn.org/stable/modules/classes.html#clustering-metrics),
 there are several kinds of metrics for this.
It may be meaningful to add some clustering metrics to ML/MLlib.






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14022) What about adding RandomProjection to ML/MLLIB as a new dimensionality reduction algorithm?

2016-04-10 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15233941#comment-15233941
 ] 

zhengruifeng commented on SPARK-14022:
--

cc [~yanboliang] [~mengxr] [~josephkb]

> What about adding RandomProjection to ML/MLLIB as a new dimensionality 
> reduction algorithm?
> ---
>
> Key: SPARK-14022
> URL: https://issues.apache.org/jira/browse/SPARK-14022
> Project: Spark
>  Issue Type: Brainstorming
>Reporter: zhengruifeng
>Priority: Minor
>
> What about adding RandomProjection to ML/MLLIB as a new dimensionality 
> reduction algorithm?
> RandomProjection (https://en.wikipedia.org/wiki/Random_projection) reduces 
> the dimensionality by projecting the original input space onto a randomly 
> generated matrix. 
> It is fully scalable and runs fast (maybe the fastest).
> It is implemented in sklearn 
> (http://scikit-learn.org/stable/modules/random_projection.html)
> I am willing to do this, if needed.
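To make the idea concrete, a Gaussian random projection fits in a few lines of plain Scala (illustrative only, no MLlib types; the helper names are invented): draw a k x d matrix with N(0, 1/k) entries and multiply each input vector by it, which approximately preserves pairwise distances (Johnson-Lindenstrauss).

{code}
import scala.util.Random

// k x d projection matrix with entries drawn from N(0, 1/k)
def randomProjectionMatrix(d: Int, k: Int, seed: Long = 42L): Array[Array[Double]] = {
  val rng = new Random(seed)
  Array.fill(k, d)(rng.nextGaussian() / math.sqrt(k))
}

// y = R x  (naive dense matrix-vector product)
def project(r: Array[Array[Double]], x: Array[Double]): Array[Double] =
  r.map(row => row.zip(x).map { case (a, b) => a * b }.sum)

val R = randomProjectionMatrix(d = 1000, k = 50)
val x = Array.fill(1000)(Random.nextDouble())
val y = project(R, x)   // x had 1000 dimensions, y has 50
{code}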



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14022) What about adding RandomProjection to ML/MLLIB as a new dimensionality reduction algorithm?

2016-04-10 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15233937#comment-15233937
 ] 

zhengruifeng commented on SPARK-14022:
--

Ok, I changed the Type from Question to Brainstorming.
I reopened this JIRA because I think it may be nice to add a RandomProjection 
algorithm.

> What about adding RandomProjection to ML/MLLIB as a new dimensionality 
> reduction algorithm?
> ---
>
> Key: SPARK-14022
> URL: https://issues.apache.org/jira/browse/SPARK-14022
> Project: Spark
>  Issue Type: Brainstorming
>Reporter: zhengruifeng
>Priority: Minor
>
> What about adding RandomProjection to ML/MLLIB as a new dimensionality 
> reduction algorithm?
> RandomProjection (https://en.wikipedia.org/wiki/Random_projection) reduces 
> the dimensionality by projecting the original input space onto a randomly 
> generated matrix. 
> It is fully scalable and runs fast (maybe the fastest).
> It is implemented in sklearn 
> (http://scikit-learn.org/stable/modules/random_projection.html)
> I am willing to do this, if needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-14022) What about adding RandomProjection to ML/MLLIB as a new dimensionality reduction algorithm?

2016-04-10 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng reopened SPARK-14022:
--

There may need to be some discussion on whether to add RandomProjection or not.

> What about adding RandomProjection to ML/MLLIB as a new dimensionality 
> reduction algorithm?
> ---
>
> Key: SPARK-14022
> URL: https://issues.apache.org/jira/browse/SPARK-14022
> Project: Spark
>  Issue Type: Brainstorming
>Reporter: zhengruifeng
>Priority: Minor
>
> What about adding RandomProjection to ML/MLLIB as a new dimensionality 
> reduction algorithm?
> RandomProjection (https://en.wikipedia.org/wiki/Random_projection) reduces 
> the dimensionality by projecting the original input space onto a randomly 
> generated matrix. 
> It is fully scalable and runs fast (maybe the fastest).
> It is implemented in sklearn 
> (http://scikit-learn.org/stable/modules/random_projection.html)
> I am willing to do this, if needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14022) What about adding RandomProjection to ML/MLLIB as a new dimensionality reduction algorithm?

2016-04-10 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-14022:
-
Issue Type: Brainstorming  (was: Question)

> What about adding RandomProjection to ML/MLLIB as a new dimensionality 
> reduction algorithm?
> ---
>
> Key: SPARK-14022
> URL: https://issues.apache.org/jira/browse/SPARK-14022
> Project: Spark
>  Issue Type: Brainstorming
>Reporter: zhengruifeng
>Priority: Minor
>
> What about adding RandomProjection to ML/MLLIB as a new dimensionality 
> reduction algorithm?
> RandomProjection (https://en.wikipedia.org/wiki/Random_projection) reduces 
> the dimensionality by projecting the original input space onto a randomly 
> generated matrix. 
> It is fully scalable and runs fast (maybe the fastest).
> It is implemented in sklearn 
> (http://scikit-learn.org/stable/modules/random_projection.html)
> I am willing to do this, if needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14357) Tasks that fail due to CommitDeniedException (a side-effect of speculation) can cause job failure

2016-04-10 Thread Jason Moore (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Moore resolved SPARK-14357.
-
Resolution: Fixed

Issue resolved by pull request 12228
https://github.com/apache/spark/pull/12228

> Tasks that fail due to CommitDeniedException (a side-effect of speculation) 
> can cause job failure
> -
>
> Key: SPARK-14357
> URL: https://issues.apache.org/jira/browse/SPARK-14357
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2, 1.6.0, 1.6.1
>Reporter: Jason Moore
>Assignee: Jason Moore
>Priority: Critical
> Fix For: 1.6.2, 2.0.0, 1.5.2
>
>
> Speculation can often result in a CommitDeniedException, but ideally this 
> shouldn't result in the job failing.  So changes were made along with 
> SPARK-8167 to ensure that the CommitDeniedException is caught and given a 
> failure reason that doesn't increment the failure count.
> However, I'm still noticing that this exception is causing jobs to fail using 
> the 1.6.1 release version.
> {noformat}
> 16/04/04 11:36:02 ERROR InsertIntoHadoopFsRelation: Aborting job.
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 18 in 
> stage 315.0 failed 8 times, most recent failure: Lost task 18.8 in stage 
> 315.0 (TID 100793, qaphdd099.quantium.com.au.local): 
> org.apache.spark.SparkException: Task failed while writing rows.
> at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:272)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.RuntimeException: Failed to commit task
> at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.commitTask$1(WriterContainer.scala:287)
> at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:267)
> ... 8 more
> Caused by: org.apache.spark.executor.CommitDeniedException: 
> attempt_201604041136_0315_m_18_8: Not committed because the driver did 
> not authorize commit
> at 
> org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:135)
> at 
> org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitTask(WriterContainer.scala:219)
> at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.commitTask$1(WriterContainer.scala:282)
> ... 9 more
> {noformat}
> It seems to me that the CommitDeniedException gets wrapped into a 
> RuntimeException at 
> [WriterContainer.scala#L286|https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriterContainer.scala#L286]
>  and then into a SparkException at 
> [InsertIntoHadoopFsRelation.scala#L154|https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelation.scala#L154]
>  which means it cannot be handled properly at 
> [Executor.scala#L290|https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/executor/Executor.scala#L290]
> Perhaps the solution is for this catch block to type-match on the innermost 
> cause of the error?
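A sketch of that "match on the innermost cause" idea (illustrative only, not the change that actually went into Spark): walk the cause chain of the caught Throwable and test whether a given exception type is buried under the RuntimeException / SparkException wrappers.

{code}
// All causes of t, outermost first (Throwable.getCause returns null at the end
// of the chain, and the JDK prevents self-referential cause cycles).
def causeChain(t: Throwable): List[Throwable] =
  Iterator.iterate(t)(_.getCause).takeWhile(_ != null).toList

def hasCause[E <: Throwable](t: Throwable)(implicit ct: scala.reflect.ClassTag[E]): Boolean =
  causeChain(t).exists(ct.runtimeClass.isInstance)

// In an executor-side catch block one could then write, for example:
//   case t: Throwable if hasCause[org.apache.spark.executor.CommitDeniedException](t) =>
//     // report "commit denied" instead of counting a task failure
{code}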



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14357) Tasks that fail due to CommitDeniedException (a side-effect of speculation) can cause job failure

2016-04-10 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-14357:
--
Assignee: Jason Moore

> Tasks that fail due to CommitDeniedException (a side-effect of speculation) 
> can cause job failure
> -
>
> Key: SPARK-14357
> URL: https://issues.apache.org/jira/browse/SPARK-14357
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2, 1.6.0, 1.6.1
>Reporter: Jason Moore
>Assignee: Jason Moore
>Priority: Critical
> Fix For: 1.5.2, 1.6.2, 2.0.0
>
>
> Speculation can often result in a CommitDeniedException, but ideally this 
> shouldn't result in the job failing.  So changes were made along with 
> SPARK-8167 to ensure that the CommitDeniedException is caught and given a 
> failure reason that doesn't increment the failure count.
> However, I'm still noticing that this exception is causing jobs to fail using 
> the 1.6.1 release version.
> {noformat}
> 16/04/04 11:36:02 ERROR InsertIntoHadoopFsRelation: Aborting job.
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 18 in 
> stage 315.0 failed 8 times, most recent failure: Lost task 18.8 in stage 
> 315.0 (TID 100793, qaphdd099.quantium.com.au.local): 
> org.apache.spark.SparkException: Task failed while writing rows.
> at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:272)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.RuntimeException: Failed to commit task
> at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.commitTask$1(WriterContainer.scala:287)
> at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:267)
> ... 8 more
> Caused by: org.apache.spark.executor.CommitDeniedException: 
> attempt_201604041136_0315_m_18_8: Not committed because the driver did 
> not authorize commit
> at 
> org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:135)
> at 
> org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitTask(WriterContainer.scala:219)
> at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.commitTask$1(WriterContainer.scala:282)
> ... 9 more
> {noformat}
> It seems to me that the CommitDeniedException gets wrapped into a 
> RuntimeException at 
> [WriterContainer.scala#L286|https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriterContainer.scala#L286]
>  and then into a SparkException at 
> [InsertIntoHadoopFsRelation.scala#L154|https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelation.scala#L154]
>  which means it cannot be handled properly at 
> [Executor.scala#L290|https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/executor/Executor.scala#L290]
> Perhaps the solution is for this catch block to type-match on the innermost 
> cause of the error?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14357) Tasks that fail due to CommitDeniedException (a side-effect of speculation) can cause job failure

2016-04-10 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-14357:
--
Target Version/s: 1.5.2, 1.6.2, 2.0.0
   Fix Version/s: 1.5.2
  2.0.0
  1.6.2

> Tasks that fail due to CommitDeniedException (a side-effect of speculation) 
> can cause job failure
> -
>
> Key: SPARK-14357
> URL: https://issues.apache.org/jira/browse/SPARK-14357
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2, 1.6.0, 1.6.1
>Reporter: Jason Moore
>Priority: Critical
> Fix For: 1.5.2, 1.6.2, 2.0.0
>
>
> Speculation can often result in a CommitDeniedException, but ideally this 
> shouldn't result in the job failing.  So changes were made along with 
> SPARK-8167 to ensure that the CommitDeniedException is caught and given a 
> failure reason that doesn't increment the failure count.
> However, I'm still noticing that this exception is causing jobs to fail using 
> the 1.6.1 release version.
> {noformat}
> 16/04/04 11:36:02 ERROR InsertIntoHadoopFsRelation: Aborting job.
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 18 in 
> stage 315.0 failed 8 times, most recent failure: Lost task 18.8 in stage 
> 315.0 (TID 100793, qaphdd099.quantium.com.au.local): 
> org.apache.spark.SparkException: Task failed while writing rows.
> at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:272)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.RuntimeException: Failed to commit task
> at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.commitTask$1(WriterContainer.scala:287)
> at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:267)
> ... 8 more
> Caused by: org.apache.spark.executor.CommitDeniedException: 
> attempt_201604041136_0315_m_18_8: Not committed because the driver did 
> not authorize commit
> at 
> org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:135)
> at 
> org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitTask(WriterContainer.scala:219)
> at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.commitTask$1(WriterContainer.scala:282)
> ... 9 more
> {noformat}
> It seems to me that the CommitDeniedException gets wrapped into a 
> RuntimeException at 
> [WriterContainer.scala#L286|https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriterContainer.scala#L286]
>  and then into a SparkException at 
> [InsertIntoHadoopFsRelation.scala#L154|https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelation.scala#L154]
>  which means it cannot be handled properly at 
> [Executor.scala#L290|https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/executor/Executor.scala#L290]
> Perhaps the solution is for this catch block to type-match on the innermost 
> cause of the error?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14455) ReceiverTracker#allocatedExecutors throw NPE for receiver-less streaming application

2016-04-10 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-14455.
---
  Resolution: Fixed
Assignee: Saisai Shao
   Fix Version/s: 2.0.0
Target Version/s: 2.0.0

> ReceiverTracker#allocatedExecutors throw NPE for receiver-less streaming 
> application
> 
>
> Key: SPARK-14455
> URL: https://issues.apache.org/jira/browse/SPARK-14455
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 2.0.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
> Fix For: 2.0.0
>
>
> When testing streaming dynamic allocation with direct Kafka, the streaming 
> {{ExecutorAllocationManager}} throws an NPE because the {{ReceiverTracker}} 
> has not been started; this happens when running a receiver-less streaming 
> application such as direct Kafka. Here is the exception log:
> {noformat}
> Exception in thread "RecurringTimer - streaming-executor-allocation-manager" 
> java.lang.NullPointerException
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker.allocatedExecutors(ReceiverTracker.scala:244)
>   at 
> org.apache.spark.streaming.scheduler.ExecutorAllocationManager.killExecutor(ExecutorAllocationManager.scala:124)
>   at 
> org.apache.spark.streaming.scheduler.ExecutorAllocationManager.org$apache$spark$streaming$scheduler$ExecutorAllocationManager$$manageAllocation(ExecutorAllocationManager.scala:100)
>   at 
> org.apache.spark.streaming.scheduler.ExecutorAllocationManager$$anonfun$1.apply$mcVJ$sp(ExecutorAllocationManager.scala:66)
>   at 
> org.apache.spark.streaming.util.RecurringTimer.triggerActionForNextInterval(RecurringTimer.scala:94)
>   at 
> org.apache.spark.streaming.util.RecurringTimer.org$apache$spark$streaming$util$RecurringTimer$$loop(RecurringTimer.scala:106)
>   at 
> org.apache.spark.streaming.util.RecurringTimer$$anon$1.run(RecurringTimer.scala:29)
> {noformat}
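To illustrate the receiver-less case in isolation (a toy class with invented names, not the actual ReceiverTracker code): the tracked state is only initialized when receivers exist, so any query method has to tolerate "never started".

{code}
class ToyReceiverTracker(hasReceivers: Boolean) {
  // populated only when there are receivers to track
  private var executorsByReceiver: Map[Int, Seq[String]] = null

  def start(): Unit =
    if (hasReceivers) executorsByReceiver = Map(0 -> Seq("executor-1"))

  // NPE-prone version: returns null for a receiver-less (e.g. direct Kafka) app
  def allocatedExecutorsUnsafe: Map[Int, Seq[String]] = executorsByReceiver

  // guarded version: empty result when the tracking state was never initialized
  def allocatedExecutors: Map[Int, Seq[String]] =
    if (executorsByReceiver == null) Map.empty else executorsByReceiver
}

val tracker = new ToyReceiverTracker(hasReceivers = false)
tracker.start()
tracker.allocatedExecutors.size          // 0, no exception
// tracker.allocatedExecutorsUnsafe.size // would throw a NullPointerException
{code}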



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14506) HiveClientImpl's toHiveTable misses a table property for external tables

2016-04-10 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-14506.
---
   Resolution: Fixed
 Assignee: Yin Huai
Fix Version/s: 2.0.0

> HiveClientImpl's toHiveTable misses a table property for external tables
> 
>
> Key: SPARK-14506
> URL: https://issues.apache.org/jira/browse/SPARK-14506
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
> Fix For: 2.0.0
>
>
> For an external table stored in the Hive metastore, its table properties 
> need to include a field called EXTERNAL whose value is TRUE.
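A minimal sketch of the property in question (a hypothetical helper over a plain property map, not the actual HiveClientImpl code): when the catalog table is external, the Hive table properties must additionally carry EXTERNAL=TRUE.

{code}
sealed trait TableType
case object ManagedTable extends TableType
case object ExternalTable extends TableType

def hiveTableProperties(tableType: TableType, props: Map[String, String]): Map[String, String] =
  tableType match {
    case ExternalTable => props + ("EXTERNAL" -> "TRUE")  // expected by the Hive metastore
    case ManagedTable  => props
  }

// hiveTableProperties(ExternalTable, Map("owner" -> "spark"))
//   == Map("owner" -> "spark", "EXTERNAL" -> "TRUE")
{code}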



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14406) Drop Table

2016-04-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15233908#comment-15233908
 ] 

Apache Spark commented on SPARK-14406:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/12284

> Drop Table
> --
>
> Key: SPARK-14406
> URL: https://issues.apache.org/jira/browse/SPARK-14406
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Xiao Li
> Fix For: 2.0.0
>
>
> Right now, the DropTable command is in the hive module. We should remove the 
> call to runSqlHive and move the command to sql/core.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14362) DDL Native Support: Drop View

2016-04-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15233907#comment-15233907
 ] 

Apache Spark commented on SPARK-14362:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/12284

> DDL Native Support: Drop View
> -
>
> Key: SPARK-14362
> URL: https://issues.apache.org/jira/browse/SPARK-14362
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.0.0
>
>
> Native parsing and native analysis of DDL command: Drop View.
> Based on the HIVE DDL document for 
> [DROP_VIEW_WEB_LINK](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-DropView), 
> `DROP VIEW` is defined as:
> Syntax:
> {noformat}
> DROP VIEW [IF EXISTS] [db_name.]view_name;
> {noformat}
>  - to remove metadata for the specified view. 
>  - it is illegal to use DROP TABLE on a view.
>  - it is illegal to use DROP VIEW on a table.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org