[jira] [Assigned] (SPARK-19706) add Column.contains in pyspark

2017-02-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19706:


Assignee: Apache Spark  (was: Wenchen Fan)

> add Column.contains in pyspark
> --
>
> Key: SPARK-19706
> URL: https://issues.apache.org/jira/browse/SPARK-19706
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>







[jira] [Assigned] (SPARK-19706) add Column.contains in pyspark

2017-02-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19706:


Assignee: Wenchen Fan  (was: Apache Spark)

> add Column.contains in pyspark
> --
>
> Key: SPARK-19706
> URL: https://issues.apache.org/jira/browse/SPARK-19706
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>







[jira] [Commented] (SPARK-19706) add Column.contains in pyspark

2017-02-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15880062#comment-15880062
 ] 

Apache Spark commented on SPARK-19706:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/17036
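For context, a minimal PySpark sketch of how such a Column.contains would typically be used, assuming it mirrors the existing Scala Column.contains (the DataFrame below is illustrative):

{code}
# Illustrative sketch only; assumes Column.contains mirrors the Scala API.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Apache Spark",), ("Apache Hadoop",)], ["name"])

# Keep only rows whose 'name' column contains the substring "Spark".
df.filter(df.name.contains("Spark")).show()
{code}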

> add Column.contains in pyspark
> --
>
> Key: SPARK-19706
> URL: https://issues.apache.org/jira/browse/SPARK-19706
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>







[jira] [Commented] (SPARK-19705) Preferred location supporting HDFS Cache for FileScanRDD

2017-02-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15880058#comment-15880058
 ] 

Apache Spark commented on SPARK-19705:
--

User 'tanejagagan' has created a pull request for this issue:
https://github.com/apache/spark/pull/17035

> Preferred location supporting HDFS Cache for FileScanRDD
> 
>
> Key: SPARK-19705
> URL: https://issues.apache.org/jira/browse/SPARK-19705
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: gagan taneja
>
> Although NewHadoopRDD and HadoopRDD consider the HDFS cache while calculating 
> preferredLocations, FileScanRDD does not take the HDFS cache into account when 
> calculating preferredLocations.
> The enhancement can easily be implemented for large files where a FilePartition 
> contains only a single HDFS file.
> It will also result in a significant performance improvement for cached HDFS 
> partitions.






[jira] [Assigned] (SPARK-19705) Preferred location supporting HDFS Cache for FileScanRDD

2017-02-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19705:


Assignee: (was: Apache Spark)

> Preferred location supporting HDFS Cache for FileScanRDD
> 
>
> Key: SPARK-19705
> URL: https://issues.apache.org/jira/browse/SPARK-19705
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: gagan taneja
>
> Although NewHadoopRDD and HadoopRDD consider the HDFS cache while calculating 
> preferredLocations, FileScanRDD does not take the HDFS cache into account when 
> calculating preferredLocations.
> The enhancement can easily be implemented for large files where a FilePartition 
> contains only a single HDFS file.
> It will also result in a significant performance improvement for cached HDFS 
> partitions.






[jira] [Assigned] (SPARK-19705) Preferred location supporting HDFS Cache for FileScanRDD

2017-02-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19705:


Assignee: Apache Spark

> Preferred location supporting HDFS Cache for FileScanRDD
> 
>
> Key: SPARK-19705
> URL: https://issues.apache.org/jira/browse/SPARK-19705
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: gagan taneja
>Assignee: Apache Spark
>
> Although NewHadoopRDD and HadoopRDD consider the HDFS cache while calculating 
> preferredLocations, FileScanRDD does not take the HDFS cache into account when 
> calculating preferredLocations.
> The enhancement can easily be implemented for large files where a FilePartition 
> contains only a single HDFS file.
> It will also result in a significant performance improvement for cached HDFS 
> partitions.






[jira] [Created] (SPARK-19706) add Column.contains in pyspark

2017-02-22 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-19706:
---

 Summary: add Column.contains in pyspark
 Key: SPARK-19706
 URL: https://issues.apache.org/jira/browse/SPARK-19706
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.2.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan









[jira] [Created] (SPARK-19705) Preferred location supporting HDFS Cache for FileScanRDD

2017-02-22 Thread gagan taneja (JIRA)
gagan taneja created SPARK-19705:


 Summary: Preferred location supporting HDFS Cache for FileScanRDD
 Key: SPARK-19705
 URL: https://issues.apache.org/jira/browse/SPARK-19705
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: gagan taneja


Although NewHadoopRDD and HadoopRDD consider the HDFS cache while calculating 
preferredLocations, FileScanRDD does not take the HDFS cache into account when 
calculating preferredLocations.
The enhancement can easily be implemented for large files where a FilePartition 
contains only a single HDFS file.
It will also result in a significant performance improvement for cached HDFS 
partitions.
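For illustration, a small Python pseudocode sketch (with hypothetical block metadata, not Spark's real data structures) of the preferred-location idea: hosts that hold a block in the HDFS cache are preferred over plain replica hosts. The actual FileScanRDD change would live in Scala; this only makes the intent concrete.

{code}
# Hypothetical block metadata; illustrative only.
def preferred_locations(blocks):
    hosts = []
    for block in blocks:
        # Prefer hosts holding the block in the HDFS cache; otherwise fall back
        # to all replica hosts for that block.
        hosts.extend(block["cached_hosts"] or block["hosts"])
    # De-duplicate while keeping order.
    return list(dict.fromkeys(hosts))

blocks = [
    {"hosts": ["node1", "node2", "node3"], "cached_hosts": ["node2"]},
    {"hosts": ["node2", "node4", "node5"], "cached_hosts": []},
]
print(preferred_locations(blocks))  # ['node2', 'node4', 'node5']
{code}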






[jira] [Created] (SPARK-19704) AFTSurvivalRegression should support numeric censorCol

2017-02-22 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-19704:


 Summary: AFTSurvivalRegression should support numeric censorCol
 Key: SPARK-19704
 URL: https://issues.apache.org/jira/browse/SPARK-19704
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.2.0
Reporter: zhengruifeng
Priority: Minor


AFTSurvivalRegression should support numeric censorCol
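To make the request concrete, a PySpark ML sketch (standard AFTSurvivalRegression API; the data values are illustrative). Today the censor column generally has to be DoubleType, so an integer column is cast explicitly; the improvement would make that cast unnecessary.

{code}
# Sketch only; assumes the standard pyspark.ml AFTSurvivalRegression API.
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.regression import AFTSurvivalRegression

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1.218, 1, Vectors.dense(1.560, -0.605)),
     (2.949, 0, Vectors.dense(0.346,  2.158)),
     (3.627, 0, Vectors.dense(1.380,  0.231)),
     (0.273, 1, Vectors.dense(0.520,  1.151))],
    ["label", "censor", "features"])  # 'censor' is an integer column here

aft = AFTSurvivalRegression(censorCol="censor")
# Workaround today: cast the numeric censor column to double before fitting.
model = aft.fit(df.withColumn("censor", df["censor"].cast("double")))
print(model.coefficients)
{code}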






[jira] [Assigned] (SPARK-19704) AFTSurvivalRegression should support numeric censorCol

2017-02-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19704:


Assignee: (was: Apache Spark)

> AFTSurvivalRegression should support numeric censorCol
> --
>
> Key: SPARK-19704
> URL: https://issues.apache.org/jira/browse/SPARK-19704
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: zhengruifeng
>Priority: Minor
>
> AFTSurvivalRegression should support numeric censorCol






[jira] [Assigned] (SPARK-19704) AFTSurvivalRegression should support numeric censorCol

2017-02-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19704:


Assignee: Apache Spark

> AFTSurvivalRegression should support numeric censorCol
> --
>
> Key: SPARK-19704
> URL: https://issues.apache.org/jira/browse/SPARK-19704
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: zhengruifeng
>Assignee: Apache Spark
>Priority: Minor
>
> AFTSurvivalRegression should support numeric censorCol






[jira] [Commented] (SPARK-19704) AFTSurvivalRegression should support numeric censorCol

2017-02-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15879997#comment-15879997
 ] 

Apache Spark commented on SPARK-19704:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/17034

> AFTSurvivalRegression should support numeric censorCol
> --
>
> Key: SPARK-19704
> URL: https://issues.apache.org/jira/browse/SPARK-19704
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: zhengruifeng
>Priority: Minor
>
> AFTSurvivalRegression should support numeric censorCol






[jira] [Assigned] (SPARK-19695) Throw an exception if a `columnNameOfCorruptRecord` field violates requirements in Json formats

2017-02-22 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-19695:
---

Assignee: Takeshi Yamamuro

> Throw an exception if a `columnNameOfCorruptRecord` field violates 
> requirements in Json formats
> ---
>
> Key: SPARK-19695
> URL: https://issues.apache.org/jira/browse/SPARK-19695
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Minor
> Fix For: 2.2.0
>
>
> This ticket comes from https://github.com/apache/spark/pull/16928, and fixes 
> a json behaviour along with the CSV one. 
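For background, a PySpark sketch of how the `columnNameOfCorruptRecord` field is normally declared when reading JSON (standard DataFrameReader options; the data is illustrative). The check added here should mean that declaring the corrupt-record field with a non-string type in a user-supplied schema fails fast instead of misbehaving silently.

{code}
# Sketch only; assumes the standard JSON reader options and an in-memory RDD of strings.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.getOrCreate()
data = spark.sparkContext.parallelize(['{"a": 1}', '{"a": broken'])

# The corrupt-record field must be a nullable StringType column in the schema.
schema = StructType([
    StructField("a", IntegerType(), True),
    StructField("_corrupt_record", StringType(), True),
])
df = (spark.read
      .schema(schema)
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .json(data))
df.show(truncate=False)  # the malformed row lands in _corrupt_record
{code}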






[jira] [Resolved] (SPARK-19695) Throw an exception if a `columnNameOfCorruptRecord` field violates requirements in Json formats

2017-02-22 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-19695.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17023
[https://github.com/apache/spark/pull/17023]

> Throw an exception if a `columnNameOfCorruptRecord` field violates 
> requirements in Json formats
> ---
>
> Key: SPARK-19695
> URL: https://issues.apache.org/jira/browse/SPARK-19695
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
> Fix For: 2.2.0
>
>
> This ticket comes from https://github.com/apache/spark/pull/16928, and fixes 
> a json behaviour along with the CSV one. 






[jira] [Commented] (SPARK-19688) Spark on Yarn Credentials File set to different application directory

2017-02-22 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15879893#comment-15879893
 ] 

Saisai Shao commented on SPARK-19688:
-

I see. So what issue did you encounter when you restarted the application 
manually, or did you just see the abnormal credential configuration?

From my understanding, this credential configuration will be overwritten when 
you restart the application, so it should be fine.

> Spark on Yarn Credentials File set to different application directory
> -
>
> Key: SPARK-19688
> URL: https://issues.apache.org/jira/browse/SPARK-19688
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, YARN
>Affects Versions: 1.6.3
>Reporter: Devaraj Jonnadula
>Priority: Minor
>
> spark.yarn.credentials.file property is set to different application Id 
> instead of actual Application Id 






[jira] [Commented] (SPARK-19688) Spark on Yarn Credentials File set to different application directory

2017-02-22 Thread Devaraj Jonnadula (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15879888#comment-15879888
 ] 

Devaraj Jonnadula commented on SPARK-19688:
---

[~jerryshao] I did not check for Yarn's reattempt. I am seeing this behavior 
for manual restarts.

> Spark on Yarn Credentials File set to different application directory
> -
>
> Key: SPARK-19688
> URL: https://issues.apache.org/jira/browse/SPARK-19688
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, YARN
>Affects Versions: 1.6.3
>Reporter: Devaraj Jonnadula
>Priority: Minor
>
> spark.yarn.credentials.file property is set to different application Id 
> instead of actual Application Id 






[jira] [Commented] (SPARK-19688) Spark on Yarn Credentials File set to different application directory

2017-02-22 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15879883#comment-15879883
 ] 

Saisai Shao commented on SPARK-19688:
-

[~j.devaraj], when you say the Spark application is restarted, are you referring 
to YARN's reattempt mechanism, or do you restart the application manually?

> Spark on Yarn Credentials File set to different application directory
> -
>
> Key: SPARK-19688
> URL: https://issues.apache.org/jira/browse/SPARK-19688
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, YARN
>Affects Versions: 1.6.3
>Reporter: Devaraj Jonnadula
>Priority: Minor
>
> spark.yarn.credentials.file property is set to different application Id 
> instead of actual Application Id 






[jira] [Reopened] (SPARK-5159) Thrift server does not respect hive.server2.enable.doAs=true

2017-02-22 Thread Luciano Resende (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luciano Resende reopened SPARK-5159:


Reopened due to comments above

> Thrift server does not respect hive.server2.enable.doAs=true
> 
>
> Key: SPARK-5159
> URL: https://issues.apache.org/jira/browse/SPARK-5159
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Andrew Ray
> Attachments: spark_thrift_server_log.txt
>
>
> I'm currently testing the spark sql thrift server on a kerberos secured 
> cluster in YARN mode. Currently any user can access any table regardless of 
> HDFS permissions as all data is read as the hive user. In HiveServer2 the 
> property hive.server2.enable.doAs=true causes all access to be done as the 
> submitting user. We should do the same.






[jira] [Commented] (SPARK-19644) Memory leak in Spark Streaming

2017-02-22 Thread Deenbandhu Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15879823#comment-15879823
 ] 

Deenbandhu Agarwal commented on SPARK-19644:


I am using scala 2.11

> Memory leak in Spark Streaming
> --
>
> Key: SPARK-19644
> URL: https://issues.apache.org/jira/browse/SPARK-19644
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
> Environment: 3 AWS EC2 c3.xLarge
> Number of cores - 3
> Number of executors 3 
> Memory to each executor 2GB
>Reporter: Deenbandhu Agarwal
>Priority: Critical
>  Labels: memory_leak, performance
> Attachments: Dominator_tree.png, heapdump.png, Path2GCRoot.png
>
>
> I am using streaming in production for some aggregation, fetching data from 
> Cassandra and saving data back to Cassandra. 
> I see a gradual increase in old-generation heap capacity from 1161216 bytes 
> to 1397760 bytes over a period of six hours.
> After 50 hours of processing, the number of instances of the class 
> scala.collection.immutable.$colon$colon increased to 12,811,793, which is a 
> huge number. 
> I think this is a clear case of a memory leak.






[jira] [Resolved] (SPARK-16122) Spark History Server REST API missing an environment endpoint per application

2017-02-22 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-16122.

   Resolution: Fixed
 Assignee: Genmao Yu
Fix Version/s: 2.2.0

> Spark History Server REST API missing an environment endpoint per application
> -
>
> Key: SPARK-16122
> URL: https://issues.apache.org/jira/browse/SPARK-16122
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation, Web UI
>Affects Versions: 1.6.1
>Reporter: Neelesh Srinivas Salian
>Assignee: Genmao Yu
>Priority: Minor
>  Labels: Docs, WebUI
> Fix For: 2.2.0
>
>
> The Web UI for the Spark History Server has an Environment tab that allows 
> you to view the environment for that job, with the runtime, Spark properties, 
> etc.
> How about adding an endpoint to the REST API that exposes this environment 
> information for that application?
> /applications/[app-id]/environment
> I added the Docs component too, so that we can spawn a subsequent 
> documentation addition to get it included in the API.
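As a usage sketch of the endpoint discussed above (the history server host/port and the application id are placeholders; the standard REST base path is /api/v1):

{code}
# Sketch only; hypothetical application id and a history server on localhost:18080.
import json
import urllib.request

app_id = "app-20170222120000-0001"
url = "http://localhost:18080/api/v1/applications/%s/environment" % app_id
with urllib.request.urlopen(url) as response:
    env = json.load(response)

# Expected fields include the runtime (JVM/Scala versions) and the Spark properties.
print(env.get("runtime"))
print(env.get("sparkProperties"))
{code}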






[jira] [Updated] (SPARK-19490) Hive partition columns are case-sensitive

2017-02-22 Thread cen yuhai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

cen yuhai updated SPARK-19490:
--
Description: 
The real partition columns are lower case (year, month, day).
{code}
Caused by: java.lang.RuntimeException: Expected only partition pruning 
predicates: (concat(YEAR#22, MONTH#23, DAY#24) = 20170202)
at scala.sys.package$.error(package.scala:27)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:985)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:976)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:95)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:976)
at 
org.apache.spark.sql.hive.MetastoreRelation.getHiveQlPartitions(MetastoreRelation.scala:161)
at 
org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$10.apply(HiveTableScanExec.scala:151)
at 
org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$10.apply(HiveTableScanExec.scala:150)
at org.apache.spark.util.Utils$.withDummyCallSite(Utils.scala:2472)
at 
org.apache.spark.sql.hive.execution.HiveTableScanExec.doExecute(HiveTableScanExec.scala:149)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
at 
org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:235)
at 
org.apache.spark.sql.execution.FilterExec.inputRDDs(basicPhysicalOperators.scala:124)
at 
org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:42)
at 
org.apache.spark.sql.execution.aggregate.HashAggregateExec.inputRDDs(HashAggregateExec.scala:141)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:368)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
at 
org.apache.spark.sql.execution.exchange.ShuffleExchange.prepareShuffleDependency(ShuffleExchange.scala:85)
at 
org.apache.spark.sql.execution.exchange.ExchangeCoordinator.doEstimationIfNecessary(ExchangeCoordinator.scala:213)
at 
org.apache.spark.sql.execution.exchange.ExchangeCoordinator.postShuffleRDD(ExchangeCoordinator.scala:261)
at 
org.apache.spark.sql.execution.exchange.ShuffleExchange$$anonfun$doExecute$1.apply(ShuffleExchange.scala:117)
at 
org.apache.spark.sql.execution.exchange.ShuffleExchange$$anonfun$doExecute$1.apply(ShuffleExchange.scala:112)
at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
{code}

These SQL statements can reproduce this bug:
CREATE TABLE partition_test (key Int) partitioned by (date string)
SELECT * FROM partition_test where DATE = '20170101'

  was:
The real partitions columns are lower case (year, month, day)
{code}
Caused by: java.lang.RuntimeException: Expected only partition pruning 
predicates: (concat(YEAR#22, MONTH#23, DAY#24) = 20170202)
at scala.sys.package$.error(package.scala:27)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:985)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:976)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:95)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:976)
at 
org.apache.spark.sql.hive.MetastoreRelation.getHiveQlPartitions(MetastoreRelation.scala:161)
at 
org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$10.apply(HiveTableScanExec.scala:151)
at 

[jira] [Updated] (SPARK-19490) Hive partition columns are case-sensitive

2017-02-22 Thread cen yuhai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

cen yuhai updated SPARK-19490:
--
Description: 
The real partition columns are lower case (year, month, day).
{code}
Caused by: java.lang.RuntimeException: Expected only partition pruning 
predicates: (concat(YEAR#22, MONTH#23, DAY#24) = 20170202)
at scala.sys.package$.error(package.scala:27)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:985)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:976)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:95)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:976)
at 
org.apache.spark.sql.hive.MetastoreRelation.getHiveQlPartitions(MetastoreRelation.scala:161)
at 
org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$10.apply(HiveTableScanExec.scala:151)
at 
org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$10.apply(HiveTableScanExec.scala:150)
at org.apache.spark.util.Utils$.withDummyCallSite(Utils.scala:2472)
at 
org.apache.spark.sql.hive.execution.HiveTableScanExec.doExecute(HiveTableScanExec.scala:149)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
at 
org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:235)
at 
org.apache.spark.sql.execution.FilterExec.inputRDDs(basicPhysicalOperators.scala:124)
at 
org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:42)
at 
org.apache.spark.sql.execution.aggregate.HashAggregateExec.inputRDDs(HashAggregateExec.scala:141)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:368)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
at 
org.apache.spark.sql.execution.exchange.ShuffleExchange.prepareShuffleDependency(ShuffleExchange.scala:85)
at 
org.apache.spark.sql.execution.exchange.ExchangeCoordinator.doEstimationIfNecessary(ExchangeCoordinator.scala:213)
at 
org.apache.spark.sql.execution.exchange.ExchangeCoordinator.postShuffleRDD(ExchangeCoordinator.scala:261)
at 
org.apache.spark.sql.execution.exchange.ShuffleExchange$$anonfun$doExecute$1.apply(ShuffleExchange.scala:117)
at 
org.apache.spark.sql.execution.exchange.ShuffleExchange$$anonfun$doExecute$1.apply(ShuffleExchange.scala:112)
at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
{code}

This SQL can reproduce this bug:
CREATE TABLE partition_test (key Int) partitioned by (date string)
SELECT * FROM partition_test where DATE = '20170101'

  was:
The real partitions columns are lower case (year, month, day)
{code}
Caused by: java.lang.RuntimeException: Expected only partition pruning 
predicates: (concat(YEAR#22, MONTH#23, DAY#24) = 20170202)
at scala.sys.package$.error(package.scala:27)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:985)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:976)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:95)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:976)
at 
org.apache.spark.sql.hive.MetastoreRelation.getHiveQlPartitions(MetastoreRelation.scala:161)
at 
org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$10.apply(HiveTableScanExec.scala:151)
at 

[jira] [Updated] (SPARK-19490) Hive partition columns are case-sensitive

2017-02-22 Thread cen yuhai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

cen yuhai updated SPARK-19490:
--
Description: 
The real partition columns are lower case (year, month, day).
{code}
Caused by: java.lang.RuntimeException: Expected only partition pruning 
predicates: (concat(YEAR#22, MONTH#23, DAY#24) = 20170202)
at scala.sys.package$.error(package.scala:27)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:985)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:976)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:95)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:976)
at 
org.apache.spark.sql.hive.MetastoreRelation.getHiveQlPartitions(MetastoreRelation.scala:161)
at 
org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$10.apply(HiveTableScanExec.scala:151)
at 
org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$10.apply(HiveTableScanExec.scala:150)
at org.apache.spark.util.Utils$.withDummyCallSite(Utils.scala:2472)
at 
org.apache.spark.sql.hive.execution.HiveTableScanExec.doExecute(HiveTableScanExec.scala:149)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
at 
org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:235)
at 
org.apache.spark.sql.execution.FilterExec.inputRDDs(basicPhysicalOperators.scala:124)
at 
org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:42)
at 
org.apache.spark.sql.execution.aggregate.HashAggregateExec.inputRDDs(HashAggregateExec.scala:141)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:368)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
at 
org.apache.spark.sql.execution.exchange.ShuffleExchange.prepareShuffleDependency(ShuffleExchange.scala:85)
at 
org.apache.spark.sql.execution.exchange.ExchangeCoordinator.doEstimationIfNecessary(ExchangeCoordinator.scala:213)
at 
org.apache.spark.sql.execution.exchange.ExchangeCoordinator.postShuffleRDD(ExchangeCoordinator.scala:261)
at 
org.apache.spark.sql.execution.exchange.ShuffleExchange$$anonfun$doExecute$1.apply(ShuffleExchange.scala:117)
at 
org.apache.spark.sql.execution.exchange.ShuffleExchange$$anonfun$doExecute$1.apply(ShuffleExchange.scala:112)
at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
{code}

This SQL can reproduce this bug:
CREATE TABLE partition_test (key Int) partitioned by (date string)
SELECT * FROM partition_test where DATE = '20170101'

  was:
The real partitions columns are lower case (year, month, day)
{code}
Caused by: java.lang.RuntimeException: Expected only partition pruning 
predicates: (concat(YEAR#22, MONTH#23, DAY#24) = 20170202)
at scala.sys.package$.error(package.scala:27)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:985)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:976)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:95)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:976)
at 
org.apache.spark.sql.hive.MetastoreRelation.getHiveQlPartitions(MetastoreRelation.scala:161)
at 
org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$10.apply(HiveTableScanExec.scala:151)
at 

[jira] [Assigned] (SPARK-15615) Support for creating a dataframe from JSON in Dataset[String]

2017-02-22 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-15615:
---

Assignee: PJ Fanning

> Support for creating a dataframe from JSON in Dataset[String] 
> --
>
> Key: SPARK-15615
> URL: https://issues.apache.org/jira/browse/SPARK-15615
> Project: Spark
>  Issue Type: Bug
>Reporter: PJ Fanning
>Assignee: PJ Fanning
> Fix For: 2.2.0
>
>
> We should deprecate DataFrameReader.scala json(rdd: RDD[String]) and support 
> json(ds: Dataset[String]) instead






[jira] [Resolved] (SPARK-15615) Support for creating a dataframe from JSON in Dataset[String]

2017-02-22 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-15615.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16895
[https://github.com/apache/spark/pull/16895]

> Support for creating a dataframe from JSON in Dataset[String] 
> --
>
> Key: SPARK-15615
> URL: https://issues.apache.org/jira/browse/SPARK-15615
> Project: Spark
>  Issue Type: Bug
>Reporter: PJ Fanning
> Fix For: 2.2.0
>
>
> We should deprecate DataFrameReader.scala json(rdd: RDD[String]) and support 
> json(ds: Dataset[String]) instead






[jira] [Resolved] (SPARK-19658) Set NumPartitions of RepartitionByExpression In Analyzer

2017-02-22 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-19658.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16988
[https://github.com/apache/spark/pull/16988]

> Set NumPartitions of RepartitionByExpression In Analyzer
> 
>
> Key: SPARK-19658
> URL: https://issues.apache.org/jira/browse/SPARK-19658
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.2.0
>
>
> Currently, if {{NumPartitions}} is not set, we set it using 
> `spark.sql.shuffle.partitions` in the Planner. However, this does not follow 
> the general resolution process. We should do it in the Analyzer so that the 
> Optimizer can use the value for optimization. 
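For reference, the user-visible behaviour being described, as a small PySpark check (200 is just the stock default of spark.sql.shuffle.partitions):

{code}
# Sketch: repartition-by-expression without an explicit numPartitions falls back
# to spark.sql.shuffle.partitions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(100)

print(spark.conf.get("spark.sql.shuffle.partitions"))  # e.g. '200'
print(df.repartition("id").rdd.getNumPartitions())     # uses that default
{code}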






[jira] [Commented] (SPARK-14480) Remove meaningless StringIteratorReader for CSV data source for better performance

2017-02-22 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15879629#comment-15879629
 ] 

Hyukjin Kwon commented on SPARK-14480:
--

This does not seem to be blocked by any of those, [~pes2009k]. I sent a PR for 
multi-line support: https://github.com/apache/spark/pull/16976

> Remove meaningless StringIteratorReader for CSV data source for better 
> performance
> --
>
> Key: SPARK-14480
> URL: https://issues.apache.org/jira/browse/SPARK-14480
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
> Fix For: 2.1.0
>
>
> Currently, the CSV data source reads and parses CSV data byte by byte (not line 
> by line).
> In {{CSVParser.scala}}, there is a {{Reader}} wrapping an {{Iterator}}. I think 
> it is made like this for better performance. However, it looks like there are 
> two problems.
> Firstly, it was actually not faster than processing line by line with an 
> {{Iterator}}, due to the additional logic needed to wrap the {{Iterator}} in a 
> {{Reader}}.
> Secondly, this brought a bit of complexity because it needs additional logic 
> to allow every line to be read byte by byte. So it was pretty difficult to 
> figure out parsing issues (e.g. SPARK-14103). Actually, almost all the code 
> in {{CSVParser}} might not be needed.
> I made a rough patch and tested this. The test results for the first problem 
> are below:
> h4. Results
> - Original codes with {{Reader}} wrapping {{Iterator}}
> ||End-to-end (ns)||Parse Time (ns)||
> | 14116265034 | 2008277960 |
> - New codes with {{Iterator}}
> ||End-to-end (ns)||Parse Time (ns)||
> | 13451699644 | 1549050564 |
> In more details,
> h4. Method
> - The TPC-H lineitem table is being tested.
> - Only the first 100 rows are collected (via {{df.take(100)}}).
> - End-to-end tests and parsing time tests are performed 10 times and averages 
> are calculated for each.
> h4. Environment
> - Machine: MacBook Pro Retina
> - CPU: 4
> - Memory: 8GB
> h4. Dataset
> - [TPC-H|http://www.tpc.org/tpch/] Lineitem Table created with factor 1 
> ([generate data|https://github.com/rxin/TPC-H-Hive/tree/master/dbgen)]) 
> - Size : 724.66 MB
> h4.  Test Codes
> - Function to measure time
> {code}
> def time[A](f: => A) = {
>   val s = System.nanoTime
>   val ret = f
>   println("time: "+(System.nanoTime-s)/1e6+"ms")
>   ret
> }
> {code}
> - End-to-end test
> {code}
> val path = "lineitem.tbl"
> val df = sqlContext
>   .read
>   .format("csv")
>   .option("header", "false")
>   .option("delimiter", "|")
>   .load(path)
> time(df.take(100))
> {code}
> - Parsing time test for original (in {{BulkCsvParser}})
> {code}
> ...
> // `reader` is a wrapper for an Iterator.
> private val reader = new StringIteratorReader(iter)
> parser.beginParsing(reader)
> ...
> time(parser.parseNext())
> ...
> {code}
> - Parsing time test for new (in {{BulkCsvParser}})
> {code}
> ...
> time(parser.parseLine(iter.next()))
> ...
> {code}






[jira] [Commented] (SPARK-19460) Update dataset used in R documentation, examples to reduce warning noise and confusions

2017-02-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15879614#comment-15879614
 ] 

Apache Spark commented on SPARK-19460:
--

User 'wangmiao1981' has created a pull request for this issue:
https://github.com/apache/spark/pull/17032

> Update dataset used in R documentation, examples to reduce warning noise and 
> confusions
> ---
>
> Key: SPARK-19460
> URL: https://issues.apache.org/jira/browse/SPARK-19460
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>
> When running the build, we get a bunch of warnings from using the `iris` 
> dataset, for example:
> Warning in FUN(X[[1L]], ...) :
> Use Sepal_Length instead of Sepal.Length as column name
> Warning in FUN(X[[2L]], ...) :
> Use Sepal_Width instead of Sepal.Width as column name
> Warning in FUN(X[[3L]], ...) :
> Use Petal_Length instead of Petal.Length as column name
> Warning in FUN(X[[4L]], ...) :
> Use Petal_Width instead of Petal.Width as column name
> Warning in FUN(X[[1L]], ...) :
> Use Sepal_Length instead of Sepal.Length as column name
> Warning in FUN(X[[2L]], ...) :
> Use Sepal_Width instead of Sepal.Width as column name
> Warning in FUN(X[[3L]], ...) :
> Use Petal_Length instead of Petal.Length as column name
> Warning in FUN(X[[4L]], ...) :
> Use Petal_Width instead of Petal.Width as column name
> Warning in FUN(X[[1L]], ...) :
> Use Sepal_Length instead of Sepal.Length as column name
> Warning in FUN(X[[2L]], ...) :
> Use Sepal_Width instead of Sepal.Width as column name
> Warning in FUN(X[[3L]], ...) :
> Use Petal_Length instead of Petal.Length as column name
> These warnings are the result of having `.` in the column names. For reference, 
> see SPARK-12191 and SPARK-11976. Since supporting that would involve changing 
> SQL, if we can't support it there then we should strongly consider using 
> another dataset without `.`, e.g. `cars`.
> And we should update this in the API docs (roxygen2 doc strings), vignettes, 
> programming guide, and R code examples.






[jira] [Assigned] (SPARK-19460) Update dataset used in R documentation, examples to reduce warning noise and confusions

2017-02-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19460:


Assignee: Apache Spark

> Update dataset used in R documentation, examples to reduce warning noise and 
> confusions
> ---
>
> Key: SPARK-19460
> URL: https://issues.apache.org/jira/browse/SPARK-19460
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Apache Spark
>
> When running the build, we get a bunch of warnings from using the `iris` 
> dataset, for example:
> Warning in FUN(X[[1L]], ...) :
> Use Sepal_Length instead of Sepal.Length as column name
> Warning in FUN(X[[2L]], ...) :
> Use Sepal_Width instead of Sepal.Width as column name
> Warning in FUN(X[[3L]], ...) :
> Use Petal_Length instead of Petal.Length as column name
> Warning in FUN(X[[4L]], ...) :
> Use Petal_Width instead of Petal.Width as column name
> Warning in FUN(X[[1L]], ...) :
> Use Sepal_Length instead of Sepal.Length as column name
> Warning in FUN(X[[2L]], ...) :
> Use Sepal_Width instead of Sepal.Width as column name
> Warning in FUN(X[[3L]], ...) :
> Use Petal_Length instead of Petal.Length as column name
> Warning in FUN(X[[4L]], ...) :
> Use Petal_Width instead of Petal.Width as column name
> Warning in FUN(X[[1L]], ...) :
> Use Sepal_Length instead of Sepal.Length as column name
> Warning in FUN(X[[2L]], ...) :
> Use Sepal_Width instead of Sepal.Width as column name
> Warning in FUN(X[[3L]], ...) :
> Use Petal_Length instead of Petal.Length as column name
> These warnings are the result of having `.` in the column names. For reference, 
> see SPARK-12191 and SPARK-11976. Since supporting that would involve changing 
> SQL, if we can't support it there then we should strongly consider using 
> another dataset without `.`, e.g. `cars`.
> And we should update this in the API docs (roxygen2 doc strings), vignettes, 
> programming guide, and R code examples.






[jira] [Assigned] (SPARK-19460) Update dataset used in R documentation, examples to reduce warning noise and confusions

2017-02-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19460:


Assignee: (was: Apache Spark)

> Update dataset used in R documentation, examples to reduce warning noise and 
> confusions
> ---
>
> Key: SPARK-19460
> URL: https://issues.apache.org/jira/browse/SPARK-19460
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>
> When running the build, we get a bunch of warnings from using the `iris` 
> dataset, for example:
> Warning in FUN(X[[1L]], ...) :
> Use Sepal_Length instead of Sepal.Length as column name
> Warning in FUN(X[[2L]], ...) :
> Use Sepal_Width instead of Sepal.Width as column name
> Warning in FUN(X[[3L]], ...) :
> Use Petal_Length instead of Petal.Length as column name
> Warning in FUN(X[[4L]], ...) :
> Use Petal_Width instead of Petal.Width as column name
> Warning in FUN(X[[1L]], ...) :
> Use Sepal_Length instead of Sepal.Length as column name
> Warning in FUN(X[[2L]], ...) :
> Use Sepal_Width instead of Sepal.Width as column name
> Warning in FUN(X[[3L]], ...) :
> Use Petal_Length instead of Petal.Length as column name
> Warning in FUN(X[[4L]], ...) :
> Use Petal_Width instead of Petal.Width as column name
> Warning in FUN(X[[1L]], ...) :
> Use Sepal_Length instead of Sepal.Length as column name
> Warning in FUN(X[[2L]], ...) :
> Use Sepal_Width instead of Sepal.Width as column name
> Warning in FUN(X[[3L]], ...) :
> Use Petal_Length instead of Petal.Length as column name
> These warnings are the result of having `.` in the column names. For reference, 
> see SPARK-12191 and SPARK-11976. Since supporting that would involve changing 
> SQL, if we can't support it there then we should strongly consider using 
> another dataset without `.`, e.g. `cars`.
> And we should update this in the API docs (roxygen2 doc strings), vignettes, 
> programming guide, and R code examples.






[jira] [Comment Edited] (SPARK-18452) Support History Server UI to use SPNEGO

2017-02-22 Thread Shi Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15875400#comment-15875400
 ] 

Shi Wang edited comment on SPARK-18452 at 2/23/17 12:42 AM:


Both Spark Thrift Server UI and History server UI could be configured to use 
SPNEGO, using "spark.ui.filters" and "spark.${filtername}.params". 


was (Author: wancy):
Both Spark Thrift Server UI and History server UI could be configured to use 
SPNEGO, using "spark.ui.filters" and "spark.${filtername}.params". 
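As an illustrative sketch of that mechanism (the filter class comes from Hadoop; the principal and keytab values are placeholders, and for the History Server the same keys would go into spark-defaults.conf rather than a SparkConf):

{code}
# Sketch only, based on the generic spark.ui.filters mechanism described above.
from pyspark import SparkConf

filter_class = "org.apache.hadoop.security.authentication.server.AuthenticationFilter"
conf = (SparkConf()
        .set("spark.ui.filters", filter_class)
        .set("spark.%s.params" % filter_class,
             "type=kerberos,"
             "kerberos.principal=HTTP/_HOST@EXAMPLE.COM,"
             "kerberos.keytab=/etc/security/keytabs/spnego.service.keytab"))
{code}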

> Support History Server UI to use SPNEGO
> ---
>
> Key: SPARK-18452
> URL: https://issues.apache.org/jira/browse/SPARK-18452
> Project: Spark
>  Issue Type: Task
>Affects Versions: 2.0.2
>Reporter: Shi Wang
>
> Currently almost all the Hadoop component UIs support SPNEGO: Hadoop, HBase, 
> Oozie.  
> The Spark UI should also support SPNEGO for security reasons.






[jira] [Assigned] (SPARK-19702) Add Suppress/Revive support to the Mesos Spark Dispatcher

2017-02-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19702:


Assignee: Apache Spark

> Add Suppress/Revive support to the Mesos Spark Dispatcher
> -
>
> Key: SPARK-19702
> URL: https://issues.apache.org/jira/browse/SPARK-19702
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos
>Affects Versions: 2.1.0
>Reporter: Michael Gummelt
>Assignee: Apache Spark
>
> Due to the problem described in 
> https://issues.apache.org/jira/browse/MESOS-6112, running more than 5 Mesos 
> frameworks concurrently can result in starvation.  For example, running 10 
> dispatchers could result in 5 of them getting all the offers, even if they 
> have no jobs to launch.  We must implement explicit SUPPRESS and REVIVE calls 
> in the Spark Dispatcher to solve this problem.






[jira] [Commented] (SPARK-19702) Add Suppress/Revive support to the Mesos Spark Dispatcher

2017-02-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15879602#comment-15879602
 ] 

Apache Spark commented on SPARK-19702:
--

User 'mgummelt' has created a pull request for this issue:
https://github.com/apache/spark/pull/17031

> Add Suppress/Revive support to the Mesos Spark Dispatcher
> -
>
> Key: SPARK-19702
> URL: https://issues.apache.org/jira/browse/SPARK-19702
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos
>Affects Versions: 2.1.0
>Reporter: Michael Gummelt
>
> Due to the problem described in 
> https://issues.apache.org/jira/browse/MESOS-6112, running more than 5 Mesos 
> frameworks concurrently can result in starvation.  For example, running 10 
> dispatchers could result in 5 of them getting all the offers, even if they 
> have no jobs to launch.  We must implement explicit SUPPRESS and REVIVE calls 
> in the Spark Dispatcher to solve this problem.






[jira] [Assigned] (SPARK-19702) Add Suppress/Revive support to the Mesos Spark Dispatcher

2017-02-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19702:


Assignee: (was: Apache Spark)

> Add Suppress/Revive support to the Mesos Spark Dispatcher
> -
>
> Key: SPARK-19702
> URL: https://issues.apache.org/jira/browse/SPARK-19702
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos
>Affects Versions: 2.1.0
>Reporter: Michael Gummelt
>
> Due to the problem described in 
> https://issues.apache.org/jira/browse/MESOS-6112, running more than 5 Mesos 
> frameworks concurrently can result in starvation.  For example, running 10 
> dispatchers could result in 5 of them getting all the offers, even if they 
> have no jobs to launch.  We must implement explicit SUPPRESS and REVIVE calls 
> in the Spark Dispatcher to solve this problem.






[jira] [Commented] (SPARK-19688) Spark on Yarn Credentials File set to different application directory

2017-02-22 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15879564#comment-15879564
 ] 

Saisai Shao commented on SPARK-19688:
-

I see, so we should exclude this configuration in checkpoint and make it 
re-configured after restarted.

> Spark on Yarn Credentials File set to different application directory
> -
>
> Key: SPARK-19688
> URL: https://issues.apache.org/jira/browse/SPARK-19688
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, YARN
>Affects Versions: 1.6.3
>Reporter: Devaraj Jonnadula
>Priority: Minor
>
> spark.yarn.credentials.file property is set to different application Id 
> instead of actual Application Id 






[jira] [Updated] (SPARK-19688) Spark on Yarn Credentials File set to different application directory

2017-02-22 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-19688:

Component/s: DStreams

> Spark on Yarn Credentials File set to different application directory
> -
>
> Key: SPARK-19688
> URL: https://issues.apache.org/jira/browse/SPARK-19688
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, YARN
>Affects Versions: 1.6.3
>Reporter: Devaraj Jonnadula
>Priority: Minor
>
> spark.yarn.credentials.file property is set to different application Id 
> instead of actual Application Id 






[jira] [Commented] (SPARK-18454) Changes to improve Nearest Neighbor Search for LSH

2017-02-22 Thread Mingjie Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15879539#comment-15879539
 ] 

Mingjie Tang commented on SPARK-18454:
--

[~yunn] I left some comments on the document. Building the index over the input 
data would be very useful if we do not shuffle the input data table. 

> Changes to improve Nearest Neighbor Search for LSH
> --
>
> Key: SPARK-18454
> URL: https://issues.apache.org/jira/browse/SPARK-18454
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yun Ni
>
> We all agree to do the following improvement to Multi-Probe NN Search:
> (1) Use approxQuantile to get the {{hashDistance}} threshold instead of doing 
> full sort on the whole dataset
> Currently we are still discussing the following:
> (1) What {{hashDistance}} (or Probing Sequence) we should use for {{MinHash}}
> (2) What are the issues and how we should change the current Nearest Neighbor 
> implementation






[jira] [Resolved] (SPARK-19652) REST API does not perform user auth for individual apps

2017-02-22 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-19652.

   Resolution: Fixed
 Assignee: Marcelo Vanzin
Fix Version/s: 2.2.0
   2.1.1
   2.0.3

> REST API does not perform user auth for individual apps
> ---
>
> Key: SPARK-19652
> URL: https://issues.apache.org/jira/browse/SPARK-19652
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
>
> (This goes back further than 2.0.0, btw.)
> The REST API currently only performs authorization at the root of the UI; 
> this works for live UIs, but not for the history server, where the root 
> allows everybody to read data. That means that currently any user can see any 
> application in the SHS through the REST API, when auth is enabled.
> Instead, the REST API should behave like the regular UI and perform 
> authentication at the app level too.






[jira] [Created] (SPARK-19702) Add Suppress/Revive support to the Mesos Spark Dispatcher

2017-02-22 Thread Michael Gummelt (JIRA)
Michael Gummelt created SPARK-19702:
---

 Summary: Add Suppress/Revive support to the Mesos Spark Dispatcher
 Key: SPARK-19702
 URL: https://issues.apache.org/jira/browse/SPARK-19702
 Project: Spark
  Issue Type: New Feature
  Components: Mesos
Affects Versions: 2.1.0
Reporter: Michael Gummelt


Due to the problem described in https://issues.apache.org/jira/browse/MESOS-6112,
running more than 5 Mesos frameworks concurrently can result in starvation. For
example, running 10 dispatchers could result in 5 of them getting all the offers,
even if they have no jobs to launch. We must implement explicit SUPPRESS and
REVIVE calls in the Spark Dispatcher to solve this problem.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19703) Add Suppress/Revive support to the Mesos Spark Driver

2017-02-22 Thread Michael Gummelt (JIRA)
Michael Gummelt created SPARK-19703:
---

 Summary: Add Suppress/Revive support to the Mesos Spark Driver
 Key: SPARK-19703
 URL: https://issues.apache.org/jira/browse/SPARK-19703
 Project: Spark
  Issue Type: New Feature
  Components: Mesos
Affects Versions: 2.1.0
Reporter: Michael Gummelt


Due to the problem described in https://issues.apache.org/jira/browse/MESOS-6112,
running more than 5 Mesos frameworks concurrently can result in starvation. For
example, running 10 jobs could result in 5 of them getting all the offers, even
after they've launched all their executors. This leads to starvation of the other
jobs. We must implement explicit SUPPRESS and REVIVE calls in the Spark driver to
solve this problem.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19701) the `in` operator in pyspark is broken

2017-02-22 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-19701:

Description: 
{code}
>>> textFile = spark.read.text("/Users/cloud/dev/spark/README.md")
>>> linesWithSpark = textFile.filter("Spark" in textFile.value)
Traceback (most recent call last):
  File "", line 1, in 
  File "/Users/cloud/product/spark/python/pyspark/sql/column.py", line 426, in 
__nonzero__
raise ValueError("Cannot convert column into bool: please use '&' for 
'and', '|' for 'or', "
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 
'or', '~' for 'not' when building DataFrame boolean expressions.
{code}
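
A minimal workaround sketch (assuming a local SparkSession and a hypothetical README.md path): since Python's {{in}} forces the result to a bool, build the predicate as an explicit Column expression instead, for example with {{Column.like}}.

{code}
# Minimal sketch, assuming a local SparkSession and a hypothetical README.md path.
# `in` invokes Column.__contains__ and then coerces the result to bool, which a
# Column cannot provide; build the predicate as a Column expression instead.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
text_file = spark.read.text("README.md")

# Instead of: "Spark" in text_file.value  (raises the ValueError above)
lines_with_spark = text_file.filter(text_file.value.like("%Spark%"))
print(lines_with_spark.count())
{code}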

> the `in` operator in pyspark is broken
> --
>
> Key: SPARK-19701
> URL: https://issues.apache.org/jira/browse/SPARK-19701
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>
> {code}
> >>> textFile = spark.read.text("/Users/cloud/dev/spark/README.md")
> >>> linesWithSpark = textFile.filter("Spark" in textFile.value)
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/Users/cloud/product/spark/python/pyspark/sql/column.py", line 426, 
> in __nonzero__
> raise ValueError("Cannot convert column into bool: please use '&' for 
> 'and', '|' for 'or', "
> ValueError: Cannot convert column into bool: please use '&' for 'and', '|' 
> for 'or', '~' for 'not' when building DataFrame boolean expressions.
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19701) the `in` operator in pyspark is broken

2017-02-22 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-19701:
---

 Summary: the `in` operator in pyspark is broken
 Key: SPARK-19701
 URL: https://issues.apache.org/jira/browse/SPARK-19701
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.2.0
Reporter: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19459) ORC tables cannot be read when they contain char/varchar columns

2017-02-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15879441#comment-15879441
 ] 

Apache Spark commented on SPARK-19459:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/17030

> ORC tables cannot be read when they contain char/varchar columns
> 
>
> Key: SPARK-19459
> URL: https://issues.apache.org/jira/browse/SPARK-19459
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
> Fix For: 2.2.0
>
>
> Reading from an ORC table which contains char/varchar columns can fail if the
> table has been created using Spark. This is caused by the fact that Spark
> internally replaces char and varchar columns with a string column; this
> causes the ORC reader to use the wrong reader, which eventually causes a
> ClassCastException.
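
A minimal reproduction sketch, under assumptions (a Hive-enabled SparkSession; the table name {{t}} is purely illustrative):

{code}
# Minimal sketch, assuming a Hive-enabled SparkSession; the table name is illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("CREATE TABLE t (c CHAR(10), v VARCHAR(10)) STORED AS ORC")
spark.sql("INSERT INTO t VALUES ('a', 'b')")

# On affected versions, reading the table back can hit the ClassCastException,
# because the char/varchar columns were written with a string schema internally.
spark.table("t").show()
{code}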



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16617) Upgrade to Avro 1.8.x

2017-02-22 Thread Michael Heuer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15879431#comment-15879431
 ] 

Michael Heuer commented on SPARK-16617:
---

Any thoughts as to what the Fix Version/s for this should be?

From what I can see, Apache Spark git HEAD has already bumped to Parquet version
1.8.2, and this may force the issue, since Parquet 1.8.2 calls the new method
Schema.getLogicalType, which is not present in 1.7.x versions of Avro.

> Upgrade to Avro 1.8.x
> -
>
> Key: SPARK-16617
> URL: https://issues.apache.org/jira/browse/SPARK-16617
> Project: Spark
>  Issue Type: Improvement
>Reporter: Ben McCann
>
> Avro 1.8 makes Avro objects serializable so that you can easily have an RDD 
> containing Avro objects.
> See https://issues.apache.org/jira/browse/AVRO-1502



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12664) Expose raw prediction scores in MultilayerPerceptronClassificationModel

2017-02-22 Thread Nicolas Drizard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15879428#comment-15879428
 ] 

Nicolas Drizard commented on SPARK-12664:
-

I'd like to upvote this as an important feature. Is anyone currently working on it,
[~yanboliang]?

Thanks!

> Expose raw prediction scores in MultilayerPerceptronClassificationModel
> ---
>
> Key: SPARK-12664
> URL: https://issues.apache.org/jira/browse/SPARK-12664
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Robert Dodier
>Assignee: Yanbo Liang
>
> In 
> org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel, 
> there isn't any way to get raw prediction scores; only an integer output 
> (from 0 to #classes - 1) is available via the `predict` method. 
> `mlpModel.predict` is called within the class to get the raw score, but 
> `mlpModel` is private so that isn't available to outside callers.
> The raw score is useful when the user wants to interpret the classifier 
> output as a probability. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19644) Memory leak in Spark Streaming

2017-02-22 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15879397#comment-15879397
 ] 

Shixiong Zhu commented on SPARK-19644:
--

[~deenbandhu] Do you use Scala 2.10 or Scala 2.11?

> Memory leak in Spark Streaming
> --
>
> Key: SPARK-19644
> URL: https://issues.apache.org/jira/browse/SPARK-19644
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
> Environment: 3 AWS EC2 c3.xLarge
> Number of cores - 3
> Number of executors 3 
> Memory to each executor 2GB
>Reporter: Deenbandhu Agarwal
>Priority: Critical
>  Labels: memory_leak, performance
> Attachments: Dominator_tree.png, heapdump.png, Path2GCRoot.png
>
>
> I am using Spark Streaming in production for some aggregation, fetching data
> from Cassandra and saving data back to Cassandra.
> I see a gradual increase in old-generation heap capacity from 1161216 bytes
> to 1397760 bytes over a period of six hours.
> After 50 hours of processing, the number of instances of class
> scala.collection.immutable.$colon$colon increased to 12,811,793, which is a
> huge number.
> I think this is a clear case of a memory leak.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19554) YARN backend should use history server URL for tracking when UI is disabled

2017-02-22 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-19554.

   Resolution: Fixed
 Assignee: Marcelo Vanzin
Fix Version/s: 2.2.0

> YARN backend should use history server URL for tracking when UI is disabled
> ---
>
> Key: SPARK-19554
> URL: https://issues.apache.org/jira/browse/SPARK-19554
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 2.2.0
>
>
> Currently, if the app has disabled its UI, Spark does not set a tracking URL 
> in YARN. The UI is still available, even if with a lag, in the history 
> server, if it's configured. We should use that as the tracking URL in these 
> cases, instead of letting YARN show its default page for applications without 
> a UI.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19573) Make NaN/null handling consistent in approxQuantile

2017-02-22 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15879387#comment-15879387
 ] 

Timothy Hunter commented on SPARK-19573:


I do not have too strong an opinion, as long as:
 1. we are consistent within Spark, or
 2. we follow the standard for numerical stuff (IEEE-754)

I am not sure what the standard is for SQL, though.
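
For reference, a minimal sketch (the DataFrame here is hypothetical) of the two call forms whose NaN/null handling is under discussion:

{code}
# Minimal sketch with a hypothetical DataFrame. It contrasts the single- and
# multi-column forms of approxQuantile; per the ticket, the multi-column form
# currently drops rows containing *any* NaN/null before computing, so results
# can differ from the single-column calls.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1.0, 2.0), (2.0, None), (float("nan"), 3.0)], ["a", "b"])

print(df.approxQuantile("a", [0.5], 0.0))         # per-column handling
print(df.approxQuantile(["a", "b"], [0.5], 0.0))  # rows with any NaN/null dropped first
{code}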


> Make NaN/null handling consistent in approxQuantile
> ---
>
> Key: SPARK-19573
> URL: https://issues.apache.org/jira/browse/SPARK-19573
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: zhengruifeng
>
> As discussed in https://github.com/apache/spark/pull/16776, this jira is used 
> to track the following issue:
> Multi-column version of approxQuantile drops the rows containing *any*
> NaN/null, so the results are not consistent with the outputs of the
> single-column version.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19652) REST API does not perform user auth for individual apps

2017-02-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15879283#comment-15879283
 ] 

Apache Spark commented on SPARK-19652:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/17029

> REST API does not perform user auth for individual apps
> ---
>
> Key: SPARK-19652
> URL: https://issues.apache.org/jira/browse/SPARK-19652
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Marcelo Vanzin
>
> (This goes back further than 2.0.0, btw.)
> The REST API currently only performs authorization at the root of the UI; 
> this works for live UIs, but not for the history server, where the root 
> allows everybody to read data. That means that currently any user can see any 
> application in the SHS through the REST API, when auth is enabled.
> Instead, the REST API should behave like the regular UI and perform 
> authentication at the app level too.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19648) Unable to access column containing '.' for approxQuantile function on DataFrame

2017-02-22 Thread John Compitello (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Compitello updated SPARK-19648:

Affects Version/s: 2.1.0

> Unable to access column containing '.' for approxQuantile function on 
> DataFrame
> ---
>
> Key: SPARK-19648
> URL: https://issues.apache.org/jira/browse/SPARK-19648
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.2, 2.1.0
> Environment: Running spark in an ipython prompt on Mac OSX. 
>Reporter: John Compitello
>
> It seems that the approxQuantile method does not offer any way to access a
> column with a period in its name. I am aware of the backtick solution, but
> it does not work in this scenario.
> For example, let's say I have a column named 'va.x'. Passing approxQuantile
> this string without backticks results in the following error:
> 'Cannot resolve column name '`va.x`' given input columns: '
> Note that backticks seem to have been automatically inserted, but it cannot
> find the column name regardless.
> If I do include backticks, I get a different error. An
> IllegalArgumentException is thrown as follows:
> "IllegalArgumentException: 'Field "`va.x`" does not exist."



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18278) Support native submission of spark jobs to a kubernetes cluster

2017-02-22 Thread Matt Cheah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15879205#comment-15879205
 ] 

Matt Cheah commented on SPARK-18278:


[~hkothari] I created SPARK-19700 to track the pluggable scheduler API design.

> Support native submission of spark jobs to a kubernetes cluster
> ---
>
> Key: SPARK-18278
> URL: https://issues.apache.org/jira/browse/SPARK-18278
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Deploy, Documentation, Scheduler, Spark Core
>Reporter: Erik Erlandson
> Attachments: SPARK-18278 - Spark on Kubernetes Design Proposal.pdf
>
>
> A new Apache Spark sub-project that enables native support for submitting
> Spark applications to a Kubernetes cluster. The submitted application runs
> in a driver executing on a Kubernetes pod, and executor lifecycles are also
> managed as pods.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19700) Design an API for pluggable scheduler implementations

2017-02-22 Thread Matt Cheah (JIRA)
Matt Cheah created SPARK-19700:
--

 Summary: Design an API for pluggable scheduler implementations
 Key: SPARK-19700
 URL: https://issues.apache.org/jira/browse/SPARK-19700
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.0
Reporter: Matt Cheah


One point that was brought up in discussing SPARK-18278 was that schedulers 
cannot easily be added to Spark without forking the whole project. The main 
reason is that much of the scheduler's behavior fundamentally depends on the 
CoarseGrainedSchedulerBackend class, which is not part of the public API of 
Spark and is in fact quite a complex module. As resource management and 
allocation continues to evolve, Spark will need to be integrated with more 
cluster managers, but maintaining support for all possible allocators in the 
Spark project would be untenable. Furthermore, it would be impossible for Spark 
to support proprietary frameworks that are developed by specific users for 
their other particular use cases.

Therefore, this ticket proposes making scheduler implementations fully 
pluggable. The idea is that Spark will provide a Java/Scala interface that is 
to be implemented by a scheduler that is backed by the cluster manager of 
interest. The user can compile their scheduler's code into a JAR that is placed 
on the driver's classpath. Finally, as is the case in the current world, the 
scheduler implementation is selected and dynamically loaded depending on the 
user's provided master URL.

Determining the correct API is the most challenging problem. The current 
CoarseGrainedSchedulerBackend handles many responsibilities, some of which will 
be common across all cluster managers, and some which will be specific to a 
particular cluster manager. For example, the particular mechanism for creating 
the executor processes will differ between YARN and Mesos, but, once these 
executors have started running, the means to submit tasks to them over the 
Netty RPC is identical across the board.

We must also consider a plugin model and interface for submitting the 
application as well, because different cluster managers support different 
configuration options, and thus the driver must be bootstrapped accordingly. 
For example, in YARN mode the application and Hadoop configuration must be 
packaged and shipped to the distributed cache prior to launching the job. A 
prototype of a Kubernetes implementation starts a Kubernetes pod that runs the 
driver in cluster mode.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19666) Exception when calling createDataFrame with typed RDD

2017-02-22 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-19666.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17013
[https://github.com/apache/spark/pull/17013]

> Exception when calling createDataFrame with typed RDD
> -
>
> Key: SPARK-19666
> URL: https://issues.apache.org/jira/browse/SPARK-19666
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Colin Breame
> Fix For: 2.2.0
>
>
> The following code:
> {code}
> var tmp = sc.parallelize(Seq(new __Message()))
> val spark = SparkSession.builder().getOrCreate()
> var df = spark.createDataFrame(tmp, classOf[__Message])
> {code}
> Produces this error message.
> {code}
> Exception in thread "main" java.lang.NullPointerException
>   at org.spark_project.guava.reflect.TypeToken.method(TypeToken.java:465)
>   at 
> org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:126)
>   at 
> org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$inferDataType(JavaTypeInference.scala:125)
>   at 
> org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:127)
>   at 
> org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$inferDataType(JavaTypeInference.scala:125)
>   at 
> org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:127)
>   at 
> org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$inferDataType(JavaTypeInference.scala:125)
>   at 
> org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:127)
>   at 
> org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$inferDataType(JavaTypeInference.scala:125)
>   at 
> org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:127)
>   at 
> org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)

[jira] [Assigned] (SPARK-19666) Exception when calling createDataFrame with typed RDD

2017-02-22 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-19666:
---

Assignee: Hyukjin Kwon

> Exception when calling createDataFrame with typed RDD
> -
>
> Key: SPARK-19666
> URL: https://issues.apache.org/jira/browse/SPARK-19666
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Colin Breame
>Assignee: Hyukjin Kwon
> Fix For: 2.2.0
>
>
> The following code:
> {code}
> var tmp = sc.parallelize(Seq(new __Message()))
> val spark = SparkSession.builder().getOrCreate()
> var df = spark.createDataFrame(tmp, classOf[__Message])
> {code}
> Produces this error message.
> {code}
> Exception in thread "main" java.lang.NullPointerException
>   at org.spark_project.guava.reflect.TypeToken.method(TypeToken.java:465)
>   at 
> org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:126)
>   at 
> org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$inferDataType(JavaTypeInference.scala:125)
>   at 
> org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:127)
>   at 
> org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$inferDataType(JavaTypeInference.scala:125)
>   at 
> org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:127)
>   at 
> org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$inferDataType(JavaTypeInference.scala:125)
>   at 
> org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:127)
>   at 
> org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$inferDataType(JavaTypeInference.scala:125)
>   at 
> org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:127)
>   at 
> org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> 

[jira] [Commented] (SPARK-19688) Spark on Yarn Credentials File set to different application directory

2017-02-22 Thread Devaraj Jonnadula (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15879175#comment-15879175
 ] 

Devaraj Jonnadula commented on SPARK-19688:
---

When the Spark application is restarted, spark.yarn.credentials.file is set to
hdfs://node/user/*/.sparkStaging/application_someotherApplicationId/credentials-d8c33609-72f9-4770-9e50-aab848424e62

This is a streaming application with check-pointing enabled.

> Spark on Yarn Credentials File set to different application directory
> -
>
> Key: SPARK-19688
> URL: https://issues.apache.org/jira/browse/SPARK-19688
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.3
>Reporter: Devaraj Jonnadula
>Priority: Minor
>
> The spark.yarn.credentials.file property is set to a different application ID
> instead of the actual application ID.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19699) createOrReplaceTable does not always replace an existing table of the same name

2017-02-22 Thread Barry Becker (JIRA)
Barry Becker created SPARK-19699:


 Summary: createOrReplaceTable does not always replace an existing 
table of the same name
 Key: SPARK-19699
 URL: https://issues.apache.org/jira/browse/SPARK-19699
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.0
Reporter: Barry Becker
Priority: Minor


There are cases when dataframe.createOrReplaceTempView does not replace an 
existing table with the same name.
Please also refer to my [related stack-overflow 
post|http://stackoverflow.com/questions/42371690/in-spark-2-1-how-come-the-dataframe-createoreplacetemptable-does-not-replace-an].

To reproduce, do
{code}
df.collect()
df.createOrReplaceTempView("foo1")
df.sqlContext.cacheTable("foo1")
{code}

with one DataFrame, and then do exactly the same thing with a different
DataFrame. Then look in the storage tab in the Spark UI and see multiple
entries for "foo1" in the "RDD Name" column.

Maybe I am misunderstanding, but this causes two apparent problems:
1) How do you know which table will be retrieved with sqlContext.table("foo1")?
2) The duplicate entries represent a memory leak. I have tried calling
dropTempTable(existingName) first, but then have occasionally seen a FAILFAST
error when trying to use the table. It's as if dropTempTable is not
synchronous, but maybe I am doing something wrong.
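
A hedged mitigation sketch (the DataFrames below are illustrative, and the explicit uncache step is an assumption about what avoids the duplicate storage entries):

{code}
# Minimal sketch; the DataFrames are illustrative, and explicitly uncaching
# before re-registering is an assumption about how to avoid duplicate "foo1"
# entries in the storage tab.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df1 = spark.range(10).toDF("x")
df2 = spark.range(20).toDF("x")

df1.createOrReplaceTempView("foo1")
spark.catalog.cacheTable("foo1")

# Before replacing the view with a different DataFrame, drop the cached entry
# tied to the first plan, then register and cache the new one.
spark.catalog.uncacheTable("foo1")
df2.createOrReplaceTempView("foo1")
spark.catalog.cacheTable("foo1")
{code}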



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19616) weightCol and aggregationDepth should be improved for some SparkR APIs

2017-02-22 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-19616.
--
  Resolution: Fixed
Assignee: Miao Wang
   Fix Version/s: 2.2.0
Target Version/s: 2.2.0

> weightCol and aggregationDepth should be improved for some SparkR APIs 
> ---
>
> Key: SPARK-19616
> URL: https://issues.apache.org/jira/browse/SPARK-19616
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Miao Wang
>Assignee: Miao Wang
>Priority: Minor
> Fix For: 2.2.0
>
>
> When doing SPARK-19456, we found that "" should be considered a NULL column
> name and should not be set. aggregationDepth should be exposed as an expert
> parameter.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19698) Race condition in stale attempt task completion vs current attempt task completion when task is doing persistent state changes

2017-02-22 Thread Jisoo Kim (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15879068#comment-15879068
 ] 

Jisoo Kim commented on SPARK-19698:
---

I don't think it's only limited to Mesos coarse-grained executor. 
https://github.com/metamx/spark/pull/25 this might be a solution, and we're 
doing more investigating/testing. 

> Race condition in stale attempt task completion vs current attempt task 
> completion when task is doing persistent state changes
> --
>
> Key: SPARK-19698
> URL: https://issues.apache.org/jira/browse/SPARK-19698
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.0.0
>Reporter: Charles Allen
>
> We have encountered a strange scenario in our production environment. Below 
> is the best guess we have right now as to what's going on.
> Potentially, the final stage of a job has a failure in one of the tasks (such 
> as OOME on the executor) which can cause tasks for that stage to be 
> relaunched in a second attempt.
> https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1155
> keeps track of which tasks have been completed, but does NOT keep track of 
> which attempt those tasks were completed in. As such, we have encountered a 
> scenario where a particular task gets executed twice in different stage 
> attempts, and the DAGScheduler does not consider if the second attempt is 
> still running. This means if the first task attempt succeeded, the second 
> attempt can be cancelled part-way through its run cycle if all other tasks 
> (including the prior failed) are completed successfully.
> What this means is that if a task is manipulating some state somewhere (for 
> example: a upload-to-temporary-file-location, then delete-then-move on an 
> underlying s3n storage implementation) the driver can improperly shutdown the 
> running (2nd attempt) task between state manipulations, leaving the 
> persistent state in a bad state since the 2nd attempt never got to complete 
> its manipulations, and was terminated prematurely at some arbitrary point in 
> its state change logic (ex: finished the delete but not the move).
> This is using the mesos coarse grained executor. It is unclear if this 
> behavior is limited to the mesos coarse grained executor or not.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19698) Race condition in stale attempt task completion vs current attempt task completion when task is doing persistent state changes

2017-02-22 Thread Jisoo Kim (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15879068#comment-15879068
 ] 

Jisoo Kim edited comment on SPARK-19698 at 2/22/17 7:43 PM:


I don't think it's only limited to Mesos coarse-grained executor. 
https://github.com/metamx/spark/pull/25 might be a solution, and we're doing 
more investigating/testing. 


was (Author: jisookim0...@gmail.com):
I don't think it's only limited to Mesos coarse-grained executor. 
https://github.com/metamx/spark/pull/25 this might be a solution, and we're 
doing more investigating/testing. 

> Race condition in stale attempt task completion vs current attempt task 
> completion when task is doing persistent state changes
> --
>
> Key: SPARK-19698
> URL: https://issues.apache.org/jira/browse/SPARK-19698
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.0.0
>Reporter: Charles Allen
>
> We have encountered a strange scenario in our production environment. Below 
> is the best guess we have right now as to what's going on.
> Potentially, the final stage of a job has a failure in one of the tasks (such 
> as OOME on the executor) which can cause tasks for that stage to be 
> relaunched in a second attempt.
> https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1155
> keeps track of which tasks have been completed, but does NOT keep track of 
> which attempt those tasks were completed in. As such, we have encountered a 
> scenario where a particular task gets executed twice in different stage 
> attempts, and the DAGScheduler does not consider if the second attempt is 
> still running. This means if the first task attempt succeeded, the second 
> attempt can be cancelled part-way through its run cycle if all other tasks 
> (including the prior failed) are completed successfully.
> What this means is that if a task is manipulating some state somewhere (for 
> example: a upload-to-temporary-file-location, then delete-then-move on an 
> underlying s3n storage implementation) the driver can improperly shutdown the 
> running (2nd attempt) task between state manipulations, leaving the 
> persistent state in a bad state since the 2nd attempt never got to complete 
> its manipulations, and was terminated prematurely at some arbitrary point in 
> its state change logic (ex: finished the delete but not the move).
> This is using the mesos coarse grained executor. It is unclear if this 
> behavior is limited to the mesos coarse grained executor or not.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19698) Race condition in stale attempt task completion vs current attempt task completion when task is doing persistent state changes

2017-02-22 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15878959#comment-15878959
 ] 

Charles Allen edited comment on SPARK-19698 at 2/22/17 6:46 PM:


I *think* this is due to the driver not having the concept of a "critical 
section" for code being executed, meaning that you can't declare a portion of 
the code being run as "I'm in a non-atomic or critical command region, please 
let me finish" 


was (Author: drcrallen):
I *think* this is due to the driver not having the concept of a "critical 
section" for code being executed, meaning that you can't declare a portion of 
the code being run as "I'm in a non-idempotent command region, please let me 
finish" 

> Race condition in stale attempt task completion vs current attempt task 
> completion when task is doing persistent state changes
> --
>
> Key: SPARK-19698
> URL: https://issues.apache.org/jira/browse/SPARK-19698
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.0.0
>Reporter: Charles Allen
>
> We have encountered a strange scenario in our production environment. Below 
> is the best guess we have right now as to what's going on.
> Potentially, the final stage of a job has a failure in one of the tasks (such 
> as OOME on the executor) which can cause tasks for that stage to be 
> relaunched in a second attempt.
> https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1155
> keeps track of which tasks have been completed, but does NOT keep track of 
> which attempt those tasks were completed in. As such, we have encountered a 
> scenario where a particular task gets executed twice in different stage 
> attempts, and the DAGScheduler does not consider if the second attempt is 
> still running. This means if the first task attempt succeeded, the second 
> attempt can be cancelled part-way through its run cycle if all other tasks 
> (including the prior failed) are completed successfully.
> What this means is that if a task is manipulating some state somewhere (for 
> example: a upload-to-temporary-file-location, then delete-then-move on an 
> underlying s3n storage implementation) the driver can improperly shutdown the 
> running (2nd attempt) task between state manipulations, leaving the 
> persistent state in a bad state since the 2nd attempt never got to complete 
> its manipulations, and was terminated prematurely at some arbitrary point in 
> its state change logic (ex: finished the delete but not the move).
> This is using the mesos coarse grained executor. It is unclear if this 
> behavior is limited to the mesos coarse grained executor or not.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19698) Race condition in stale attempt task completion vs current attempt task completion when task is doing persistent state changes

2017-02-22 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15878959#comment-15878959
 ] 

Charles Allen commented on SPARK-19698:
---

I *think* this is due to the driver not having the concept of a "critical 
section" for code being executed, meaning that you can't declare a portion of 
the code being run as "I'm in a non-idempotent command region, please let me 
finish" 

> Race condition in stale attempt task completion vs current attempt task 
> completion when task is doing persistent state changes
> --
>
> Key: SPARK-19698
> URL: https://issues.apache.org/jira/browse/SPARK-19698
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.0.0
>Reporter: Charles Allen
>
> We have encountered a strange scenario in our production environment. Below 
> is the best guess we have right now as to what's going on.
> Potentially, the final stage of a job has a failure in one of the tasks (such 
> as OOME on the executor) which can cause tasks for that stage to be 
> relaunched in a second attempt.
> https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1155
> keeps track of which tasks have been completed, but does NOT keep track of 
> which attempt those tasks were completed in. As such, we have encountered a 
> scenario where a particular task gets executed twice in different stage 
> attempts, and the DAGScheduler does not consider if the second attempt is 
> still running. This means if the first task attempt succeeded, the second 
> attempt can be cancelled part-way through its run cycle if all other tasks 
> (including the prior failed) are completed successfully.
> What this means is that if a task is manipulating some state somewhere (for 
> example: a upload-to-temporary-file-location, then delete-then-move on an 
> underlying s3n storage implementation) the driver can improperly shutdown the 
> running (2nd attempt) task between state manipulations, leaving the 
> persistent state in a bad state since the 2nd attempt never got to complete 
> its manipulations, and was terminated prematurely at some arbitrary point in 
> its state change logic (ex: finished the delete but not the move).
> This is using the mesos coarse grained executor. It is unclear if this 
> behavior is limited to the mesos coarse grained executor or not.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19698) Race condition in stale attempt task completion vs current attempt task completion when task is doing persistent state changes

2017-02-22 Thread Charles Allen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Charles Allen updated SPARK-19698:
--
Summary: Race condition in stale attempt task completion vs current attempt 
task completion when task is doing persistent state changes  (was: Race 
condition in stale attempt task completion vs current attempt task completion)

> Race condition in stale attempt task completion vs current attempt task 
> completion when task is doing persistent state changes
> --
>
> Key: SPARK-19698
> URL: https://issues.apache.org/jira/browse/SPARK-19698
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.0.0
>Reporter: Charles Allen
>
> We have encountered a strange scenario in our production environment. Below 
> is the best guess we have right now as to what's going on.
> Potentially, the final stage of a job has a failure in one of the tasks (such 
> as OOME on the executor) which can cause tasks for that stage to be 
> relaunched in a second attempt.
> https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1155
> keeps track of which tasks have been completed, but does NOT keep track of 
> which attempt those tasks were completed in. As such, we have encountered a 
> scenario where a particular task gets executed twice in different stage 
> attempts, and the DAGScheduler does not consider if the second attempt is 
> still running. This means if the first task attempt succeeded, the second 
> attempt can be cancelled part-way through its run cycle if all other tasks 
> (including the prior failed) are completed successfully.
> What this means is that if a task is manipulating some state somewhere (for 
> example: a upload-to-temporary-file-location, then delete-then-move on an 
> underlying s3n storage implementation) the driver can improperly shutdown the 
> running (2nd attempt) task between state manipulations, leaving the 
> persistent state in a bad state since the 2nd attempt never got to complete 
> its manipulations, and was terminated prematurely at some arbitrary point in 
> its state change logic (ex: finished the delete but not the move).
> This is using the mesos coarse grained executor. It is unclear if this 
> behavior is limited to the mesos coarse grained executor or not.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19698) Race condition in stale attempt task completion vs current attempt task completion

2017-02-22 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15878927#comment-15878927
 ] 

Charles Allen commented on SPARK-19698:
---

[~jisookim0...@gmail.com] has been investigating this on our side.

> Race condition in stale attempt task completion vs current attempt task 
> completion
> --
>
> Key: SPARK-19698
> URL: https://issues.apache.org/jira/browse/SPARK-19698
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.0.0
>Reporter: Charles Allen
>
> We have encountered a strange scenario in our production environment. Below 
> is the best guess we have right now as to what's going on.
> Potentially, the final stage of a job has a failure in one of the tasks (such 
> as OOME on the executor) which can cause tasks for that stage to be 
> relaunched in a second attempt.
> https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1155
> keeps track of which tasks have been completed, but does NOT keep track of 
> which attempt those tasks were completed in. As such, we have encountered a 
> scenario where a particular task gets executed twice in different stage 
> attempts, and the DAGScheduler does not consider if the second attempt is 
> still running. This means if the first task attempt succeeded, the second 
> attempt can be cancelled part-way through its run cycle if all other tasks 
> (including the prior failed) are completed successfully.
> What this means is that if a task is manipulating some state somewhere (for 
> example: a upload-to-temporary-file-location, then delete-then-move on an 
> underlying s3n storage implementation) the driver can improperly shutdown the 
> running (2nd attempt) task between state manipulations, leaving the 
> persistent state in a bad state since the 2nd attempt never got to complete 
> its manipulations, and was terminated prematurely at some arbitrary point in 
> its state change logic (ex: finished the delete but not the move).
> This is using the mesos coarse grained executor. It is unclear if this 
> behavior is limited to the mesos coarse grained executor or not.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19698) Race condition in stale attempt task completion vs current attempt task completion

2017-02-22 Thread Charles Allen (JIRA)
Charles Allen created SPARK-19698:
-

 Summary: Race condition in stale attempt task completion vs 
current attempt task completion
 Key: SPARK-19698
 URL: https://issues.apache.org/jira/browse/SPARK-19698
 Project: Spark
  Issue Type: Bug
  Components: Mesos, Spark Core
Affects Versions: 2.0.0
Reporter: Charles Allen


We have encountered a strange scenario in our production environment. Below is 
the best guess we have right now as to what's going on.

Potentially, the final stage of a job has a failure in one of the tasks (such 
as OOME on the executor) which can cause tasks for that stage to be relaunched 
in a second attempt.

https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1155

keeps track of which tasks have been completed, but does NOT keep track of 
which attempt those tasks were completed in. As such, we have encountered a 
scenario where a particular task gets executed twice in different stage 
attempts, and the DAGScheduler does not consider if the second attempt is still 
running. This means if the first task attempt succeeded, the second attempt can 
be cancelled part-way through its run cycle if all other tasks (including the 
prior failed) are completed successfully.

What this means is that if a task is manipulating some state somewhere (for
example: an upload-to-temporary-file-location, then delete-then-move on an
underlying s3n storage implementation) the driver can improperly shut down the
running (2nd attempt) task between state manipulations, leaving the persistent
state in a bad state since the 2nd attempt never got to complete its
manipulations, and was terminated prematurely at some arbitrary point in its
state change logic (ex: finished the delete but not the move).

This is using the mesos coarse grained executor. It is unclear if this behavior 
is limited to the mesos coarse grained executor or not.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19697) NoSuchMethodError: org.apache.avro.Schema.getLogicalType()

2017-02-22 Thread Michael Heuer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15878884#comment-15878884
 ] 

Michael Heuer commented on SPARK-19697:
---

Sorry about all the description edits.  Thank you for linking this duplicate 
issue to a parent issue.


> NoSuchMethodError: org.apache.avro.Schema.getLogicalType()
> --
>
> Key: SPARK-19697
> URL: https://issues.apache.org/jira/browse/SPARK-19697
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Spark Core
>Affects Versions: 2.1.0
> Environment: Apache Spark 2.1.0, Scala version 2.11.8, Java 
> HotSpot(TM) 64-Bit Server VM, 1.8.0_60
>Reporter: Michael Heuer
>
> In a downstream project (https://github.com/bigdatagenomics/adam), adding a 
> dependency on parquet-avro version 1.8.2 results in NoSuchMethodError at
> runtime on various Spark versions, including 2.1.0.
> pom.xml:
> {code:xml}
>   <properties>
>     <java.version>1.8</java.version>
>     <avro.version>1.8.1</avro.version>
>     <scala.version>2.11.8</scala.version>
>     <scala.version.prefix>2.11</scala.version.prefix>
>     <spark.version>2.1.0</spark.version>
>     <parquet.version>1.8.2</parquet.version>
>   </properties>
>
>   <dependencies>
>     <dependency>
>       <groupId>org.apache.parquet</groupId>
>       <artifactId>parquet-avro</artifactId>
>       <version>${parquet.version}</version>
>     </dependency>
>   </dependencies>
> {code}
> Example using spark-submit (called via adam-submit below):
> {code}
> $ ./bin/adam-submit vcf2adam \
>   adam-core/src/test/resources/small.vcf \
>   small.adam
> ...
> java.lang.NoSuchMethodError: 
> org.apache.avro.Schema.getLogicalType()Lorg/apache/avro/LogicalType;
>   at 
> org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:178)
>   at 
> org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:214)
>   at 
> org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:171)
>   at 
> org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:130)
>   at 
> org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:227)
>   at 
> org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:124)
>   at 
> org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:152)
>   at 
> org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:214)
>   at 
> org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:171)
>   at 
> org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:130)
>   at 
> org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:227)
>   at 
> org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:124)
>   at 
> org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:115)
>   at org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:117)
>   at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:311)
>   at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:283)
>   at 
> org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1119)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1102)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> The issue can be reproduced from this pull request
> https://github.com/bigdatagenomics/adam/pull/1360
> and is reported as Jenkins CI test failures, e.g.
> https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1810
> d...@spark.apache.org mailing list archive thread
> http://apache-spark-developers-list.1001551.n3.nabble.com/Re-VOTE-Release-Apache-Parquet-1-8-2-RC1-tp20711p20720.html



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17280) Flaky test: org.apache.spark.streaming.kafka010.JavaKafkaRDDSuite and JavaDirectKafkaStreamSuite.testKafkaStream

2017-02-22 Thread Armin Braun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Armin Braun resolved SPARK-17280.
-
Resolution: Fixed

Closing this; I can't find any recent examples of this on Jenkins and haven't
experienced it locally lately either.
I also tried reproducing it by running 1k+ loops of all the Kafka 0.10 / Scala 2.11
tests with 3 forks in parallel, without issues.

> Flaky test: org.apache.spark.streaming.kafka010.JavaKafkaRDDSuite and 
> JavaDirectKafkaStreamSuite.testKafkaStream
> 
>
> Key: SPARK-17280
> URL: https://issues.apache.org/jira/browse/SPARK-17280
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Tests
>Reporter: Yin Huai
>
> https://spark-tests.appspot.com/builds/spark-master-test-maven-hadoop-2.2/1793
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.2/1793/
> {code}
> org.apache.spark.streaming.kafka010.JavaDirectKafkaStreamSuite.testKafkaStream
> Error Message
> assertion failed: Partition [topic1, 0] metadata not propagated after timeout
> Stacktrace
> java.util.concurrent.TimeoutException: assertion failed: Partition [topic1, 
> 0] metadata not propagated after timeout
>   at 
> org.apache.spark.streaming.kafka010.JavaDirectKafkaStreamSuite.createTopicAndSendData(JavaDirectKafkaStreamSuite.java:176)
>   at 
> org.apache.spark.streaming.kafka010.JavaDirectKafkaStreamSuite.testKafkaStream(JavaDirectKafkaStreamSuite.java:74)
> {code}
> {code}
> org.apache.spark.streaming.kafka010.JavaKafkaRDDSuite.testKafkaRDD
> Error Message
> Job aborted due to stage failure: Task 1 in stage 0.0 failed 1 times, most 
> recent failure: Lost task 1.0 in stage 0.0 (TID 1, localhost): 
> java.lang.AssertionError: assertion failed: Failed to get records for 
> spark-executor-java-test-consumer--363965267-1472280538438 topic2 0 0 after 
> polling for 512
>  at scala.Predef$.assert(Predef.scala:170)
>  at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:74)
>  at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227)
>  at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>  at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1684)
>  at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
>  at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
>  at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1910)
>  at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1910)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>  at org.apache.spark.scheduler.Task.run(Task.scala:86)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
> Stacktrace
> org.apache.spark.SparkException: 
> Job aborted due to stage failure: Task 1 in stage 0.0 failed 1 times, most 
> recent failure: Lost task 1.0 in stage 0.0 (TID 1, localhost): 
> java.lang.AssertionError: assertion failed: Failed to get records for 
> spark-executor-java-test-consumer--363965267-1472280538438 topic2 0 0 after 
> polling for 512
>   at scala.Predef$.assert(Predef.scala:170)
>   at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:74)
>   at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227)
>   at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1684)
>   at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
>   at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1910)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1910)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)

[jira] [Resolved] (SPARK-19697) NoSuchMethodError: org.apache.avro.Schema.getLogicalType()

2017-02-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19697.
---
Resolution: Duplicate

Yes, as you note, it's a version mismatch. Spark doesn't use Avro 1.8.
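
To confirm which Avro a given application classpath actually resolves, a minimal 
diagnostic along these lines can help (a sketch; the object name is illustrative and 
it only assumes Avro is on the classpath):
{code}
// Report where Avro's Schema class was loaded from and whether the Avro 1.8+
// method Schema.getLogicalType() is actually present on this classpath.
object AvroVersionCheck {
  def main(args: Array[String]): Unit = {
    val schemaClass = classOf[org.apache.avro.Schema]
    // The code source can be null for bundled/bootstrap classloaders, so guard it.
    val location = Option(schemaClass.getProtectionDomain.getCodeSource)
      .flatMap(cs => Option(cs.getLocation))
      .map(_.toString)
      .getOrElse("<unknown>")
    val hasLogicalType = schemaClass.getMethods.exists(_.getName == "getLogicalType")
    println(s"Avro Schema loaded from: $location")
    println(s"Schema.getLogicalType present: $hasLogicalType")
  }
}
{code}
If getLogicalType is missing, an older (pre-1.8) Avro is winning on the classpath, 
which matches the NoSuchMethodError reported here.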

> NoSuchMethodError: org.apache.avro.Schema.getLogicalType()
> --
>
> Key: SPARK-19697
> URL: https://issues.apache.org/jira/browse/SPARK-19697
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Spark Core
>Affects Versions: 2.1.0
> Environment: Apache Spark 2.1.0, Scala version 2.11.8, Java 
> HotSpot(TM) 64-Bit Server VM, 1.8.0_60
>Reporter: Michael Heuer
>
> In a downstream project (https://github.com/bigdatagenomics/adam), adding a 
> dependency on parquet-avro version 1.8.2 results in NoSuchMethodErrors at 
> runtime on various Spark versions, including 2.1.0.
> pom.xml:
> {code:xml}
>   
> 1.8
> 1.8.1
> 2.11.8
> 2.11
> 2.1.0
> 1.8.2
> 
>   
> 
>   
> org.apache.parquet
> parquet-avro
> ${parquet.version}
>   
> {code}
> Example using spark-submit (called via adam-submit below):
> {code}
> $ ./bin/adam-submit vcf2adam \
>   adam-core/src/test/resources/small.vcf \
>   small.adam
> ...
> java.lang.NoSuchMethodError: 
> org.apache.avro.Schema.getLogicalType()Lorg/apache/avro/LogicalType;
>   at 
> org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:178)
>   at 
> org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:214)
>   at 
> org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:171)
>   at 
> org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:130)
>   at 
> org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:227)
>   at 
> org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:124)
>   at 
> org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:152)
>   at 
> org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:214)
>   at 
> org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:171)
>   at 
> org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:130)
>   at 
> org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:227)
>   at 
> org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:124)
>   at 
> org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:115)
>   at org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:117)
>   at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:311)
>   at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:283)
>   at 
> org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1119)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1102)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> The issue can be reproduced from this pull request
> https://github.com/bigdatagenomics/adam/pull/1360
> and is reported as Jenkins CI test failures, e.g.
> https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1810
> d...@spark.apache.org mailing list archive thread
> http://apache-spark-developers-list.1001551.n3.nabble.com/Re-VOTE-Release-Apache-Parquet-1-8-2-RC1-tp20711p20720.html



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19697) NoSuchMethodError: org.apache.avro.Schema.getLogicalType()

2017-02-22 Thread Michael Heuer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Heuer updated SPARK-19697:
--
Description: 
In a downstream project (https://github.com/bigdatagenomics/adam), adding a 
dependency on parquet-avro version 1.8.2 results in NoSuchMethodErrors at 
runtime on various Spark versions, including 2.1.0.

pom.xml:
{code:xml}
  
1.8
1.8.1
2.11.8
2.11
2.1.0
1.8.2

  

  
org.apache.parquet
parquet-avro
${parquet.version}
  
{code}

Example using spark-submit (called via adam-submit below):
{code}
$ ./bin/adam-submit vcf2adam \
  adam-core/src/test/resources/small.vcf \
  small.adam
...
java.lang.NoSuchMethodError: 
org.apache.avro.Schema.getLogicalType()Lorg/apache/avro/LogicalType;
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:178)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:214)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:171)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:130)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:227)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:124)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:152)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:214)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:171)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:130)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:227)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:124)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:115)
  at org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:117)
  at 
org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:311)
  at 
org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:283)
  at 
org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
  at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1119)
  at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1102)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
  at org.apache.spark.scheduler.Task.run(Task.scala:99)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
  at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
{code}

The issue can be reproduced from this pull request
https://github.com/bigdatagenomics/adam/pull/1360

and is reported as Jenkins CI test failures, e.g.
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1810

d...@spark.apache.org mailing list archive thread
http://apache-spark-developers-list.1001551.n3.nabble.com/Re-VOTE-Release-Apache-Parquet-1-8-2-RC1-tp20711p20720.html

  was:
In a downstream project (https://github.com/bigdatagenomics/adam), adding a 
dependency on parquet-avro version 1.8.2 results in NoSuchMethodExceptions at 
runtime on various Spark versions, including 2.1.0.

pom.xml:
{code:xml}
  
1.8
1.8.1
2.11.8
2.11
2.1.0
1.8.2

  

  
org.apache.parquet
parquet-avro
${parquet.version}
  
{code}

Example using spark-submit (called via adam-submit below):
{code}
$ ./bin/adam-submit vcf2adam \
  adam-core/src/test/resources/small.vcf \
  small.adam
...
java.lang.NoSuchMethodError: 
org.apache.avro.Schema.getLogicalType()Lorg/apache/avro/LogicalType;
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:178)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:214)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:171)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:130)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:227)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:124)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:152)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:214)
  at 

[jira] [Updated] (SPARK-19697) NoSuchMethodError: org.apache.avro.Schema.getLogicalType()

2017-02-22 Thread Michael Heuer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Heuer updated SPARK-19697:
--
Description: 
In a downstream project (https://github.com/bigdatagenomics/adam), adding a 
dependency on parquet-avro version 1.8.2 results in NoSuchMethodExceptions at 
runtime on various Spark versions, including 2.1.0.

pom.xml:
{code:xml}
  
1.8
1.8.1
2.11.8
2.11
2.1.0
1.8.2

  

  
org.apache.parquet
parquet-avro
${parquet.version}
  
{code}

Example using spark-submit (called via adam-submit below):
{code}
$ ./bin/adam-submit vcf2adam \
  adam-core/src/test/resources/small.vcf \
  small.adam
...
java.lang.NoSuchMethodError: 
org.apache.avro.Schema.getLogicalType()Lorg/apache/avro/LogicalType;
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:178)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:214)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:171)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:130)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:227)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:124)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:152)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:214)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:171)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:130)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:227)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:124)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:115)
  at org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:117)
  at 
org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:311)
  at 
org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:283)
  at 
org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
  at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1119)
  at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1102)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
  at org.apache.spark.scheduler.Task.run(Task.scala:99)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
  at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
{code}

The issue can be reproduced from this pull request
https://github.com/bigdatagenomics/adam/pull/1360

and is reported as Jenkins CI test failures
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1810

d...@spark.apache.org mailing list archive thread
http://apache-spark-developers-list.1001551.n3.nabble.com/Re-VOTE-Release-Apache-Parquet-1-8-2-RC1-tp20711p20720.html

  was:
In a downstream project (https://github.com/bigdatagenomics/adam), adding a 
dependency on parquet-avro version 1.8.2 results in NoSuchMethodExceptions at 
runtime on various Spark versions, including 2.1.0.

pom.xml:
{code:xml}
  
1.8
1.8.1
2.11.8
2.11
2.1.0
1.8.2

  

  
org.apache.parquet
parquet-avro
${parquet.version}
  
{code}

Example using `spark-submit` (called via `adam-submit` below):
{code}
$ ./bin/adam-submit vcf2adam \
  adam-core/src/test/resources/small.vcf \
  small.adam
...
java.lang.NoSuchMethodError: 
org.apache.avro.Schema.getLogicalType()Lorg/apache/avro/LogicalType;
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:178)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:214)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:171)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:130)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:227)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:124)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:152)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:214)
  at 

[jira] [Updated] (SPARK-19697) NoSuchMethodError: org.apache.avro.Schema.getLogicalType()

2017-02-22 Thread Michael Heuer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Heuer updated SPARK-19697:
--
Description: 
In a downstream project (https://github.com/bigdatagenomics/adam), adding a 
dependency on parquet-avro version 1.8.2 results in NoSuchMethodExceptions at 
runtime on various Spark versions, including 2.1.0.

pom.xml:
{code:xml}
  
1.8
1.8.1
2.11.8
2.11
2.1.0
1.8.2

  

  
org.apache.parquet
parquet-avro
${parquet.version}
  
{code}

Example using `spark-submit` (called via `adam-submit` below):
{code}
$ ./bin/adam-submit vcf2adam \
  adam-core/src/test/resources/small.vcf \
  small.adam
...
java.lang.NoSuchMethodError: 
org.apache.avro.Schema.getLogicalType()Lorg/apache/avro/LogicalType;
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:178)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:214)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:171)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:130)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:227)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:124)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:152)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:214)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:171)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:130)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:227)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:124)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:115)
  at org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:117)
  at 
org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:311)
  at 
org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:283)
  at 
org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
  at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1119)
  at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1102)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
  at org.apache.spark.scheduler.Task.run(Task.scala:99)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
  at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
{code}

The issue can be reproduced from this pull request
https://github.com/bigdatagenomics/adam/pull/1360

and is reported as Jenkins CI test failures
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1810

  was:
In a downstream project (https://github.com/bigdatagenomics/adam), adding a 
dependency on {{parquet-avro}} version 1.8.2 results in 
{{NoSuchMethodException}}s at runtime on various Spark versions, including 
2.1.0.

pom.xml:
{code:xml}
  
1.8
1.8.1
2.11.8
2.11
2.1.0
1.8.2

  

  
org.apache.parquet
parquet-avro
${parquet.version}
  
{code}

Example using `spark-submit` (called via `adam-submit` below)
{code}
$ ./bin/adam-submit vcf2adam \
  adam-core/src/test/resources/small.vcf \
  small.adam
...
java.lang.NoSuchMethodError: 
org.apache.avro.Schema.getLogicalType()Lorg/apache/avro/LogicalType;
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:178)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:214)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:171)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:130)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:227)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:124)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:152)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:214)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:171)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:130)
  at 

[jira] [Updated] (SPARK-19697) NoSuchMethodError: org.apache.avro.Schema.getLogicalType()

2017-02-22 Thread Michael Heuer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Heuer updated SPARK-19697:
--
Description: 
In a downstream project (https://github.com/bigdatagenomics/adam), adding a 
dependency on `parquet-avro` version 1.8.2 results in `NoSuchMethodException`s 
at runtime on various Spark versions, including 2.1.0.

pom.xml:
{code:xml}
  
1.8
1.8.1
2.11.8
2.11
2.1.0
1.8.2

  

  
org.apache.parquet
parquet-avro
${parquet.version}
  
{code}

Example using `spark-submit` (called via `adam-submit` below)
{code}
$ ./bin/adam-submit vcf2adam \
  adam-core/src/test/resources/small.vcf \
  small.adam
...
java.lang.NoSuchMethodError: 
org.apache.avro.Schema.getLogicalType()Lorg/apache/avro/LogicalType;
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:178)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:214)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:171)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:130)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:227)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:124)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:152)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:214)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:171)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:130)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:227)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:124)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:115)
  at org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:117)
  at 
org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:311)
  at 
org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:283)
  at 
org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
  at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1119)
  at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1102)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
  at org.apache.spark.scheduler.Task.run(Task.scala:99)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
  at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
{code}

The issue can be reproduced from this pull request
https://github.com/bigdatagenomics/adam/pull/1360

and is reported as Jenkins CI test failures
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1810

  was:
In a downstream project (https://github.com/bigdatagenomics/adam), adding a 
dependency on `parquet-avro` version 1.8.2 results in `NoSuchMethodException`s 
at runtime on various Spark versions, including 2.1.0.

pom.xml:
{code:xml}
  

  
org.apache.parquet
parquet-avro
${parquet.version}
  
  
org.apache.parquet

parquet-scala_2.10
${parquet.version}

  
org.scala-lang
scala-library
  

  
{code}

Example using `spark-submit` (called via `adam-submit` below)
{code}
$ ./bin/adam-submit vcf2adam \
  adam-core/src/test/resources/small.vcf \
  small.adam
...
java.lang.NoSuchMethodError: 
org.apache.avro.Schema.getLogicalType()Lorg/apache/avro/LogicalType;
at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:178)
at 
org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:214)
at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:171)
at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:130)
at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:227)
at 
org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:124)
at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:152)
at 
org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:214)
at 

[jira] [Updated] (SPARK-19697) NoSuchMethodError: org.apache.avro.Schema.getLogicalType()

2017-02-22 Thread Michael Heuer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Heuer updated SPARK-19697:
--
Description: 
In a downstream project (https://github.com/bigdatagenomics/adam), adding a 
dependency on {{parquet-avro}} version 1.8.2 results in 
{{NoSuchMethodException}}s at runtime on various Spark versions, including 
2.1.0.

pom.xml:
{code:xml}
  
1.8
1.8.1
2.11.8
2.11
2.1.0
1.8.2

  

  
org.apache.parquet
parquet-avro
${parquet.version}
  
{code}

Example using `spark-submit` (called via `adam-submit` below)
{code}
$ ./bin/adam-submit vcf2adam \
  adam-core/src/test/resources/small.vcf \
  small.adam
...
java.lang.NoSuchMethodError: 
org.apache.avro.Schema.getLogicalType()Lorg/apache/avro/LogicalType;
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:178)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:214)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:171)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:130)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:227)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:124)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:152)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:214)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:171)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:130)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:227)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:124)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:115)
  at org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:117)
  at 
org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:311)
  at 
org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:283)
  at 
org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
  at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1119)
  at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1102)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
  at org.apache.spark.scheduler.Task.run(Task.scala:99)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
  at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
{code}

The issue can be reproduced from this pull request
https://github.com/bigdatagenomics/adam/pull/1360

and is reported as Jenkins CI test failures
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1810

  was:
In a downstream project (https://github.com/bigdatagenomics/adam), adding a 
dependency on `parquet-avro` version 1.8.2 results in `NoSuchMethodException`s 
at runtime on various Spark versions, including 2.1.0.

pom.xml:
{code:xml}
  
1.8
1.8.1
2.11.8
2.11
2.1.0
1.8.2

  

  
org.apache.parquet
parquet-avro
${parquet.version}
  
{code}

Example using `spark-submit` (called via `adam-submit` below)
{code}
$ ./bin/adam-submit vcf2adam \
  adam-core/src/test/resources/small.vcf \
  small.adam
...
java.lang.NoSuchMethodError: 
org.apache.avro.Schema.getLogicalType()Lorg/apache/avro/LogicalType;
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:178)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:214)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:171)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:130)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:227)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:124)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:152)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:214)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:171)
  at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:130)
  at 

[jira] [Updated] (SPARK-19697) NoSuchMethodError: org.apache.avro.Schema.getLogicalType()

2017-02-22 Thread Michael Heuer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Heuer updated SPARK-19697:
--
Description: 
In a downstream project (https://github.com/bigdatagenomics/adam), adding a 
dependency on `parquet-avro` version 1.8.2 results in `NoSuchMethodException`s 
at runtime on various Spark versions, including 2.1.0.

pom.xml:
{code:xml}
  

  
org.apache.parquet
parquet-avro
${parquet.version}
  
  
org.apache.parquet

parquet-scala_2.10
${parquet.version}

  
org.scala-lang
scala-library
  

  
{code}

Example using `spark-submit` (called via `adam-submit` below)
{code}
$ ./bin/adam-submit vcf2adam \
  adam-core/src/test/resources/small.vcf \
  small.adam
...
java.lang.NoSuchMethodError: 
org.apache.avro.Schema.getLogicalType()Lorg/apache/avro/LogicalType;
at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:178)
at 
org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:214)
at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:171)
at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:130)
at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:227)
at 
org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:124)
at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:152)
at 
org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:214)
at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:171)
at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:130)
at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:227)
at 
org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:124)
at 
org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:115)
at 
org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:117)
at 
org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:311)
at 
org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:283)
at 
org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1119)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1102)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}

The issue can be reproduced from this pull request
https://github.com/bigdatagenomics/adam/pull/1360

and is reported as Jenkins CI test failures
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1810

  was:
In a downstream project (https://github.com/bigdatagenomics/adam), adding a 
dependency on `parquet-avro` version 1.8.2 results in `NoSuchMethodException`s 
at runtime on various Spark versions, including 2.1.0.

pom.xml:
{{
  

  
org.apache.parquet
parquet-avro
${parquet.version}
  
  
org.apache.parquet

parquet-scala_2.10
${parquet.version}

  
org.scala-lang
scala-library
  

  
}}

Example using `spark-submit` (called via `adam-submit` below)
{{
$ ./bin/adam-submit vcf2adam \
  adam-core/src/test/resources/small.vcf \
  small.adam
...
java.lang.NoSuchMethodError: 
org.apache.avro.Schema.getLogicalType()Lorg/apache/avro/LogicalType;
at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:178)
at 
org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:214)
at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:171)
at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:130)
at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:227)
at 

[jira] [Updated] (SPARK-19697) NoSuchMethodError: org.apache.avro.Schema.getLogicalType()

2017-02-22 Thread Michael Heuer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Heuer updated SPARK-19697:
--
Environment: Apache Spark 2.1.0, Scala version 2.11.8, Java HotSpot(TM) 
64-Bit Server VM, 1.8.0_60  (was: {{
$ spark-submit --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/

Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_60
Branch 
Compiled by user jenkins on 2016-12-16T02:04:48Z
Revision 
Url 
Type --help for more information.
}}

)

> NoSuchMethodError: org.apache.avro.Schema.getLogicalType()
> --
>
> Key: SPARK-19697
> URL: https://issues.apache.org/jira/browse/SPARK-19697
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Spark Core
>Affects Versions: 2.1.0
> Environment: Apache Spark 2.1.0, Scala version 2.11.8, Java 
> HotSpot(TM) 64-Bit Server VM, 1.8.0_60
>Reporter: Michael Heuer
>
> In a downstream project (https://github.com/bigdatagenomics/adam), adding a 
> dependency on `parquet-avro` version 1.8.2 results in 
> `NoSuchMethodException`s at runtime on various Spark versions, including 
> 2.1.0.
> pom.xml:
> {{
>   
> 
>   
> org.apache.parquet
> parquet-avro
> ${parquet.version}
>   
>   
> org.apache.parquet
> 
> parquet-scala_2.10
> ${parquet.version}
> 
>   
> org.scala-lang
> scala-library
>   
> 
>   
> }}
> Example using `spark-submit` (called via `adam-submit` below)
> {{
> $ ./bin/adam-submit vcf2adam \
>   adam-core/src/test/resources/small.vcf \
>   small.adam
> ...
> java.lang.NoSuchMethodError: 
> org.apache.avro.Schema.getLogicalType()Lorg/apache/avro/LogicalType;
>   at 
> org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:178)
>   at 
> org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:214)
>   at 
> org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:171)
>   at 
> org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:130)
>   at 
> org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:227)
>   at 
> org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:124)
>   at 
> org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:152)
>   at 
> org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:214)
>   at 
> org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:171)
>   at 
> org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:130)
>   at 
> org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:227)
>   at 
> org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:124)
>   at 
> org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:115)
>   at 
> org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:117)
>   at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:311)
>   at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:283)
>   at 
> org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1119)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1102)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> }}
> The issue can be reproduced from this pull request
> https://github.com/bigdatagenomics/adam/pull/1360
> and is reported as Jenkins CI test failures
> https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1810



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19697) NoSuchMethodError: org.apache.avro.Schema.getLogicalType()

2017-02-22 Thread Michael Heuer (JIRA)
Michael Heuer created SPARK-19697:
-

 Summary: NoSuchMethodError: org.apache.avro.Schema.getLogicalType()
 Key: SPARK-19697
 URL: https://issues.apache.org/jira/browse/SPARK-19697
 Project: Spark
  Issue Type: Bug
  Components: Build, Spark Core
Affects Versions: 2.1.0
 Environment: {{
$ spark-submit --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/

Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_60
Branch 
Compiled by user jenkins on 2016-12-16T02:04:48Z
Revision 
Url 
Type --help for more information.
}}


Reporter: Michael Heuer


In a downstream project (https://github.com/bigdatagenomics/adam), adding a 
dependency on `parquet-avro` version 1.8.2 results in `NoSuchMethodException`s 
at runtime on various Spark versions, including 2.1.0.

pom.xml:
{{
  

  
org.apache.parquet
parquet-avro
${parquet.version}
  
  
org.apache.parquet

parquet-scala_2.10
${parquet.version}

  
org.scala-lang
scala-library
  

  
}}

Example using `spark-submit` (called via `adam-submit` below)
{{
$ ./bin/adam-submit vcf2adam \
  adam-core/src/test/resources/small.vcf \
  small.adam
...
java.lang.NoSuchMethodError: 
org.apache.avro.Schema.getLogicalType()Lorg/apache/avro/LogicalType;
at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:178)
at 
org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:214)
at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:171)
at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:130)
at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:227)
at 
org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:124)
at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:152)
at 
org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:214)
at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:171)
at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:130)
at 
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:227)
at 
org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:124)
at 
org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:115)
at 
org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:117)
at 
org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:311)
at 
org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:283)
at 
org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1119)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1102)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
}}

The issue can be reproduced from this pull request
https://github.com/bigdatagenomics/adam/pull/1360

and is reported as Jenkins CI test failures
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1810



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19405) Add support to KinesisUtils for cross-account Kinesis reads via STS

2017-02-22 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz resolved SPARK-19405.
-
   Resolution: Fixed
 Assignee: Adam Budde
Fix Version/s: 2.2.0

Resolved with: https://github.com/apache/spark/pull/16744

> Add support to KinesisUtils for cross-account Kinesis reads via STS
> ---
>
> Key: SPARK-19405
> URL: https://issues.apache.org/jira/browse/SPARK-19405
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Reporter: Adam Budde
>Assignee: Adam Budde
>Priority: Minor
> Fix For: 2.2.0
>
>
> h1. Summary
> Enable KinesisReceiver to utilize STSAssumeRoleSessionCredentialsProvider 
> when setting up the Kinesis Client Library in order to enable secure 
> cross-account Kinesis stream reads managed by AWS Security Token Service (STS)
> h1. Details
> Spark's KinesisReceiver implementation utilizes the Kinesis Client Library in 
> order to allow users to write Spark Streaming jobs that operate on Kinesis 
> data. The KCL uses a few AWS services under the hood in order to provide 
> checkpointed, load-balanced processing of the underlying data in a Kinesis 
> stream.  Running the KCL requires permissions to be set up for the following 
> AWS resources.
> * AWS Kinesis for reading stream data
> * AWS DynamoDB for storing KCL shared state in tables
> * AWS CloudWatch for logging KCL metrics
> The KinesisUtils.createStream() API allows users to authenticate to these 
> services either by specifying an explicit AWS access key/secret key 
> credential pair or by using the default credential provider chain. This 
> supports authorizing to the three AWS services using either an AWS keypair 
> (provided explicitly or parsed from environment variables, etc.):
> !https://raw.githubusercontent.com/budde/budde_asf_jira_images/master/spark/kinesis_sts_support/KeypairOnly.png!
> Or the IAM instance profile (when running on EC2):
> !https://raw.githubusercontent.com/budde/budde_asf_jira_images/master/spark/kinesis_sts_support/InstanceProfileOnly.png!
> AWS users often need to access resources across separate accounts. This could 
> be done in order to consume data produced by another organization or from a 
> service running in another account for resource isolation purposes. AWS 
> Security Token Service (STS) provides a secure way to authorize cross-account 
> resource access by using temporary sessions to assume an IAM role in the 
> AWS account with the resources being accessed.
> The [IAM 
> documentation|http://docs.aws.amazon.com/IAM/latest/UserGuide/tutorial_cross-account-with-roles.html]
>  covers the specifics of how cross account IAM role assumption works in much 
> greater detail, but if an actor in account A wanted to read from a Kinesis 
> stream in account B the general steps required would look something like this:
> * An IAM role is added to account B with read permissions for the Kinesis 
> stream
> ** Trust policy is configured to allow account A to assume the role 
> * Actor in account A uses its own long-lived credentials to tell STS to 
> assume the role in account B
> * STS returns temporary credentials with permission to read from the stream 
> in account B
> Applied to KinesisReceiver and the KCL, we could use a keypair as our 
> long-lived credentials to authenticate to STS and assume an external role 
> with the necessary KCL permissions:
> !https://raw.githubusercontent.com/budde/budde_asf_jira_images/master/spark/kinesis_sts_support/STSKeypair.png!
> Or the instance profile as long-lived credentials:
> !https://raw.githubusercontent.com/budde/budde_asf_jira_images/master/spark/kinesis_sts_support/STSInstanceProfile.png!
> The STSAssumeRoleSessionCredentialsProvider implementation of the 
> AWSCredentialsProvider interface from the AWS SDK abstracts all of the 
> management of the temporary session credentials away from the user. 
> STSAssumeRoleSessionCredentialsProvider simply needs the ARN of the AWS role 
> to be assumed, a session name for STS labeling purposes, an optional session 
> external ID and long-lived credentials to use for authenticating with the STS 
> service itself.
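> For illustration, a minimal Scala sketch of constructing such a provider (the role 
> ARN, session name, and external ID below are placeholders, and the AWS SDK's 
> Builder API is assumed):
> {code}
> import com.amazonaws.auth.{AWSCredentialsProvider, STSAssumeRoleSessionCredentialsProvider}
>
> // The STS call itself authenticates with the default provider chain (a keypair from
> // the environment, or the EC2 instance profile), matching the diagrams above; the
> // resulting provider is what the KCL setup would be handed.
> val kclCredentials: AWSCredentialsProvider =
>   new STSAssumeRoleSessionCredentialsProvider.Builder(
>       "arn:aws:iam::123456789012:role/external-kinesis-read",  // placeholder role ARN in account B
>       "spark-kinesis-session")                                 // placeholder STS session name
>     .withExternalId("example-external-id")                     // optional external ID check
>     .build()
> {code}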
> Supporting cross-account Kinesis access via STS requires supplying the 
> following additional configuration parameters:
> * ARN of IAM role to assume in external account
> * A name to apply to the STS session
> * (optional) An IAM external ID to validate the assumed role against
> The STSAssumeRoleSessionCredentialsProvider implementation of the 
> AWSCredentialsProvider interface takes these parameters as input and 
> abstracts away all of the lifecycle management for the temporary session 
> credentials. Ideally, users could simply supply an AWSCredentialsProvider 
> instance as an argument 

[jira] [Comment Edited] (SPARK-19680) Offsets out of range with no configured reset policy for partitions

2017-02-22 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15878623#comment-15878623
 ] 

Cody Koeninger edited comment on SPARK-19680 at 2/22/17 4:25 PM:
-

The issue here is likely that you have lost data (because of retention 
expiration) between the time the batch was defined on the driver, and the time 
the executor attempted to process the batch.  Having executor consumers obey 
auto offset reset would result in silent data loss, which is a bad thing.

There's a more detailed description of the semantic issues around this for 
kafka in KAFKA-3370 and for structured streaming kafka in SPARK-17937

If you've got really aggressive retention settings and are having trouble 
getting a stream started, look at specifying earliest + some margin on startup 
as a workaround.  If you're having this trouble after a stream has been running 
for a while, you need more retention or smaller batches.
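
For the startup case, a rough sketch of that workaround with the 
spark-streaming-kafka-0-10 direct stream (topic name, types, and the size of the 
margin are illustrative; the probe consumer reuses the same kafkaParams):
{code}
import scala.collection.JavaConverters._

import org.apache.kafka.clients.consumer.{ConsumerRecord, KafkaConsumer}
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

// Ask the brokers where each partition currently begins, add a safety margin, and
// Assign explicit starting offsets so the first batches never reference data that
// may expire between batch definition on the driver and execution on the executors.
// The margin should stay well below the partition's backlog, or it will overshoot.
def streamFromEarliestPlusMargin(
    ssc: StreamingContext,
    kafkaParams: Map[String, Object],
    topic: String,
    margin: Long): InputDStream[ConsumerRecord[String, Array[Byte]]] = {

  val probe = new KafkaConsumer[String, Array[Byte]](kafkaParams.asJava)
  val partitions = probe.partitionsFor(topic).asScala
    .map(pi => new TopicPartition(topic, pi.partition))
  probe.assign(partitions.asJava)
  probe.seekToBeginning(partitions.asJava)
  val fromOffsets = partitions.map(tp => tp -> (probe.position(tp) + margin)).toMap
  probe.close()

  KafkaUtils.createDirectStream[String, Array[Byte]](
    ssc,
    LocationStrategies.PreferConsistent,
    ConsumerStrategies.Assign[String, Array[Byte]](partitions, kafkaParams, fromOffsets))
}
{code}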




was (Author: c...@koeninger.org):
The issue here is likely that you have lost data (because of retention 
expiration) between the time the batch was defined on the driver, and the time 
the executor attempted to process the batch.  Having executor consumers obey 
auto offset reset would result in silent data loss, which is a bad thing.

There's a more detailed description of the semantic issues around this for 
kafka in KAFKA-3370 and for structured streaming kafka in SPARK-17937



> Offsets out of range with no configured reset policy for partitions
> ---
>
> Key: SPARK-19680
> URL: https://issues.apache.org/jira/browse/SPARK-19680
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.1.0
>Reporter: Schakmann Rene
>
> I'm using Spark Streaming with Kafka to create a top list. I want to read all 
> the messages in Kafka, so I set
>    "auto.offset.reset" -> "earliest"
> Nevertheless, when I start the job on our Spark cluster it does not work; I 
> get:
> Error:
> {code:title=error.log|borderStyle=solid}
>   Job aborted due to stage failure: Task 2 in stage 111.0 failed 4 times, 
> most recent failure: Lost task 2.3 in stage 111.0 (TID 1270, 194.232.55.23, 
> executor 2): org.apache.kafka.clients.consumer.OffsetOutOfRangeException: 
> Offsets out of range with no configured reset policy for partitions: 
> {SearchEvents-2=161803385}
> {code}
> This seems wrong, because I did set the auto.offset.reset property.
> Setup:
> Kafka Parameter:
> {code:title=Config.Scala|borderStyle=solid}
>   def getDefaultKafkaReceiverParameter(properties: Properties):Map[String, 
> Object] = {
> Map(
>   "bootstrap.servers" -> 
> properties.getProperty("kafka.bootstrap.servers"),
>   "group.id" -> properties.getProperty("kafka.consumer.group"),
>   "auto.offset.reset" -> "earliest",
>   "spark.streaming.kafka.consumer.cache.enabled" -> "false",
>   "enable.auto.commit" -> "false",
>   "key.deserializer" -> classOf[StringDeserializer],
>   "value.deserializer" -> "at.willhaben.sid.DTOByteDeserializer")
>   }
> {code}
> Job:
> {code:title=Job.Scala|borderStyle=solid}
>   def processSearchKeyWords(stream: InputDStream[ConsumerRecord[String, 
> Array[Byte]]], windowDuration: Int, slideDuration: Int, kafkaSink: 
> Broadcast[KafkaSink[TopList]]): Unit = {
> getFilteredStream(stream.map(_.value()), windowDuration, 
> slideDuration).foreachRDD(rdd => {
>   val topList = new TopList
>   topList.setCreated(new Date())
>   topList.setTopListEntryList(rdd.take(TopListLength).toList)
>   CurrentLogger.info("TopList length: " + 
> topList.getTopListEntryList.size().toString)
>   kafkaSink.value.send(SendToTopicName, topList)
>   CurrentLogger.info("Last Run: " + System.currentTimeMillis())
> })
>   }
>   def getFilteredStream(result: DStream[Array[Byte]], windowDuration: Int, 
> slideDuration: Int): DStream[TopListEntry] = {
> val Mapper = MapperObject.readerFor[SearchEventDTO]
> result.repartition(100).map(s => Mapper.readValue[SearchEventDTO](s))
>   .filter(s => s != null && s.getSearchRequest != null && 
> s.getSearchRequest.getSearchParameters != null && s.getVertical == 
> Vertical.BAP && 
> s.getSearchRequest.getSearchParameters.containsKey(EspParameterEnum.KEYWORD.getName))
>   .map(row => {
> val name = 
> row.getSearchRequest.getSearchParameters.get(EspParameterEnum.KEYWORD.getName).getEspSearchParameterDTO.getValue.toLowerCase()
> (name, new TopListEntry(name, 1, row.getResultCount))
>   })
>   .reduceByKeyAndWindow(
> (a: TopListEntry, b: TopListEntry) => new TopListEntry(a.getKeyword, 
> a.getSearchCount + b.getSearchCount, a.getMeanSearchHits + 
> b.getMeanSearchHits),
> (a: TopListEntry, b: TopListEntry) => new 

[jira] [Commented] (SPARK-19680) Offsets out of range with no configured reset policy for partitions

2017-02-22 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15878623#comment-15878623
 ] 

Cody Koeninger commented on SPARK-19680:


The issue here is likely that you have lost data (because of retention 
expiration) between the time the batch was defined on the driver, and the time 
the executor attempted to process the batch.  Having executor consumers obey 
auto offset reset would result in silent data loss, which is a bad thing.

There's a more detailed description of the semantic issues around this for 
kafka in KAFKA-3370 and for structured streaming kafka in SPARK-17937



> Offsets out of range with no configured reset policy for partitions
> ---
>
> Key: SPARK-19680
> URL: https://issues.apache.org/jira/browse/SPARK-19680
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.1.0
>Reporter: Schakmann Rene
>
> I'm using Spark Streaming with Kafka to create a top list. I want to read all 
> the messages in Kafka, so I set
>    "auto.offset.reset" -> "earliest"
> Nevertheless, when I start the job on our Spark cluster it does not work; I 
> get:
> Error:
> {code:title=error.log|borderStyle=solid}
>   Job aborted due to stage failure: Task 2 in stage 111.0 failed 4 times, 
> most recent failure: Lost task 2.3 in stage 111.0 (TID 1270, 194.232.55.23, 
> executor 2): org.apache.kafka.clients.consumer.OffsetOutOfRangeException: 
> Offsets out of range with no configured reset policy for partitions: 
> {SearchEvents-2=161803385}
> {code}
> This seems wrong, because I did set the auto.offset.reset property.
> Setup:
> Kafka Parameter:
> {code:title=Config.Scala|borderStyle=solid}
>   def getDefaultKafkaReceiverParameter(properties: Properties):Map[String, 
> Object] = {
> Map(
>   "bootstrap.servers" -> 
> properties.getProperty("kafka.bootstrap.servers"),
>   "group.id" -> properties.getProperty("kafka.consumer.group"),
>   "auto.offset.reset" -> "earliest",
>   "spark.streaming.kafka.consumer.cache.enabled" -> "false",
>   "enable.auto.commit" -> "false",
>   "key.deserializer" -> classOf[StringDeserializer],
>   "value.deserializer" -> "at.willhaben.sid.DTOByteDeserializer")
>   }
> {code}
> Job:
> {code:title=Job.Scala|borderStyle=solid}
>   def processSearchKeyWords(stream: InputDStream[ConsumerRecord[String, 
> Array[Byte]]], windowDuration: Int, slideDuration: Int, kafkaSink: 
> Broadcast[KafkaSink[TopList]]): Unit = {
> getFilteredStream(stream.map(_.value()), windowDuration, 
> slideDuration).foreachRDD(rdd => {
>   val topList = new TopList
>   topList.setCreated(new Date())
>   topList.setTopListEntryList(rdd.take(TopListLength).toList)
>   CurrentLogger.info("TopList length: " + 
> topList.getTopListEntryList.size().toString)
>   kafkaSink.value.send(SendToTopicName, topList)
>   CurrentLogger.info("Last Run: " + System.currentTimeMillis())
> })
>   }
>   def getFilteredStream(result: DStream[Array[Byte]], windowDuration: Int, 
> slideDuration: Int): DStream[TopListEntry] = {
> val Mapper = MapperObject.readerFor[SearchEventDTO]
> result.repartition(100).map(s => Mapper.readValue[SearchEventDTO](s))
>   .filter(s => s != null && s.getSearchRequest != null && 
> s.getSearchRequest.getSearchParameters != null && s.getVertical == 
> Vertical.BAP && 
> s.getSearchRequest.getSearchParameters.containsKey(EspParameterEnum.KEYWORD.getName))
>   .map(row => {
> val name = 
> row.getSearchRequest.getSearchParameters.get(EspParameterEnum.KEYWORD.getName).getEspSearchParameterDTO.getValue.toLowerCase()
> (name, new TopListEntry(name, 1, row.getResultCount))
>   })
>   .reduceByKeyAndWindow(
> (a: TopListEntry, b: TopListEntry) => new TopListEntry(a.getKeyword, 
> a.getSearchCount + b.getSearchCount, a.getMeanSearchHits + 
> b.getMeanSearchHits),
> (a: TopListEntry, b: TopListEntry) => new TopListEntry(a.getKeyword, 
> a.getSearchCount - b.getSearchCount, a.getMeanSearchHits - 
> b.getMeanSearchHits),
> Minutes(windowDuration),
> Seconds(slideDuration))
>   .filter((x: (String, TopListEntry)) => x._2.getSearchCount > 200L)
>   .map(row => (row._2.getSearchCount, row._2))
>   .transform(rdd => rdd.sortByKey(ascending = false))
>   .map(row => new TopListEntry(row._2.getKeyword, row._2.getSearchCount, 
> row._2.getMeanSearchHits / row._2.getSearchCount))
>   }
>   def main(properties: Properties): Unit = {
> val sparkSession = SparkUtil.getDefaultSparkSession(properties, TaskName)
> val kafkaSink = 
> sparkSession.sparkContext.broadcast(KafkaSinkUtil.apply[TopList](SparkUtil.getDefaultSparkProperties(properties)))
> val kafkaParams: Map[String, Object] = 
> 

[jira] [Commented] (SPARK-19687) Does SPARK supports for Postgres JSONB data type to store JSON data, if yes, kindly please help us with any examples.

2017-02-22 Thread Praveen Tallapudi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15878613#comment-15878613
 ] 

Praveen Tallapudi commented on SPARK-19687:
---

I tried but I got a delivery failure, sorry to trouble you. I really need to 
know whether a solution exists or not, that's why.  :-)

> Does SPARK supports for Postgres JSONB data type to store JSON data, if yes, 
> kindly please help us with any examples.
> -
>
> Key: SPARK-19687
> URL: https://issues.apache.org/jira/browse/SPARK-19687
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Praveen Tallapudi
>
> Dear Team,
> I am a little new to Scala development and trying to find a solution for the 
> issue below. Please forgive me if this is not the correct place to post this 
> question.
> I am trying to insert data from a DataFrame into a Postgres table.
>  
> Dataframe Schema:
> root
> |-- ID: string (nullable = true)
> |-- evtInfo: struct (nullable = true)
> ||-- @date: string (nullable = true)
> ||-- @time: string (nullable = true)
> ||-- @timeID: string (nullable = true)
> ||-- TranCode: string (nullable = true)
> ||-- custName: string (nullable = true)
> ||-- evtInfo: array (nullable = true)
> |||-- element: string (containsNull = true)
> ||-- Type: string (nullable = true)
> ||-- opID: string (nullable = true)
> ||-- tracNbr: string (nullable = true)
>  
>  
> DataBase Table Schema:
> CREATE TABLE public.test
> (
>id bigint NOT NULL,   
>evtInfo jsonb NOT NULL,
>evt_val bigint NOT NULL
> )
>  
> When I use dataFrame_toSave.write.mode(SaveMode.Append).jdbc(dbUrl, 
> "public.test", dbPropForDFtoSave) to save the data, I am seeing the below 
> error.
>  
> Exception in thread "main" java.lang.IllegalArgumentException: Can't get JDBC 
> type for 
> struct<@dateEvt:string,@timeEvt:string,@timeID:string,CICSTranCode:string,custName:string,evtInfo:array<string>,evtType:string,operID:string,trackingNbr:string>
>  
> Can you please suggest the best approach to save the DataFrame into the 
> Postgres JSONB table?
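
One possible workaround, sketched here under the assumption that writing the 
nested column as a JSON string is acceptable: serialize the struct with 
to_json and let the Postgres JDBC driver hand the string to the server, which 
can cast it into the jsonb column when stringtype=unspecified is added to the 
connection URL. The URL and credentials below are placeholders; 
dataFrame_toSave and public.test are taken from the report.

{code}
import java.util.Properties
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.to_json

// Convert the nested 'evtInfo' struct to a JSON string before writing.
val toSave = dataFrame_toSave.withColumn("evtInfo", to_json(dataFrame_toSave("evtInfo")))

val props = new Properties()
props.setProperty("user", "postgres")      // placeholder credentials
props.setProperty("password", "secret")

// stringtype=unspecified lets the Postgres server cast the string into the jsonb column.
val jdbcUrl = "jdbc:postgresql://host:5432/db?stringtype=unspecified"

toSave.write.mode(SaveMode.Append).jdbc(jdbcUrl, "public.test", props)
{code}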



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16625) Oracle JDBC table creation fails with ORA-00902: invalid datatype

2017-02-22 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15878604#comment-15878604
 ] 

Sean Owen commented on SPARK-16625:
---

Does anyone have a view on whether this is OK to back-port to 2.0.x or 1.6.x? 

> Oracle JDBC table creation fails with ORA-00902: invalid datatype
> -
>
> Key: SPARK-16625
> URL: https://issues.apache.org/jira/browse/SPARK-16625
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: Daniel Darabos
>Assignee: Yuming Wang
> Fix For: 2.1.0
>
>
> Unfortunately I know very little about databases, but I figure this is a bug.
> I have a DataFrame with the following schema: 
> {noformat}
> StructType(StructField(dst,StringType,true), StructField(id,LongType,true), 
> StructField(src,StringType,true))
> {noformat}
> I am trying to write it to an Oracle database like this:
> {code:java}
> String url = "jdbc:oracle:thin:root/rootroot@:1521:db";
> java.util.Properties p = new java.util.Properties();
> p.setProperty("driver", "oracle.jdbc.OracleDriver");
> df.write().mode("overwrite").jdbc(url, "my_table", p);
> {code}
> And I get:
> {noformat}
> Exception in thread "main" java.sql.SQLSyntaxErrorException: ORA-00902: 
> invalid datatype
>   at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:461)
>   at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:402)
>   at oracle.jdbc.driver.T4C8Oall.processError(T4C8Oall.java:1108)
>   at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:541)
>   at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:264)
>   at oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:598)
>   at oracle.jdbc.driver.T4CStatement.doOall8(T4CStatement.java:213)
>   at oracle.jdbc.driver.T4CStatement.doOall8(T4CStatement.java:26)
>   at 
> oracle.jdbc.driver.T4CStatement.executeForRows(T4CStatement.java:1241)
>   at 
> oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:1558)
>   at 
> oracle.jdbc.driver.OracleStatement.executeUpdateInternal(OracleStatement.java:2498)
>   at 
> oracle.jdbc.driver.OracleStatement.executeUpdate(OracleStatement.java:2431)
>   at 
> oracle.jdbc.driver.OracleStatementWrapper.executeUpdate(OracleStatementWrapper.java:975)
>   at org.apache.spark.sql.DataFrameWriter.jdbc(DataFrameWriter.scala:302)
> {noformat}
> The Oracle server I am running against is the one I get on Amazon RDS for 
> engine type {{oracle-se}}. The same code (with the right driver) against the 
> RDS instance with engine type {{MySQL}} works.
> The error message is the same as in 
> https://issues.apache.org/jira/browse/SPARK-12941. Could it be that {{Long}} 
> is also translated into the wrong data type? Thanks.
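
As a point of reference, a rough sketch of a workaround on versions without the 
fix (the mapping below is in the spirit of the change that went into 2.1.0, but 
is not guaranteed to match it): register a custom JdbcDialect that maps Spark 
types Oracle rejects, such as BIGINT for LongType, to Oracle-native types, then 
call df.write().jdbc(...) as in the report.

{code}
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types.{DataType, LongType, StringType}

// Sketch only: map LongType/StringType to types Oracle accepts in CREATE TABLE.
val oracleWorkaround = new JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:oracle")
  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case LongType   => Some(JdbcType("NUMBER(19)", Types.NUMERIC))
    case StringType => Some(JdbcType("VARCHAR2(255)", Types.VARCHAR))
    case _          => None
  }
}
JdbcDialects.registerDialect(oracleWorkaround)
{code}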



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19687) Does SPARK supports for Postgres JSONB data type to store JSON data, if yes, kindly please help us with any examples.

2017-02-22 Thread Praveen Tallapudi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15878565#comment-15878565
 ] 

Praveen Tallapudi commented on SPARK-19687:
---

I tried sending it to u...@spark.apache.org, but it failed and I got a failure 
notice. :-)


> Does SPARK supports for Postgres JSONB data type to store JSON data, if yes, 
> kindly please help us with any examples.
> -
>
> Key: SPARK-19687
> URL: https://issues.apache.org/jira/browse/SPARK-19687
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Praveen Tallapudi
>
> Dear Team,
> I am a little new to Scala development and trying to find a solution for the 
> issue below. Please forgive me if this is not the correct place to post this 
> question.
> I am trying to insert data from a DataFrame into a Postgres table.
>  
> Dataframe Schema:
> root
> |-- ID: string (nullable = true)
> |-- evtInfo: struct (nullable = true)
> ||-- @date: string (nullable = true)
> ||-- @time: string (nullable = true)
> ||-- @timeID: string (nullable = true)
> ||-- TranCode: string (nullable = true)
> ||-- custName: string (nullable = true)
> ||-- evtInfo: array (nullable = true)
> |||-- element: string (containsNull = true)
> ||-- Type: string (nullable = true)
> ||-- opID: string (nullable = true)
> ||-- tracNbr: string (nullable = true)
>  
>  
> DataBase Table Schema:
> CREATE TABLE public.test
> (
>id bigint NOT NULL,   
>evtInfo jsonb NOT NULL,
>evt_val bigint NOT NULL
> )
>  
> When I use dataFrame_toSave.write.mode(SaveMode.Append).jdbc(dbUrl, 
> "public.test", dbPropForDFtoSave) to save the data, I am seeing the below 
> error.
>  
> Exception in thread "main" java.lang.IllegalArgumentException: Can't get JDBC 
> type for 
> struct<@dateEvt:string,@timeEvt:string,@timeID:string,CICSTranCode:string,custName:string,evtInfo:array<string>,evtType:string,operID:string,trackingNbr:string>
>  
> Can you please suggest the best approach to save the DataFrame into the 
> Postgres JSONB table?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19392) Throw an exception "NoSuchElementException: key not found: scale" in OracleDialect

2017-02-22 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15878598#comment-15878598
 ] 

Sean Owen commented on SPARK-19392:
---

It's because SPARK-16625 was back-ported to the 1.6.x release in CDH, I guess 
because of a customer problem.
It's at 
https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/sql/core/src/main/scala/org/apache/spark/sql/jdbc/OracleDialect.scala#L33

I could easily back-port SPARK-16625 to 2.0.x and 1.6.x upstream. I don't think 
there was a particular reason it wasn't, so I'll ask if there are any 
objections and then do so.

> Throw an exception "NoSuchElementException: key not found: scale" in 
> OracleDialect
> --
>
> Key: SPARK-19392
> URL: https://issues.apache.org/jira/browse/SPARK-19392
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> In OracleDialect, if you use Numeric types in `DataFrameWriter` with Oracle 
> jdbc, this throws an exception below;
> {code}
>   java.util.NoSuchElementException: key not found: scale  at 
> scala.collection.MapLike$class.default(MapLike.scala:228)
>   at scala.collection.AbstractMap.default(Map.scala:59)  at 
> scala.collection.MapLike$class.apply(MapLike.scala:141)
> {code}
> This ticket comes from 
> https://www.mail-archive.com/user@spark.apache.org/msg61280.html.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19392) Throw an exception "NoSuchElementException: key not found: scale" in OracleDialect

2017-02-22 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15878570#comment-15878570
 ] 

Takeshi Yamamuro commented on SPARK-19392:
--

The entry at line 33 of `OracleDialect` does not exist in the community-released 
v1.6.0 (see: 
https://github.com/apache/spark/blob/v1.6.0/sql/core/src/main/scala/org/apache/spark/sql/jdbc/OracleDialect.scala#L33).
So it'd be better to ask the Cloudera guys. Thanks!

> Throw an exception "NoSuchElementException: key not found: scale" in 
> OracleDialect
> --
>
> Key: SPARK-19392
> URL: https://issues.apache.org/jira/browse/SPARK-19392
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> In OracleDialect, if you use Numeric types in `DataFrameWriter` with Oracle 
> jdbc, this throws an exception below;
> {code}
>   java.util.NoSuchElementException: key not found: scale  at 
> scala.collection.MapLike$class.default(MapLike.scala:228)
>   at scala.collection.AbstractMap.default(Map.scala:59)  at 
> scala.collection.MapLike$class.apply(MapLike.scala:141)
> {code}
> This ticket comes from 
> https://www.mail-archive.com/user@spark.apache.org/msg61280.html.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19392) Throw an exception "NoSuchElementException: key not found: scale" in OracleDialect

2017-02-22 Thread Hokam Singh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15878520#comment-15878520
 ] 

Hokam Singh Chauhan commented on SPARK-19392:
-

I am also getting a similar issue with Spark 1.6.0 in a CDH 5.8.3 environment.

17/02/22 18:26:11 INFO com.AppReceiver: Query in AppReceiver : select * from 
tutorials
17/02/22 18:26:14 ERROR com.AppDriver: Failed to start the driver for 
JDBCOracleApp
java.util.NoSuchElementException: key not found: scale
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at scala.collection.AbstractMap.apply(Map.scala:58)
at org.apache.spark.sql.types.Metadata.get(Metadata.scala:108)
at org.apache.spark.sql.types.Metadata.getLong(Metadata.scala:51)
at 
org.apache.spark.sql.jdbc.OracleDialect$.getCatalystType(OracleDialect.scala:33)
at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:140)
at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:91)
at 
org.apache.spark.sql.execution.datasources.jdbc.DefaultSource.createRelation(DefaultSource.scala:57)
at 
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:158)

Can anyone please help resolve this in the existing version?
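
One possible (untested) workaround, sketched under the assumption that the 
failure comes from the dialect's handling of unscaled Oracle NUMBER columns: 
push a conversion into the JDBC query so the problematic column arrives as a 
character type, then cast it back on the Spark side. The URL and column names 
below are placeholders apart from `tutorials` from the log.

{code}
import org.apache.spark.sql.functions.col

// Sketch only: TO_CHAR keeps the dialect away from the NUMBER/scale code path.
val df = sqlContext.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@//host:1521/service")   // placeholder URL
  .option("driver", "oracle.jdbc.OracleDriver")
  .option("dbtable", "(SELECT TO_CHAR(id) AS id, name FROM tutorials) t")
  .load()
  .withColumn("id", col("id").cast("long"))
{code}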

> Throw an exception "NoSuchElementException: key not found: scale" in 
> OracleDialect
> --
>
> Key: SPARK-19392
> URL: https://issues.apache.org/jira/browse/SPARK-19392
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> In OracleDialect, if you use Numeric types in `DataFrameWriter` with Oracle 
> jdbc, this throws an exception below;
> {code}
>   java.util.NoSuchElementException: key not found: scale  at 
> scala.collection.MapLike$class.default(MapLike.scala:228)
>   at scala.collection.AbstractMap.default(Map.scala:59)  at 
> scala.collection.MapLike$class.apply(MapLike.scala:141)
> {code}
> This ticket comes from 
> https://www.mail-archive.com/user@spark.apache.org/msg61280.html.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-7869) Spark Data Frame Fails to Load Postgres Tables with JSONB DataType Columns

2017-02-22 Thread Praveen Tallapudi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15878433#comment-15878433
 ] 

Praveen Tallapudi edited comment on SPARK-7869 at 2/22/17 3:28 PM:
---

Hi Nipun, I am using Spark. Is there a way to insert JSONB data into Postgres? 
We have a new project in the design phase and are thinking of using Apache 
Spark + a Postgres DB, but we are facing issues while inserting the JSONB data 
type.
Does Spark support Postgres JSONB? Can you please help us? I have posted this 
question in the issues but got no response. We really need help; can you please 
let us know if there is a way of inserting it?


was (Author: praveen.tallapudi):
Hi Nipun, I am using Spark. Is there a way to insert JSONB data into Postgres? 
We have a new project in the design phase and are thinking of using Apache 
Spark + a Postgres DB, but we are facing issues while inserting the JSONB data 
type.
Does Spark support Postgres JSONB? Can you please help us? I have posted this 
question in the issues but got no response. Can you please help? We really need 
help, can you please help?

> Spark Data Frame Fails to Load Postgres Tables with JSONB DataType Columns
> --
>
> Key: SPARK-7869
> URL: https://issues.apache.org/jira/browse/SPARK-7869
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.3.0, 1.3.1
> Environment: Spark 1.3.1
>Reporter: Brad Willard
>Assignee: Alexey Grishchenko
>Priority: Minor
> Fix For: 1.6.0
>
>
> Most of our tables load into dataframes just fine with postgres. However we 
> have a number of tables leveraging the JSONB datatype. Spark will error and 
> refuse to load this table. While asking for Spark to support JSONB might be a 
> tall order in the short term, it would be great if Spark would at least load 
> the table ignoring the columns it can't load or have it be an option.
> {code}
> pdf = sql_context.load(source="jdbc", url=url, dbtable="table_of_json")
> Py4JJavaError: An error occurred while calling o41.load.
> : java.sql.SQLException: Unsupported type 
> at org.apache.spark.sql.jdbc.JDBCRDD$.getCatalystType(JDBCRDD.scala:78)
> at org.apache.spark.sql.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:112)
> at org.apache.spark.sql.jdbc.JDBCRelation.<init>(JDBCRelation.scala:133)
> at 
> org.apache.spark.sql.jdbc.DefaultSource.createRelation(JDBCRelation.scala:121)
> at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:219)
> at org.apache.spark.sql.SQLContext.load(SQLContext.scala:697)
> at org.apache.spark.sql.SQLContext.load(SQLContext.scala:685)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
> at py4j.Gateway.invoke(Gateway.java:259)
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:207)
> at java.lang.Thread.run(Thread.java:745)
> {code}
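
For readers hitting this on the affected versions, a commonly cited workaround 
(a sketch only; written in Scala here, and `payload` is a hypothetical column 
name while table_of_json comes from the report) is to push a cast to text into 
the JDBC query so the reader never sees the jsonb column type:

{code}
// Cast the jsonb column to text inside the pushed-down query; Spark then maps
// it to a plain string column and the table loads.
val pdf = sqlContext.read.format("jdbc")
  .option("url", url)   // same JDBC URL as in the report
  .option("dbtable", "(SELECT id, payload::text AS payload FROM table_of_json) t")
  .load()
{code}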



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19692) Comparison on BinaryType has incorrect results

2017-02-22 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15878443#comment-15878443
 ] 

Sean Owen edited comment on SPARK-19692 at 2/22/17 3:16 PM:


Bytes are signed in the JVM, and thus in Scala and Java. It's always been this 
way everywhere and isn't specific to Spark. 0x8C, as a byte, is a way of 
writing -116, not a positive value. 0x8C is a positive integer literal, but 
when cast to a byte, it's a negative 2s-complement value.
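
To make the signedness concrete, a small standalone Scala sketch (plain Scala, 
not Spark) showing the value of 0x8C as a byte, plus an unsigned comparison via 
masking along the lines of the workaround quoted in the report:

{code}
val b: Byte = 0x8C.toByte
println(b)          // -116: bytes are signed two's-complement on the JVM
println(b & 0xff)   // 140: masking widens to Int and recovers the unsigned value

// Unsigned lexicographic comparison of byte arrays.
def compareUnsigned(x: Array[Byte], y: Array[Byte]): Int = {
  val len = math.min(x.length, y.length)
  var i = 0
  while (i < len) {
    val res = (x(i) & 0xff) - (y(i) & 0xff)
    if (res != 0) return res
    i += 1
  }
  x.length - y.length
}
{code}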


was (Author: srowen):
Bytes are signed in the JVM, and thus in Scala and Java. It's always been this 
way everywhere and isn't specific to Spark. 0x8C is a way of writing -116, not 
a positive value.

> Comparison on BinaryType has incorrect results
> --
>
> Key: SPARK-19692
> URL: https://issues.apache.org/jira/browse/SPARK-19692
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Don Smith 
>
> I believe there is an issue with comparisons on binary fields:
> {code}
>   val sc = SparkSession.builder.appName("test").getOrCreate()
>   val schema = StructType(Seq(StructField("ip", BinaryType)))
>   val ips = Seq("1.1.1.1", "2.2.2.2", "200.10.6.7").map(s => 
> InetAddress.getByName(s).getAddress)
>   val df = sc.createDataFrame(
> sc.sparkContext.parallelize(ips, 1).map { ip =>
>   Row(ip)
> }, schema
>   )
>   val query = df
> .where(df("ip") >= InetAddress.getByName("200.10.0.0").getAddress)
> .where(df("ip") <= InetAddress.getByName("200.10.255.255").getAddress)
>   logger.info(query.explain(true))
>   val results = query.collect()
>   results.length mustEqual 1
> {code}
> returns no results.
> I believe the problem is that the comparison is coercing the bytes to signed 
> integers in the call to compareTo here in TypeUtils: 
> {code}
>   def compareBinary(x: Array[Byte], y: Array[Byte]): Int = {
> for (i <- 0 until x.length; if i < y.length) {
>   val res = x(i).compareTo(y(i))
>   if (res != 0) return res
> }
> x.length - y.length
>   }
> {code}
> With some hacky testing I was able to get the desired results with: {code} 
> val res = (x(i).toByte & 0xff) - (y(i).toByte & 0xff) {code}
> thanks!



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19692) Comparison on BinaryType has incorrect results

2017-02-22 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15878443#comment-15878443
 ] 

Sean Owen commented on SPARK-19692:
---

Bytes are signed in the JVM, and thus in Scala and Java. It's always been this 
way everywhere and isn't specific to Spark. 0x8C is a way of writing -116, not 
a positive value.

> Comparison on BinaryType has incorrect results
> --
>
> Key: SPARK-19692
> URL: https://issues.apache.org/jira/browse/SPARK-19692
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Don Smith 
>
> I believe there is an issue with comparisons on binary fields:
> {code}
>   val sc = SparkSession.builder.appName("test").getOrCreate()
>   val schema = StructType(Seq(StructField("ip", BinaryType)))
>   val ips = Seq("1.1.1.1", "2.2.2.2", "200.10.6.7").map(s => 
> InetAddress.getByName(s).getAddress)
>   val df = sc.createDataFrame(
> sc.sparkContext.parallelize(ips, 1).map { ip =>
>   Row(ip)
> }, schema
>   )
>   val query = df
> .where(df("ip") >= InetAddress.getByName("200.10.0.0").getAddress)
> .where(df("ip") <= InetAddress.getByName("200.10.255.255").getAddress)
>   logger.info(query.explain(true))
>   val results = query.collect()
>   results.length mustEqual 1
> {code}
> returns no results.
> I believe the problem is that the comparison is coercing the bytes to signed 
> integers in the call to compareTo here in TypeUtils: 
> {code}
>   def compareBinary(x: Array[Byte], y: Array[Byte]): Int = {
> for (i <- 0 until x.length; if i < y.length) {
>   val res = x(i).compareTo(y(i))
>   if (res != 0) return res
> }
> x.length - y.length
>   }
> {code}
> With some hacky testing I was able to get the desired results with: {code} 
> val res = (x(i).toByte & 0xff) - (y(i).toByte & 0xff) {code}
> thanks!



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7869) Spark Data Frame Fails to Load Postgres Tables with JSONB DataType Columns

2017-02-22 Thread Praveen Tallapudi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15878433#comment-15878433
 ] 

Praveen Tallapudi commented on SPARK-7869:
--

Hi Nipun, I am using Spark. Is there a way to insert JSONB data into Postgres? 
We have a new project in the design phase and are thinking of using Apache 
Spark + a Postgres DB, but we are facing issues while inserting the JSONB data 
type.
Does Spark support Postgres JSONB? Can you please help us? I have posted this 
question in the issues but got no response. Can you please help? We really need 
help, can you please help?

> Spark Data Frame Fails to Load Postgres Tables with JSONB DataType Columns
> --
>
> Key: SPARK-7869
> URL: https://issues.apache.org/jira/browse/SPARK-7869
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.3.0, 1.3.1
> Environment: Spark 1.3.1
>Reporter: Brad Willard
>Assignee: Alexey Grishchenko
>Priority: Minor
> Fix For: 1.6.0
>
>
> Most of our tables load into dataframes just fine with postgres. However we 
> have a number of tables leveraging the JSONB datatype. Spark will error and 
> refuse to load this table. While asking for Spark to support JSONB might be a 
> tall order in the short term, it would be great if Spark would at least load 
> the table ignoring the columns it can't load or have it be an option.
> {code}
> pdf = sql_context.load(source="jdbc", url=url, dbtable="table_of_json")
> Py4JJavaError: An error occurred while calling o41.load.
> : java.sql.SQLException: Unsupported type 
> at org.apache.spark.sql.jdbc.JDBCRDD$.getCatalystType(JDBCRDD.scala:78)
> at org.apache.spark.sql.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:112)
> at org.apache.spark.sql.jdbc.JDBCRelation.<init>(JDBCRelation.scala:133)
> at 
> org.apache.spark.sql.jdbc.DefaultSource.createRelation(JDBCRelation.scala:121)
> at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:219)
> at org.apache.spark.sql.SQLContext.load(SQLContext.scala:697)
> at org.apache.spark.sql.SQLContext.load(SQLContext.scala:685)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
> at py4j.Gateway.invoke(Gateway.java:259)
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:207)
> at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19692) Comparison on BinaryType has incorrect results

2017-02-22 Thread Don Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15878431#comment-15878431
 ] 

Don Smith  commented on SPARK-19692:


An even more trivial example:
{code}
  val sc = SparkSession.builder.appName("test").getOrCreate()
  val schema = StructType(Seq(StructField("byte", BinaryType)))

  val byte = Seq(Array(0x8C.toByte))

  val df = sc.createDataFrame(
sc.sparkContext.parallelize(byte, 1).map { ip =>
  SQLRow(ip)
}, schema
  )

  logger.info(df.show)

  val query = df
.where(df("byte") >= Array(0x00.toByte))
.where(df("byte") <= Array(0xFF.toByte))

  logger.info(query.explain(true))
  val results = query.collect()
  results.length mustEqual 1
{code}

I'm having trouble believing this is the expected behavior, and if it is, is it 
defined somewhere?


> Comparison on BinaryType has incorrect results
> --
>
> Key: SPARK-19692
> URL: https://issues.apache.org/jira/browse/SPARK-19692
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Don Smith 
>
> I believe there is an issue with comparisons on binary fields:
> {code}
>   val sc = SparkSession.builder.appName("test").getOrCreate()
>   val schema = StructType(Seq(StructField("ip", BinaryType)))
>   val ips = Seq("1.1.1.1", "2.2.2.2", "200.10.6.7").map(s => 
> InetAddress.getByName(s).getAddress)
>   val df = sc.createDataFrame(
> sc.sparkContext.parallelize(ips, 1).map { ip =>
>   Row(ip)
> }, schema
>   )
>   val query = df
> .where(df("ip") >= InetAddress.getByName("200.10.0.0").getAddress)
> .where(df("ip") <= InetAddress.getByName("200.10.255.255").getAddress)
>   logger.info(query.explain(true))
>   val results = query.collect()
>   results.length mustEqual 1
> {code}
> returns no results.
> I believe the problem is that the comparison is coercing the bytes to signed 
> integers in the call to compareTo here in TypeUtils: 
> {code}
>   def compareBinary(x: Array[Byte], y: Array[Byte]): Int = {
> for (i <- 0 until x.length; if i < y.length) {
>   val res = x(i).compareTo(y(i))
>   if (res != 0) return res
> }
> x.length - y.length
>   }
> {code}
> With some hacky testing I was able to get the desired results with: {code} 
> val res = (x(i).toByte & 0xff) - (y(i).toByte & 0xff) {code}
> thanks!



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19659) Fetch big blocks to disk when shuffle-read

2017-02-22 Thread jin xing (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15878389#comment-15878389
 ] 

jin xing commented on SPARK-19659:
--

[~irashid]
Thanks a lot for your comments. I will file a design PDF later this week, and 
your concerns will be addressed in it.

> Fetch big blocks to disk when shuffle-read
> --
>
> Key: SPARK-19659
> URL: https://issues.apache.org/jira/browse/SPARK-19659
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.1.0
>Reporter: jin xing
>
> Currently the whole block is fetched into memory (off-heap by default) during 
> shuffle-read. A block is defined by (shuffleId, mapId, reduceId), so it can 
> be large in skew situations. If OOM happens during shuffle read, the job will 
> be killed and users will be notified to "Consider boosting 
> spark.yarn.executor.memoryOverhead". Adjusting the parameter and allocating 
> more memory can resolve the OOM. However, that approach is not well suited to 
> a production environment, especially a data warehouse.
> Using Spark SQL as the data engine in a warehouse, users want a unified 
> parameter (e.g. memory) with less resource wasted (resource that is allocated 
> but not used).
> It's not always easy to predict skew situations; when they happen, it makes 
> sense to fetch remote blocks to disk for shuffle-read, rather than kill the 
> job because of OOM. This approach was mentioned during the discussion in 
> SPARK-3019, by [~sandyr] and [~mridulm80].



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19679) Destroy broadcasted object without blocking

2017-02-22 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-19679.

   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17016
[https://github.com/apache/spark/pull/17016]

> Destroy broadcasted object without blocking
> ---
>
> Key: SPARK-19679
> URL: https://issues.apache.org/jira/browse/SPARK-19679
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: zhengruifeng
>Priority: Trivial
> Fix For: 2.2.0
>
>
> Destroy broadcasted object without blocking in ML



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19679) Destroy broadcasted object without blocking

2017-02-22 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-19679:
--

Assignee: zhengruifeng

> Destroy broadcasted object without blocking
> ---
>
> Key: SPARK-19679
> URL: https://issues.apache.org/jira/browse/SPARK-19679
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Trivial
> Fix For: 2.2.0
>
>
> Destroy broadcasted object without blocking in ML



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19694) Add missing 'setTopicDistributionCol' for LDAModel

2017-02-22 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-19694:
--

Assignee: zhengruifeng

> Add missing 'setTopicDistributionCol' for LDAModel
> --
>
> Key: SPARK-19694
> URL: https://issues.apache.org/jira/browse/SPARK-19694
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Trivial
> Fix For: 2.2.0
>
>
> {{LDAModel}} cannot currently set the output column.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19694) Add missing 'setTopicDistributionCol' for LDAModel

2017-02-22 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-19694.

   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17021
[https://github.com/apache/spark/pull/17021]

> Add missing 'setTopicDistributionCol' for LDAModel
> --
>
> Key: SPARK-19694
> URL: https://issues.apache.org/jira/browse/SPARK-19694
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: zhengruifeng
>Priority: Trivial
> Fix For: 2.2.0
>
>
> {{LDAModel}} cannot currently set the output column.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19691) Calculating percentile of decimal column fails with ClassCastException

2017-02-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19691:


Assignee: Apache Spark

> Calculating percentile of decimal column fails with ClassCastException
> --
>
> Key: SPARK-19691
> URL: https://issues.apache.org/jira/browse/SPARK-19691
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Assignee: Apache Spark
>
> Running
> {code}
> spark.range(10).selectExpr("cast (id as decimal) as 
> x").selectExpr("percentile(x, 0.5)").collect()
> {code}
> results in a ClassCastException:
> {code}
>  java.lang.ClassCastException: org.apache.spark.sql.types.Decimal cannot be 
> cast to java.lang.Number
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.Percentile.update(Percentile.scala:141)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.Percentile.update(Percentile.scala:58)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.update(interfaces.scala:514)
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:171)
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:171)
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:187)
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:181)
>   at 
> org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.processInputs(ObjectAggregationIterator.scala:151)
>   at 
> org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.<init>(ObjectAggregationIterator.scala:78)
>   at 
> org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:109)
>   at 
> org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:101)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>   at org.apache.spark.scheduler.Task.run(Task.scala:113)
> {code}
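
Until a fix is available, one workaround (a sketch only, and it trades decimal 
precision for the double representation) is to cast the column to double before 
aggregating:

{code}
// Avoids the Decimal -> java.lang.Number cast by computing the percentile
// over a double column instead of a decimal one.
spark.range(10)
  .selectExpr("cast(id as decimal) as x")
  .selectExpr("percentile(cast(x as double), 0.5)")
  .collect()
{code}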



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19691) Calculating percentile of decimal column fails with ClassCastException

2017-02-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15878270#comment-15878270
 ] 

Apache Spark commented on SPARK-19691:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/17028

> Calculating percentile of decimal column fails with ClassCastException
> --
>
> Key: SPARK-19691
> URL: https://issues.apache.org/jira/browse/SPARK-19691
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>
> Running
> {code}
> spark.range(10).selectExpr("cast (id as decimal) as 
> x").selectExpr("percentile(x, 0.5)").collect()
> {code}
> results in a ClassCastException:
> {code}
>  java.lang.ClassCastException: org.apache.spark.sql.types.Decimal cannot be 
> cast to java.lang.Number
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.Percentile.update(Percentile.scala:141)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.Percentile.update(Percentile.scala:58)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.update(interfaces.scala:514)
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:171)
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:171)
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:187)
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:181)
>   at 
> org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.processInputs(ObjectAggregationIterator.scala:151)
>   at 
> org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.<init>(ObjectAggregationIterator.scala:78)
>   at 
> org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:109)
>   at 
> org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:101)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>   at org.apache.spark.scheduler.Task.run(Task.scala:113)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19691) Calculating percentile of decimal column fails with ClassCastException

2017-02-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19691:


Assignee: (was: Apache Spark)

> Calculating percentile of decimal column fails with ClassCastException
> --
>
> Key: SPARK-19691
> URL: https://issues.apache.org/jira/browse/SPARK-19691
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>
> Running
> {code}
> spark.range(10).selectExpr("cast (id as decimal) as 
> x").selectExpr("percentile(x, 0.5)").collect()
> {code}
> results in a ClassCastException:
> {code}
>  java.lang.ClassCastException: org.apache.spark.sql.types.Decimal cannot be 
> cast to java.lang.Number
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.Percentile.update(Percentile.scala:141)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.Percentile.update(Percentile.scala:58)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.update(interfaces.scala:514)
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:171)
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:171)
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:187)
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:181)
>   at 
> org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.processInputs(ObjectAggregationIterator.scala:151)
>   at 
> org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.<init>(ObjectAggregationIterator.scala:78)
>   at 
> org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:109)
>   at 
> org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:101)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>   at org.apache.spark.scheduler.Task.run(Task.scala:113)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19650) Metastore-only operations shouldn't trigger a spark job

2017-02-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19650:


Assignee: Apache Spark  (was: Sameer Agarwal)

> Metastore-only operations shouldn't trigger a spark job
> ---
>
> Key: SPARK-19650
> URL: https://issues.apache.org/jira/browse/SPARK-19650
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Sameer Agarwal
>Assignee: Apache Spark
>
> We currently trigger a Spark job even for simple metastore operations ({{SHOW 
> TABLES}}, {{SHOW DATABASES}}, {{CREATE TABLE}}, etc.). Even though these 
> otherwise get executed on the driver, this prevents a user from performing 
> such operations on a driver-only cluster.
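
For anyone who wants to verify the behaviour, a small sketch (assuming a 
running SparkSession named `spark`, e.g. in spark-shell) that counts the jobs 
scheduled around a metastore-only command:

{code}
import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

// Count how many jobs get scheduled while running a metastore-only command.
val jobs = new AtomicInteger(0)
spark.sparkContext.addSparkListener(new SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = { jobs.incrementAndGet() }
})

spark.sql("SHOW TABLES").collect()   // ideally this should not schedule any job
Thread.sleep(1000)                   // listener events are delivered asynchronously
println(s"jobs scheduled: ${jobs.get()}")
{code}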



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19650) Metastore-only operations shouldn't trigger a spark job

2017-02-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15878260#comment-15878260
 ] 

Apache Spark commented on SPARK-19650:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/17027

> Metastore-only operations shouldn't trigger a spark job
> ---
>
> Key: SPARK-19650
> URL: https://issues.apache.org/jira/browse/SPARK-19650
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Sameer Agarwal
>Assignee: Sameer Agarwal
>
> We currently trigger a Spark job even for simple metastore operations ({{SHOW 
> TABLES}}, {{SHOW DATABASES}}, {{CREATE TABLE}}, etc.). Even though these 
> otherwise get executed on the driver, this prevents a user from performing 
> such operations on a driver-only cluster.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19650) Metastore-only operations shouldn't trigger a spark job

2017-02-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19650:


Assignee: Sameer Agarwal  (was: Apache Spark)

> Metastore-only operations shouldn't trigger a spark job
> ---
>
> Key: SPARK-19650
> URL: https://issues.apache.org/jira/browse/SPARK-19650
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Sameer Agarwal
>Assignee: Sameer Agarwal
>
> We currently trigger a Spark job even for simple metastore operations ({{SHOW 
> TABLES}}, {{SHOW DATABASES}}, {{CREATE TABLE}}, etc.). Even though these 
> otherwise get executed on the driver, this prevents a user from performing 
> such operations on a driver-only cluster.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


