[jira] [Commented] (SPARK-15790) Audit @Since annotations in ML

2017-03-09 Thread Ehsun Behravesh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15904652#comment-15904652
 ] 

Ehsun Behravesh commented on SPARK-15790:
-

Does this JIRA still need someone to work on it? 

> Audit @Since annotations in ML
> --
>
> Key: SPARK-15790
> URL: https://issues.apache.org/jira/browse/SPARK-15790
> Project: Spark
>  Issue Type: Documentation
>  Components: ML, PySpark
>Reporter: Nick Pentreath
>Assignee: Nick Pentreath
>
> Many classes & methods in ML are missing {{@Since}} annotations. Audit what's 
> missing and add annotations to public API constructors, vals and methods.
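
For anyone picking this up, a minimal sketch of the kind of annotation the audit would add; the class, members and version strings below are hypothetical and only illustrate placement on a public constructor, val and method:

{code}
import org.apache.spark.annotation.Since

// Hypothetical ML class used only to show where @Since goes; the real audit
// would annotate existing public API in spark.ml with the actual versions.
@Since("2.2.0")
class ExampleEstimator @Since("2.2.0") (@Since("2.2.0") val uid: String) {

  @Since("2.2.0")
  def setInputCol(value: String): this.type = this
}
{code}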



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19894) Tasks entirely assigned to one executor on Yarn-cluster mode for default-rack

2017-03-09 Thread Yuechen Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15904621#comment-15904621
 ] 

Yuechen Chen commented on SPARK-19894:
--

https://github.com/apache/spark/pull/17238

> Tasks entirely assigned to one executor on Yarn-cluster mode for default-rack
> -
>
> Key: SPARK-19894
> URL: https://issues.apache.org/jira/browse/SPARK-19894
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, YARN
>Affects Versions: 2.1.0
> Environment: Yarn-cluster
>Reporter: Yuechen Chen
>
> In YARN-cluster mode, if the driver has no rack information for two different 
> hosts, these two hosts would both be recognized as "/default-rack", which 
> may cause bugs.
> For example, if the hosts of one executor and one external data source are 
> unknown to the driver, the two hosts would be recognized as the same rack 
> "/default-rack", and then all tasks would be assigned to that executor.
> This bug would be avoided if getRackForHost("unknown host") in YarnScheduler 
> returned None instead of Some("/default-rack").






[jira] [Comment Edited] (SPARK-19885) The config ignoreCorruptFiles doesn't work for CSV

2017-03-09 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15904606#comment-15904606
 ] 

Wenchen Fan edited comment on SPARK-19885 at 3/10/17 7:22 AM:
--

This is because we support different charsets for CSV files, while our text file 
format only supports UTF8, so we have to use `HadoopRDD` when inferring the schema 
for the CSV data source, and `HadoopRDD` doesn't recognize the ignoreCorruptFiles option.

I've checked the history: this feature has been there since the first day we 
introduced the CSV data source. However, all other text-based data sources support 
only UTF8, and CSV with wholeFile enabled also supports only UTF8.

Shall we just remove the support for different charsets, or support this 
feature for all text-based data sources?

cc [~hyukjin.kwon]


was (Author: cloud_fan):
This is because we support different charset for CSV files, and our text file 
format only supports UTF8, so we have to use `HadoopRDD` when infer schema for 
CSV data source.

I've checked the history, this feature was there the first day we introduce CSV 
data source. However, all other text-based data source support only UTF8, also 
CSV with wholeFile enabled only supports UTF8.

shall we just remove the support for different charsets? or support this 
feature for all text-based data source?

cc [~hyukjin.kwon]

> The config ignoreCorruptFiles doesn't work for CSV
> --
>
> Key: SPARK-19885
> URL: https://issues.apache.org/jira/browse/SPARK-19885
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
>
> CSVFileFormat.inferSchema doesn't use FileScanRDD so the SQL 
> "ignoreCorruptFiles" doesn't work.
> {code}
> java.io.EOFException: Unexpected end of input stream
>   at 
> org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:145)
>   at 
> org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
>   at java.io.InputStream.read(InputStream.java:101)
>   at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
>   at 
> org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
>   at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
>   at 
> org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:248)
>   at 
> org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:48)
>   at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:266)
>   at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:211)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
>   at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:214)
>   at scala.collection.AbstractIterator.aggregate(Iterator.scala:1336)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1112)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1112)
>   at 
> org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1980)
>   at 
> org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1980)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at 

[jira] [Commented] (SPARK-19885) The config ignoreCorruptFiles doesn't work for CSV

2017-03-09 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15904606#comment-15904606
 ] 

Wenchen Fan commented on SPARK-19885:
-

This is because we support different charsets for CSV files, while our text file 
format only supports UTF8, so we have to use `HadoopRDD` when inferring the schema 
for the CSV data source.

I've checked the history: this feature has been there since the first day we 
introduced the CSV data source. However, all other text-based data sources support 
only UTF8, and CSV with wholeFile enabled also supports only UTF8.

Shall we just remove the support for different charsets, or support this 
feature for all text-based data sources?

cc [~hyukjin.kwon]
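
For context, a minimal sketch of the failure mode described above, assuming a SparkSession named spark and a truncated .csv.gz file somewhere under the (hypothetical) input path:

{code}
// The SQL option that FileScanRDD honours but HadoopRDD does not.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

// Schema inference for CSV still goes through HadoopRDD, so a corrupt gzip
// part file fails the job with java.io.EOFException instead of being skipped.
// The path is a placeholder.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/path/to/csv-with-corrupt-gz")
{code}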

> The config ignoreCorruptFiles doesn't work for CSV
> --
>
> Key: SPARK-19885
> URL: https://issues.apache.org/jira/browse/SPARK-19885
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
>
> CSVFileFormat.inferSchema doesn't use FileScanRDD so the SQL 
> "ignoreCorruptFiles" doesn't work.
> {code}
> java.io.EOFException: Unexpected end of input stream
>   at 
> org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:145)
>   at 
> org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
>   at java.io.InputStream.read(InputStream.java:101)
>   at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
>   at 
> org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
>   at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
>   at 
> org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:248)
>   at 
> org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:48)
>   at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:266)
>   at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:211)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
>   at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:214)
>   at scala.collection.AbstractIterator.aggregate(Iterator.scala:1336)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1112)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1112)
>   at 
> org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1980)
>   at 
> org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1980)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
>   at 
> 

[jira] [Assigned] (SPARK-19852) StringIndexer.setHandleInvalid should have another option 'new': Python API and docs

2017-03-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19852:


Assignee: (was: Apache Spark)

> StringIndexer.setHandleInvalid should have another option 'new': Python API 
> and docs
> 
>
> Key: SPARK-19852
> URL: https://issues.apache.org/jira/browse/SPARK-19852
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Update Python API for StringIndexer so setHandleInvalid doc is correct.  This 
> will probably require:
> * putting HandleInvalid within StringIndexer to update its built-in doc (See 
> Bucketizer for an example.)
> * updating API docs and maybe the guide
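
For reference, a sketch of the current Scala-side behaviour that the Python wrapper mirrors; the column names are placeholders, and the third handleInvalid value is only what this JIRA proposes, not something that exists yet:

{code}
import org.apache.spark.ml.feature.StringIndexer

// Today handleInvalid accepts "error" (the default) and "skip"; the new option
// discussed in this JIRA would be a third value, and the PySpark doc for
// setHandleInvalid needs to describe it as well.
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .setHandleInvalid("skip")
{code}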






[jira] [Commented] (SPARK-19852) StringIndexer.setHandleInvalid should have another option 'new': Python API and docs

2017-03-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15904590#comment-15904590
 ] 

Apache Spark commented on SPARK-19852:
--

User 'VinceShieh' has created a pull request for this issue:
https://github.com/apache/spark/pull/17237

> StringIndexer.setHandleInvalid should have another option 'new': Python API 
> and docs
> 
>
> Key: SPARK-19852
> URL: https://issues.apache.org/jira/browse/SPARK-19852
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Update Python API for StringIndexer so setHandleInvalid doc is correct.  This 
> will probably require:
> * putting HandleInvalid within StringIndexer to update its built-in doc (See 
> Bucketizer for an example.)
> * updating API docs and maybe the guide






[jira] [Assigned] (SPARK-19852) StringIndexer.setHandleInvalid should have another option 'new': Python API and docs

2017-03-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19852:


Assignee: Apache Spark

> StringIndexer.setHandleInvalid should have another option 'new': Python API 
> and docs
> 
>
> Key: SPARK-19852
> URL: https://issues.apache.org/jira/browse/SPARK-19852
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>Priority: Minor
>
> Update Python API for StringIndexer so setHandleInvalid doc is correct.  This 
> will probably require:
> * putting HandleInvalid within StringIndexer to update its built-in doc (See 
> Bucketizer for an example.)
> * updating API docs and maybe the guide






[jira] [Resolved] (SPARK-19891) Await Batch Lock not signaled on stream execution exit

2017-03-09 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-19891.
--
   Resolution: Fixed
 Assignee: Tyson Condie
Fix Version/s: 2.2.0
   2.1.1

> Await Batch Lock not signaled on stream execution exit
> --
>
> Key: SPARK-19891
> URL: https://issues.apache.org/jira/browse/SPARK-19891
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Tyson Condie
>Assignee: Tyson Condie
> Fix For: 2.1.1, 2.2.0
>
>







[jira] [Assigned] (SPARK-19893) Cannot run intersect/except with map type

2017-03-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19893:


Assignee: Wenchen Fan  (was: Apache Spark)

> Cannot run intersect/except with map type
> -
>
> Key: SPARK-19893
> URL: https://issues.apache.org/jira/browse/SPARK-19893
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0, 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>







[jira] [Created] (SPARK-19894) Tasks entirely assigned to one executor on Yarn-cluster mode for default-rack

2017-03-09 Thread Yuechen Chen (JIRA)
Yuechen Chen created SPARK-19894:


 Summary: Tasks entirely assigned to one executor on Yarn-cluster 
mode for default-rack
 Key: SPARK-19894
 URL: https://issues.apache.org/jira/browse/SPARK-19894
 Project: Spark
  Issue Type: Bug
  Components: Scheduler, YARN
Affects Versions: 2.1.0
 Environment: Yarn-cluster
Reporter: Yuechen Chen


In YARN-cluster mode, if the driver has no rack information for two different hosts, 
these two hosts would both be recognized as "/default-rack", which may cause bugs.
For example, if the hosts of one executor and one external data source are unknown 
to the driver, the two hosts would be recognized as the same rack 
"/default-rack", and then all tasks would be assigned to that executor.
This bug would be avoided if getRackForHost("unknown host") in YarnScheduler 
returned None instead of Some("/default-rack").
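
A minimal sketch of the proposed behaviour (an illustration of the description above, not necessarily the actual patch): resolve the rack as usual, but report None instead of the catch-all default so that two unknown hosts are not treated as rack-local to each other.

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.net.NetworkTopology
import org.apache.hadoop.yarn.util.RackResolver

// Map a host to its rack, but drop the "/default-rack" fallback so that an
// unresolved host yields None rather than a shared pseudo-rack.
def rackForHost(conf: Configuration, host: String): Option[String] =
  Option(RackResolver.resolve(conf, host).getNetworkLocation)
    .filter(_ != NetworkTopology.DEFAULT_RACK)
{code}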






[jira] [Assigned] (SPARK-19893) Cannot run intersect/except with map type

2017-03-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19893:


Assignee: Apache Spark  (was: Wenchen Fan)

> Cannot run intersect/except with map type
> -
>
> Key: SPARK-19893
> URL: https://issues.apache.org/jira/browse/SPARK-19893
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0, 2.2.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>







[jira] [Created] (SPARK-19893) Cannot run intersect/except with map type

2017-03-09 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-19893:
---

 Summary: Cannot run intersect/except with map type
 Key: SPARK-19893
 URL: https://issues.apache.org/jira/browse/SPARK-19893
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0, 2.0.2, 2.2.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan









[jira] [Assigned] (SPARK-19008) Avoid boxing/unboxing overhead of calling a lambda with primitive type from Dataset program

2017-03-09 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-19008:
---

Assignee: Kazuaki Ishizaki

> Avoid boxing/unboxing overhead of calling a lambda with primitive type from 
> Dataset program
> ---
>
> Key: SPARK-19008
> URL: https://issues.apache.org/jira/browse/SPARK-19008
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
> Fix For: 2.2.0
>
>
> In a 
> [discussion|https://github.com/apache/spark/pull/16391#discussion_r93788919] 
> between [~cloud_fan] and [~kiszk], we noticed an opportunity to avoid 
> boxing/unboxing overhead when a Dataset program calls a lambda, which 
> operates on a primitive type, written in Scala.
> In such a case, Catalyst can directly call a method {{ 
> apply();}} instead of {{Object apply(Object);}}.
> Of course, the best solution seems to be 
> [here|https://issues.apache.org/jira/browse/SPARK-14083].
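
As a rough illustration of the overhead being avoided (a sketch, not the generated code, assuming a SparkSession named spark as in the shell): the lambda below operates on a primitive Long, but when it is invoked through the erased Function1.apply(Object): Object entry point, every element is boxed on the way in and the result unboxed on the way out.

{code}
import spark.implicits._

// With SPARK-19008 the generated code can call the specialized primitive
// apply for this lambda instead of the boxed Object-based one.
val ds = spark.range(0, 1000000L).as[Long]
val doubled = ds.map(_ * 2)
doubled.count()
{code}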






[jira] [Commented] (SPARK-19893) Cannot run intersect/except with map type

2017-03-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15904587#comment-15904587
 ] 

Apache Spark commented on SPARK-19893:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/17236

> Cannot run intersect/except with map type
> -
>
> Key: SPARK-19893
> URL: https://issues.apache.org/jira/browse/SPARK-19893
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0, 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>







[jira] [Resolved] (SPARK-19008) Avoid boxing/unboxing overhead of calling a lambda with primitive type from Dataset program

2017-03-09 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-19008.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17172
[https://github.com/apache/spark/pull/17172]

> Avoid boxing/unboxing overhead of calling a lambda with primitive type from 
> Dataset program
> ---
>
> Key: SPARK-19008
> URL: https://issues.apache.org/jira/browse/SPARK-19008
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Kazuaki Ishizaki
> Fix For: 2.2.0
>
>
> In a 
> [discussion|https://github.com/apache/spark/pull/16391#discussion_r93788919] 
> between [~cloud_fan] and [~kiszk], we noticed an opportunity to avoid 
> boxing/unboxing overhead when a Dataset program calls a lambda, which 
> operates on a primitive type, written in Scala.
> In such a case, Catalyst can directly call a method {{ 
> apply();}} instead of {{Object apply(Object);}}.
> Of course, the best solution seems to be 
> [here|https://issues.apache.org/jira/browse/SPARK-14083].






[jira] [Commented] (SPARK-19143) API in Spark for distributing new delegation tokens (Improve delegation token handling in secure clusters)

2017-03-09 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15904497#comment-15904497
 ] 

Saisai Shao commented on SPARK-19143:
-

Hi all, I wrote a rough design doc based on the comments above; here is the 
link 
(https://docs.google.com/document/d/1DFWGHu4_GJapbbfXGWsot_z_W9Wka_39DFNmg9r9SAI/edit?usp=sharing).

[~tgraves] [~vanzin] [~mridulm80], please review and comment; I would greatly 
appreciate your suggestions.

> API in Spark for distributing new delegation tokens (Improve delegation token 
> handling in secure clusters)
> --
>
> Key: SPARK-19143
> URL: https://issues.apache.org/jira/browse/SPARK-19143
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Ruslan Dautkhanov
>
> Spin off from SPARK-14743 and comments chain in [recent comments| 
> https://issues.apache.org/jira/browse/SPARK-5493?focusedCommentId=15802179=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15802179]
>  in SPARK-5493.
> Spark currently doesn't have a way to distribute new delegation tokens. 
> Quoting [~vanzin] from SPARK-5493 
> {quote}
> IIRC Livy doesn't yet support delegation token renewal. Once it reaches the 
> TTL, the session is unusable.
> There might be ways to hack support for that without changes in Spark, but 
> I'd like to see a proper API in Spark for distributing new delegation tokens. 
> I mentioned that in SPARK-14743, but although that bug is closed, that 
> particular feature hasn't been implemented yet.
> {quote}
> Other thoughts?






[jira] [Assigned] (SPARK-19320) Allow guaranteed amount of GPU to be used when launching jobs

2017-03-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19320:


Assignee: (was: Apache Spark)

> Allow guaranteed amount of GPU to be used when launching jobs
> -
>
> Key: SPARK-19320
> URL: https://issues.apache.org/jira/browse/SPARK-19320
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Timothy Chen
>
> Currently the only configuration for using GPUs with Mesos sets the maximum 
> number of GPUs a job will take from an offer, but it doesn't guarantee exactly 
> how many.
> We should have a configuration that sets this.






[jira] [Commented] (SPARK-19320) Allow guaranteed amount of GPU to be used when launching jobs

2017-03-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15904484#comment-15904484
 ] 

Apache Spark commented on SPARK-19320:
--

User 'yanji84' has created a pull request for this issue:
https://github.com/apache/spark/pull/17235

> Allow guaranteed amount of GPU to be used when launching jobs
> -
>
> Key: SPARK-19320
> URL: https://issues.apache.org/jira/browse/SPARK-19320
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Timothy Chen
>
> Currently the only configuration for using GPUs with Mesos sets the maximum 
> number of GPUs a job will take from an offer, but it doesn't guarantee exactly 
> how many.
> We should have a configuration that sets this.






[jira] [Assigned] (SPARK-19320) Allow guaranteed amount of GPU to be used when launching jobs

2017-03-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19320:


Assignee: Apache Spark

> Allow guaranteed amount of GPU to be used when launching jobs
> -
>
> Key: SPARK-19320
> URL: https://issues.apache.org/jira/browse/SPARK-19320
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Timothy Chen
>Assignee: Apache Spark
>
> Currently the only configuration for using GPUs with Mesos sets the maximum 
> number of GPUs a job will take from an offer, but it doesn't guarantee exactly 
> how many.
> We should have a configuration that sets this.






[jira] [Commented] (SPARK-19320) Allow guaranteed amount of GPU to be used when launching jobs

2017-03-09 Thread Ji Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15904479#comment-15904479
 ] 

Ji Yan commented on SPARK-19320:


I'm proposing to add a configuration parameter (spark.mesos.gpus) to guarantee a 
hard limit on the number of GPUs. To avoid conflict, it will override 
spark.mesos.gpus.max whenever spark.mesos.gpus is greater than 0.
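
A small sketch of how the two settings would interact; spark.mesos.gpus.max already exists, while spark.mesos.gpus is only the configuration proposed here and is shown purely as an illustration:

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.mesos.gpus.max", "4") // existing: upper bound on GPUs taken from an offer
  .set("spark.mesos.gpus", "2")     // proposed: hard guarantee; overrides the max when > 0
{code}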

> Allow guaranteed amount of GPU to be used when launching jobs
> -
>
> Key: SPARK-19320
> URL: https://issues.apache.org/jira/browse/SPARK-19320
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Timothy Chen
>
> Currently the only configuration for using GPUs with Mesos sets the maximum 
> number of GPUs a job will take from an offer, but it doesn't guarantee exactly 
> how many.
> We should have a configuration that sets this.






[jira] [Commented] (SPARK-19892) Implement findAnalogies method for Word2VecModel

2017-03-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15904471#comment-15904471
 ] 

Apache Spark commented on SPARK-19892:
--

User 'benradford' has created a pull request for this issue:
https://github.com/apache/spark/pull/17234

> Implement findAnalogies method for Word2VecModel 
> -
>
> Key: SPARK-19892
> URL: https://issues.apache.org/jira/browse/SPARK-19892
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 2.1.0
>Reporter: Benjamin Radford
>Priority: Minor
>  Labels: features, newbie
>
> Word2VecModel is missing a method that allows for performing analogy-like 
> queries on word vectors (e.g. King + Woman - Man = Queen). This is a 
> functionality common to other word2vec implementations (see gensim) and is a 
> major component of word2vec's appeal as cited in seminal works on the model 
> (https://code.google.com/archive/p/word2vec/). 
> An implementation of this method, findAnalogies, should accept three 
> arguments:
> * positive - Array[String] of similar words
> * negative - Array[String] of dissimilar words
> * num - Int number of synonyms or nearest neighbors to calculated vector






[jira] [Assigned] (SPARK-19892) Implement findAnalogies method for Word2VecModel

2017-03-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19892:


Assignee: Apache Spark

> Implement findAnalogies method for Word2VecModel 
> -
>
> Key: SPARK-19892
> URL: https://issues.apache.org/jira/browse/SPARK-19892
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 2.1.0
>Reporter: Benjamin Radford
>Assignee: Apache Spark
>Priority: Minor
>  Labels: features, newbie
>
> Word2VecModel is missing a method that allows for performing analogy-like 
> queries on word vectors (e.g. King + Woman - Man = Queen). This is a 
> functionality common to other word2vec implementations (see gensim) and is a 
> major component of word2vec's appeal as cited in seminal works on the model 
> (https://code.google.com/archive/p/word2vec/). 
> An implementation of this method, findAnalogies, should accept three 
> arguments:
> * positive - Array[String] of similar words
> * negative - Array[String] of dissimilar words
> * num - Int number of synonyms or nearest neighbors to calculated vector






[jira] [Assigned] (SPARK-19892) Implement findAnalogies method for Word2VecModel

2017-03-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19892:


Assignee: (was: Apache Spark)

> Implement findAnalogies method for Word2VecModel 
> -
>
> Key: SPARK-19892
> URL: https://issues.apache.org/jira/browse/SPARK-19892
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 2.1.0
>Reporter: Benjamin Radford
>Priority: Minor
>  Labels: features, newbie
>
> Word2VecModel is missing a method that allows for performing analogy-like 
> queries on word vectors (e.g. King + Woman - Man = Queen). This is a 
> functionality common to other word2vec implementations (see gensim) and is a 
> major component of word2vec's appeal as cited in seminal works on the model 
> (https://code.google.com/archive/p/word2vec/). 
> An implementation of this method, findAnalogies, should accept three 
> arguments:
> * positive - Array[String] of similar words
> * negative - Array[String] of dissimilar words
> * num - Int number of synonyms or nearest neighbors to calculated vector






[jira] [Commented] (SPARK-14083) Analyze JVM bytecode and turn closures into Catalyst expressions

2017-03-09 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15904460#comment-15904460
 ] 

Kazuaki Ishizaki commented on SPARK-14083:
--

I rebased this with master: 
https://github.com/kiszk/spark/tree/expression-analysis4

> Analyze JVM bytecode and turn closures into Catalyst expressions
> 
>
> Key: SPARK-14083
> URL: https://issues.apache.org/jira/browse/SPARK-14083
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>
> One big advantage of the Dataset API is the type safety, at the cost of 
> performance due to heavy reliance on user-defined closures/lambdas. These 
> closures are typically slower than expressions because we have more 
> flexibility to optimize expressions (known data types, no virtual function 
> calls, etc). In many cases, it's actually not going to be very difficult to 
> look into the byte code of these closures and figure out what they are trying 
> to do. If we can understand them, then we can turn them directly into 
> Catalyst expressions for more optimized executions.
> Some examples are:
> {code}
> df.map(_.name)  // equivalent to expression col("name")
> ds.groupBy(_.gender)  // equivalent to expression col("gender")
> df.filter(_.age > 18)  // equivalent to expression GreaterThan(col("age"), 
> lit(18)
> df.map(_.id + 1)  // equivalent to Add(col("age"), lit(1))
> {code}
> The goal of this ticket is to design a small framework for byte code analysis 
> and use that to convert closures/lambdas into Catalyst expressions in order 
> to speed up Dataset execution. It is a little bit futuristic, but I believe 
> it is very doable. The framework should be easy to reason about (e.g. similar 
> to Catalyst).
> Note the big emphasis on "small" and "easy to reason about". A patch 
> should be rejected if it is too complicated or difficult to reason about.






[jira] [Commented] (SPARK-6634) Allow replacing columns in Transformers

2017-03-09 Thread Tree Field (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15904445#comment-15904445
 ] 

Tree Field commented on SPARK-6634:
---

I want this feature too, because I often have to override UnaryTransformer myself 
to enable it.

It seems this is only prevented in the transformSchema method.
Unlike before v1.4, the DataFrame withColumn method used in UnaryTransformer 
now allows replacing the input column.

Are there any other reasons this is not allowed in Transformers, especially in 
UnaryTransformer?
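
For illustration, the DataFrame behaviour referred to above (a sketch with a hypothetical df and column name): withColumn already replaces an existing column of the same name, which is what the long-term proposal would let Transformers expose.

{code}
import org.apache.spark.sql.functions.upper

// Overwrites the existing "text" column in place instead of adding a new one;
// df and the column name are placeholders used only for this example.
val replaced = df.withColumn("text", upper(df("text")))
{code}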





> Allow replacing columns in Transformers
> ---
>
> Key: SPARK-6634
> URL: https://issues.apache.org/jira/browse/SPARK-6634
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Currently, Transformers do not allow input and output columns to share the 
> same name.  (In fact, this is not allowed but also not even checked.)
> Short-term proposal: Disallow input and output columns with the same name, 
> and add a check in transformSchema.
> Long-term proposal: Allow input & output columns with the same name, and 
> where the behavior is that the output columns replace input columns with the 
> same name.






[jira] [Commented] (SPARK-11569) StringIndexer transform fails when column contains nulls

2017-03-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15904448#comment-15904448
 ] 

Apache Spark commented on SPARK-11569:
--

User 'crackcell' has created a pull request for this issue:
https://github.com/apache/spark/pull/17233

> StringIndexer transform fails when column contains nulls
> 
>
> Key: SPARK-11569
> URL: https://issues.apache.org/jira/browse/SPARK-11569
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 1.4.0, 1.5.0, 1.6.0
>Reporter: Maciej Szymkiewicz
>
> Transforming column containing {{null}} values using {{StringIndexer}} 
> results in {{java.lang.NullPointerException}}
> {code}
> from pyspark.ml.feature import StringIndexer
> df = sqlContext.createDataFrame([("a", 1), (None, 2)], ("k", "v"))
> df.printSchema()
> ## root
> ##  |-- k: string (nullable = true)
> ##  |-- v: long (nullable = true)
> indexer = StringIndexer(inputCol="k", outputCol="kIdx")
> indexer.fit(df).transform(df)
> ##  py4j.protocol.Py4JJavaError: An error occurred while calling o75.json.
> ## : java.lang.NullPointerException
> {code}
> Problem disappears when we drop 
> {code}
> df1 = df.na.drop()
> indexer.fit(df1).transform(df1)
> {code}
> or replace {{nulls}}
> {code}
> from pyspark.sql.functions import col, when
> k = col("k")
> df2 = df.withColumn("k", when(k.isNull(), "__NA__").otherwise(k))
> indexer.fit(df2).transform(df2)
> {code}
> and cannot be reproduced using Scala API
> {code}
> import org.apache.spark.ml.feature.StringIndexer
> val df = sc.parallelize(Seq(("a", 1), (null, 2))).toDF("k", "v")
> df.printSchema
> // root
> //  |-- k: string (nullable = true)
> //  |-- v: integer (nullable = false)
> val indexer = new StringIndexer().setInputCol("k").setOutputCol("kIdx")
> indexer.fit(df).transform(df).count
> // 2
> {code}






[jira] [Created] (SPARK-19892) Implement findAnalogies method for Word2VecModel

2017-03-09 Thread Benjamin Radford (JIRA)
Benjamin Radford created SPARK-19892:


 Summary: Implement findAnalogies method for Word2VecModel 
 Key: SPARK-19892
 URL: https://issues.apache.org/jira/browse/SPARK-19892
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 2.1.0
Reporter: Benjamin Radford
Priority: Minor


Word2VecModel is missing a method that allows for performing analogy-like 
queries on word vectors (e.g. King + Woman - Man = Queen). This is a 
functionality common to other word2vec implementations (see gensim) and is a 
major component of word2vec's appeal as cited in seminal works on the model 
(https://code.google.com/archive/p/word2vec/). 

An implementation of this method, findAnalogies, should accept three arguments:
* positive - Array[String] of similar words
* negative - Array[String] of dissimilar words
* num - Int number of synonyms or nearest neighbors to calculated vector
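
A sketch of the signature implied by the description above; the method does not exist in Spark, and the return type (word paired with a similarity score, mirroring findSynonyms) is an assumption:

{code}
// Hypothetical addition to Word2VecModel; body intentionally left unimplemented.
def findAnalogies(
    positive: Array[String],
    negative: Array[String],
    num: Int): Array[(String, Double)] = ???
{code}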






[jira] [Assigned] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore

2017-03-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18112:


Assignee: Apache Spark  (was: Xiao Li)

> Spark2.x does not support read data from Hive 2.x metastore
> ---
>
> Key: SPARK-18112
> URL: https://issues.apache.org/jira/browse/SPARK-18112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: KaiXu
>Assignee: Apache Spark
>Priority: Critical
>
> Hive 2.0 was released in February 2016, and Hive 2.0.1 and Hive 2.1.0 have 
> also been out for a long time, but Spark still only supports reading Hive 
> metastore data from Hive 1.2.1 and older versions. Since Hive 2.x has many 
> bug fixes and performance improvements, it is better and urgent to upgrade 
> to support Hive 2.x.
> Failed to load data from a Hive 2.x metastore:
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
> at 
> org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564)






[jira] [Assigned] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore

2017-03-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18112:


Assignee: Xiao Li  (was: Apache Spark)

> Spark2.x does not support read data from Hive 2.x metastore
> ---
>
> Key: SPARK-18112
> URL: https://issues.apache.org/jira/browse/SPARK-18112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: KaiXu
>Assignee: Xiao Li
>Priority: Critical
>
> Hive 2.0 was released in February 2016, and Hive 2.0.1 and Hive 2.1.0 have 
> also been out for a long time, but Spark still only supports reading Hive 
> metastore data from Hive 1.2.1 and older versions. Since Hive 2.x has many 
> bug fixes and performance improvements, it is better and urgent to upgrade 
> to support Hive 2.x.
> Failed to load data from a Hive 2.x metastore:
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
> at 
> org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564)






[jira] [Commented] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore

2017-03-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15904429#comment-15904429
 ] 

Apache Spark commented on SPARK-18112:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/17232

> Spark2.x does not support read data from Hive 2.x metastore
> ---
>
> Key: SPARK-18112
> URL: https://issues.apache.org/jira/browse/SPARK-18112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: KaiXu
>Assignee: Xiao Li
>Priority: Critical
>
> Hive 2.0 was released in February 2016, and Hive 2.0.1 and Hive 2.1.0 have 
> also been out for a long time, but Spark still only supports reading Hive 
> metastore data from Hive 1.2.1 and older versions. Since Hive 2.x has many 
> bug fixes and performance improvements, it is better and urgent to upgrade 
> to support Hive 2.x.
> Failed to load data from a Hive 2.x metastore:
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
> at 
> org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564)






[jira] [Resolved] (SPARK-17557) SQL query on parquet table java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary

2017-03-09 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-17557.
--
Resolution: Cannot Reproduce

^ I can't reproduce this either. Let me resolve it. Please reopen if anyone 
still faces this issue and I was mistaken.

> SQL query on parquet table java.lang.UnsupportedOperationException: 
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary
> -
>
> Key: SPARK-17557
> URL: https://issues.apache.org/jira/browse/SPARK-17557
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Egor Pahomov
>
> Working on 1.6.2, broken on 2.0
> {code}
> select * from logs.a where year=2016 and month=9 and day=14 limit 100
> {code}
> {code}
> java.lang.UnsupportedOperationException: 
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary
>   at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48)
>   at 
> org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:233)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> {code}






[jira] [Comment Edited] (SPARK-16283) Implement percentile_approx SQL function

2017-03-09 Thread Zhenhua Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15904397#comment-15904397
 ] 

Zhenhua Wang edited comment on SPARK-16283 at 3/10/17 4:09 AM:
---

[~erlu] I think it's been made clear from the above discussions that Spark's 
result doesn't have to be the same as Hive's result.


was (Author: zenwzh):
[~erlu] I think it's been made clear from above discussions, Spark' result 
doesn't have to be the same as Hive's result.

> Implement percentile_approx SQL function
> 
>
> Key: SPARK-16283
> URL: https://issues.apache.org/jira/browse/SPARK-16283
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Sean Zhong
> Fix For: 2.1.0
>
>







[jira] [Commented] (SPARK-16283) Implement percentile_approx SQL function

2017-03-09 Thread Zhenhua Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15904397#comment-15904397
 ] 

Zhenhua Wang commented on SPARK-16283:
--

[~erlu] I think it's been made clear from the above discussions that Spark's 
result doesn't have to be the same as Hive's result.

> Implement percentile_approx SQL function
> 
>
> Key: SPARK-16283
> URL: https://issues.apache.org/jira/browse/SPARK-16283
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Sean Zhong
> Fix For: 2.1.0
>
>







[jira] [Updated] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore

2017-03-09 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-18112:

Component/s: (was: Spark Submit)

> Spark2.x does not support read data from Hive 2.x metastore
> ---
>
> Key: SPARK-18112
> URL: https://issues.apache.org/jira/browse/SPARK-18112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: KaiXu
>Assignee: Xiao Li
>Priority: Critical
>
> Hive 2.0 was released in February 2016, and Hive 2.0.1 and Hive 2.1.0 have 
> also been out for a long time, but Spark still only supports reading Hive 
> metastore data from Hive 1.2.1 and older versions. Since Hive 2.x has many 
> bug fixes and performance improvements, it is better and urgent to upgrade 
> to support Hive 2.x.
> Failed to load data from a Hive 2.x metastore:
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
> at 
> org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564)






[jira] [Commented] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore

2017-03-09 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15904362#comment-15904362
 ] 

Xiao Li commented on SPARK-18112:
-

Let me resolve this by adding support for the Hive 2.1.0 metastore.
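
For context, a sketch of how a user would point Spark at a Hive 2.1.0 metastore once that support lands, using the existing isolated-client settings; the jar path is a placeholder:

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-2.1-metastore")
  .config("spark.sql.hive.metastore.version", "2.1.0")   // the version this JIRA would add
  .config("spark.sql.hive.metastore.jars", "/path/to/hive-2.1.0/lib/*")
  .enableHiveSupport()
  .getOrCreate()
{code}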

> Spark2.x does not support read data from Hive 2.x metastore
> ---
>
> Key: SPARK-18112
> URL: https://issues.apache.org/jira/browse/SPARK-18112
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit, SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: KaiXu
>Assignee: Xiao Li
>
> Hive 2.0 was released in February 2016, and Hive 2.0.1 and Hive 2.1.0 have 
> also been out for a long time, but Spark still only supports reading Hive 
> metastore data from Hive 1.2.1 and older versions. Since Hive 2.x has many 
> bug fixes and performance improvements, it is better and urgent to upgrade 
> to support Hive 2.x.
> Failed to load data from a Hive 2.x metastore:
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
> at 
> org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564)






[jira] [Updated] (SPARK-18112) Spark2.x does not support reading data from Hive 2.x metastore

2017-03-09 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-18112:

Priority: Critical  (was: Major)

> Spark2.x does not support reading data from Hive 2.x metastore
> ---
>
> Key: SPARK-18112
> URL: https://issues.apache.org/jira/browse/SPARK-18112
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit, SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: KaiXu
>Assignee: Xiao Li
>Priority: Critical
>
> Hive2.0 was released in February 2016, and after that Hive2.0.1 and 
> Hive2.1.0 have also been available for a long time, but Spark still only 
> supports reading Hive metastore data from Hive 1.2.1 and older versions. Since 
> Hive 2.x has many bug fixes and performance improvements, it is better and 
> urgent to upgrade to support Hive 2.x.
> Failed to load data from a Hive 2.x metastore:
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
> at 
> org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-18112) Spark2.x does not support reading data from Hive 2.x metastore

2017-03-09 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reopened SPARK-18112:
-

> Spark2.x does not support reading data from Hive 2.x metastore
> ---
>
> Key: SPARK-18112
> URL: https://issues.apache.org/jira/browse/SPARK-18112
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit, SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: KaiXu
>Assignee: Xiao Li
>
> Hive2.0 was released in February 2016, and after that Hive2.0.1 and 
> Hive2.1.0 have also been available for a long time, but Spark still only 
> supports reading Hive metastore data from Hive 1.2.1 and older versions. Since 
> Hive 2.x has many bug fixes and performance improvements, it is better and 
> urgent to upgrade to support Hive 2.x.
> Failed to load data from a Hive 2.x metastore:
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
> at 
> org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18112) Spark2.x does not support reading data from Hive 2.x metastore

2017-03-09 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-18112:
---

Assignee: Xiao Li

> Spark2.x does not support reading data from Hive 2.x metastore
> ---
>
> Key: SPARK-18112
> URL: https://issues.apache.org/jira/browse/SPARK-18112
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit, SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: KaiXu
>Assignee: Xiao Li
>
> Hive2.0 was released in February 2016, and after that Hive2.0.1 and 
> Hive2.1.0 have also been available for a long time, but Spark still only 
> supports reading Hive metastore data from Hive 1.2.1 and older versions. Since 
> Hive 2.x has many bug fixes and performance improvements, it is better and 
> urgent to upgrade to support Hive 2.x.
> Failed to load data from a Hive 2.x metastore:
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
> at 
> org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19711) Bug in gapply function

2017-03-09 Thread Yeonseop Kim (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15904346#comment-15904346
 ] 

Yeonseop Kim commented on SPARK-19711:
--

It seems nicer to use "stringsAsFactors = FALSE", I think:

schema <- structType(structField("CNPJ", "string"))
result <- gapply(
  ds,
  c("CNPJ", "PID"),
  function(key, x) {
    data.frame(CNPJ = x$CNPJ, stringsAsFactors = FALSE)
  },
  schema)

> Bug in gapply function
> --
>
> Key: SPARK-19711
> URL: https://issues.apache.org/jira/browse/SPARK-19711
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
> Environment: Using Databricks plataform.
>Reporter: Luis Felipe Sant Ana
> Attachments: mv_demand_20170221.csv, resume.R
>
>
> I have a dataframe in SparkR like 
>             CNPJ    PID       DATA N
> 1 10140281000131 100021 2015-04-23 1
> 2 10140281000131 100021 2015-04-27 1
> 3 10140281000131 100021 2015-04-02 1
> 4 10140281000131 100021 2015-11-10 1
> 5 10140281000131 100021 2016-11-14 1
> 6 10140281000131 100021 2015-04-03 1
> And I want to group by the columns CNPJ and PID using the gapply() function, filling 
> in the DATA column with dates. Then I fill in the missing dates with zeros.
> The code:
> schema <- structType(structField("CNPJ", "string"), 
>  structField("PID", "string"),
>  structField("DATA", "date"),
>  structField("N", "double"))
> result <- gapply(
>   ds_filtered,
>   c("CNPJ", "PID"),
>   function(key, x) {
> dts <- data.frame(key, DATA = seq(min(as.Date(x$DATA)), as.Date(e_date), 
> "days"))
> colnames(dts)[c(1, 2)] <- c("CNPJ", "PID")
> 
> y <- data.frame(key, DATA = as.Date(x$DATA), N = x$N)
> colnames(y)[c(1, 2)] <- c("CNPJ", "PID")
> 
> y <- dplyr::left_join(dts, 
>  y,
>  by = c("CNPJ", "PID", "DATA"))
> 
> y[is.na(y$N), 4] <- 0
> 
> data.frame(CNPJ = as.character(y$CNPJ),
>PID = as.character(y$PID),
>DATA = y$DATA,
>N = y$N)
>   }, 
>   schema)
> Error:
> Error in handleErrors(returnStatus, conn) : 
>   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 92.0 failed 4 times, most recent failure: Lost task 0.3 in stage 
> 92.0 (TID 7032, 10.93.243.111, executor 0): org.apache.spark.SparkException: 
> R computation failed with
>  Error in writeType(con, serdeType) : 
>   Unsupported type for serialization factor
> Calls: outputResult ... serializeRow -> writeList -> writeObject -> writeType
> Execution halted
>   at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
>   at 
> org.apache.spark.sql.execution.FlatMapGroupsInRExec$$anonfun$12.apply(objects.scala:404)
>   at 
> org.apache.spark.sql.execution.FlatMapGroupsInRExec$$anonfun$12.apply(objects.scala:386)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)

[jira] [Commented] (SPARK-17322) 'ANY n' clause for SQL queries to increase the ease of use of WHERE clause predicates

2017-03-09 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15904285#comment-15904285
 ] 

Hyukjin Kwon commented on SPARK-17322:
--

Let me leave a link that might be helpful - 
http://stackoverflow.com/questions/8771596/are-sql-any-and-some-keywords-synonyms-in-all-sql-dialects

> 'ANY n' clause for SQL queries to increase the ease of use of WHERE clause 
> predicates
> -
>
> Key: SPARK-17322
> URL: https://issues.apache.org/jira/browse/SPARK-17322
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Suman Somasundar
>Priority: Minor
>
> If the user is interested in getting the results that meet 'any n' criteria 
> out of m where clause predicates (m > n), then the 'any n' clause greatly 
> simplifies writing a SQL query.
> An example is given below:
> select symbol from stocks where (market_cap > 5.7b, analysts_recommend > 10, 
> moving_avg > 49.2, pe_ratio >15.4) ANY 3
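
For reference, the 'ANY n' clause above is only proposed syntax and is not available in Spark SQL today. A minimal sketch of how the same "at least 3 of 4" condition can be written with the current SQL support, assuming a registered {{stocks}} table with the column names from the example and reading "5.7b" as 5.7e9, is to count how many predicates hold per row:

{code}
import org.apache.spark.sql.SparkSession

// Sketch only: the table and column names come from the example above and are
// assumptions. Each boolean predicate is cast to 1/0 and the casts are summed,
// so a row qualifies when at least 3 of the 4 predicates are true.
val spark = SparkSession.builder().appName("any-n-sketch").getOrCreate()

val result = spark.sql(
  """SELECT symbol
    |FROM stocks
    |WHERE CAST(market_cap > 5.7e9 AS INT)
    |    + CAST(analysts_recommend > 10 AS INT)
    |    + CAST(moving_avg > 49.2 AS INT)
    |    + CAST(pe_ratio > 15.4 AS INT) >= 3
    |""".stripMargin)

result.show()
{code}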



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16754) NPE when defining case class and searching Encoder in the same line

2017-03-09 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15904272#comment-15904272
 ] 

Hyukjin Kwon commented on SPARK-16754:
--

Today, I just tested this out of curiosity. It seems to print a different error, 
as below:

{code}
scala> :paste
// Entering paste mode (ctrl-D to finish)

case class TestCaseClass(value: Int)
import spark.implicits._
Seq(TestCaseClass(1)).toDS().collect()


// Exiting paste mode, now interpreting.

java.lang.RuntimeException: Error while decoding: java.lang.NullPointerException
newInstance(class TestCaseClass)
  at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.fromRow(ExpressionEncoder.scala:303)
  at 
org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collectFromPlan$1.apply(Dataset.scala:2807)
  at 
org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collectFromPlan$1.apply(Dataset.scala:2807)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
  at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2807)
  at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2360)
  at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2360)
  at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2791)
  at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2790)
  at org.apache.spark.sql.Dataset.collect(Dataset.scala:2360)
  ... 47 elided
Caused by: java.lang.NullPointerException
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:497)
  at 
org.apache.spark.sql.catalyst.encoders.OuterScopes$$anonfun$getOuterScope$1.apply(OuterScopes.scala:70)
  at 
org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$5.apply(objects.scala:320)
  at 
org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$5.apply(objects.scala:320)
  at scala.Option.map(Option.scala:146)
  at 
org.apache.spark.sql.catalyst.expressions.objects.NewInstance.doGenCode(objects.scala:320)
  at 
org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:104)
  at 
org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:101)
  at scala.Option.getOrElse(Option.scala:121)
  at 
org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:101)
  at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$$anonfun$3.apply(GenerateSafeProjection.scala:145)
  at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$$anonfun$3.apply(GenerateSafeProjection.scala:142)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.immutable.List.map(List.scala:285)
  at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.create(GenerateSafeProjection.scala:142)
  at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.create(GenerateSafeProjection.scala:36)
  at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:887)
  at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.constructProjection$lzycompute(ExpressionEncoder.scala:269)
  at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.constructProjection(ExpressionEncoder.scala:269)
  at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.fromRow(ExpressionEncoder.scala:300)
  ... 62 more
{code}
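
For anyone hitting this, a minimal sketch of the usual workaround (assuming the standard spark-shell {{spark}} session) is to define the case class in one line or {{:paste}} block and build the Dataset in a later one, so the encoder is only resolved after the class's outer scope has been fully initialized:

{code}
// First line / paste block: only the class definition.
case class TestCaseClass(value: Int)

// Second line / paste block: request the encoder afterwards.
import spark.implicits._
val ds = Seq(TestCaseClass(1), TestCaseClass(2)).toDS()
ds.collect().foreach(println)  // TestCaseClass(1), TestCaseClass(2)
{code}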

> NPE when defining case class and searching Encoder in the same line
> ---
>
> Key: SPARK-16754
> URL: https://issues.apache.org/jira/browse/SPARK-16754
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Spark Shell for Scala 2.11
>Reporter: Shixiong Zhu
>Priority: Minor
>
> Reproducer:
> 

[jira] [Resolved] (SPARK-16255) Spark2.0 doesn't support the following SQL statement:"insert into directory "/u_qa_user/hive_testdata/test1/t1" select * from d_test_tpc_2g_txt.auction" while Hive supp

2017-03-09 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-16255.
--
Resolution: Duplicate

I think this is a duplicate of SPARK-4131. Please reopen this if I was mistaken.

> Spark2.0 doesn't support the following SQL statement:"insert into directory 
> "/u_qa_user/hive_testdata/test1/t1" select * from d_test_tpc_2g_txt.auction" 
> while Hive supports
> 
>
> Key: SPARK-16255
> URL: https://issues.apache.org/jira/browse/SPARK-16255
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: marymwu
>Priority: Minor
>
> Spark2.0 doesn't support the following SQL statement:"insert into directory 
> "/u_qa_user/hive_testdata/test1/t1" select * from d_test_tpc_2g_txt.auction" 
> while Hive supports



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19886) reportDataLoss cause != null check is wrong for Structured Streaming KafkaSource

2017-03-09 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-19886.
--
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.1

> reportDataLoss cause != null check is wrong for Structured Streaming 
> KafkaSource
> 
>
> Key: SPARK-19886
> URL: https://issues.apache.org/jira/browse/SPARK-19886
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
> Fix For: 2.1.1, 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19507) pyspark.sql.types._verify_type() exceptions too broad to debug collections or nested data

2017-03-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19507:


Assignee: Apache Spark

> pyspark.sql.types._verify_type() exceptions too broad to debug collections or 
> nested data
> -
>
> Key: SPARK-19507
> URL: https://issues.apache.org/jira/browse/SPARK-19507
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.1.0
> Environment: macOS Sierra 10.12.3
> Spark 2.1.0, installed via Homebrew
>Reporter: David Gingrich
>Assignee: Apache Spark
>Priority: Trivial
>
> The private function pyspark.sql.types._verify_type() recursively checks an 
> object against a datatype, raising an exception if the object does not 
> satisfy the type.  These messages are not specific enough to debug a data 
> error in a collection or nested data, for instance:
> {quote}
> >>> import pyspark.sql.types as typ
> >>> schema = typ.StructType([typ.StructField('nest1', 
> >>> typ.MapType(typ.StringType(), typ.ArrayType(typ.FloatType())))])
> >>> typ._verify_type({'nest1': {'nest2': [1]}}, schema)
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/Users/david/src/3p/spark/python/pyspark/sql/types.py", line 1355, in 
> _verify_type
> _verify_type(obj.get(f.name), f.dataType, f.nullable, name=new_name)
>   File "/Users/david/src/3p/spark/python/pyspark/sql/types.py", line 1349, in 
> _verify_type
> _verify_type(v, dataType.valueType, dataType.valueContainsNull, 
> name=new_name)
>   File "/Users/david/src/3p/spark/python/pyspark/sql/types.py", line 1342, in 
> _verify_type
> _verify_type(i, dataType.elementType, dataType.containsNull, 
> name=new_name)
>   File "/Users/david/src/3p/spark/python/pyspark/sql/types.py", line 1325, in 
> _verify_type
> % (name, dataType, obj, type(obj)))
> TypeError: FloatType can not accept object 1 in type 
> {quote}
> Passing and printing a field name would make debugging easier.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19507) pyspark.sql.types._verify_type() exceptions too broad to debug collections or nested data

2017-03-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19507:


Assignee: (was: Apache Spark)

> pyspark.sql.types._verify_type() exceptions too broad to debug collections or 
> nested data
> -
>
> Key: SPARK-19507
> URL: https://issues.apache.org/jira/browse/SPARK-19507
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.1.0
> Environment: macOS Sierra 10.12.3
> Spark 2.1.0, installed via Homebrew
>Reporter: David Gingrich
>Priority: Trivial
>
> The private function pyspark.sql.types._verify_type() recursively checks an 
> object against a datatype, raising an exception if the object does not 
> satisfy the type.  These messages are not specific enough to debug a data 
> error in a collection or nested data, for instance:
> {quote}
> >>> import pyspark.sql.types as typ
> >>> schema = typ.StructType([typ.StructField('nest1', 
> >>> typ.MapType(typ.StringType(), typ.ArrayType(typ.FloatType())))])
> >>> typ._verify_type({'nest1': {'nest2': [1]}}, schema)
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/Users/david/src/3p/spark/python/pyspark/sql/types.py", line 1355, in 
> _verify_type
> _verify_type(obj.get(f.name), f.dataType, f.nullable, name=new_name)
>   File "/Users/david/src/3p/spark/python/pyspark/sql/types.py", line 1349, in 
> _verify_type
> _verify_type(v, dataType.valueType, dataType.valueContainsNull, 
> name=new_name)
>   File "/Users/david/src/3p/spark/python/pyspark/sql/types.py", line 1342, in 
> _verify_type
> _verify_type(i, dataType.elementType, dataType.containsNull, 
> name=new_name)
>   File "/Users/david/src/3p/spark/python/pyspark/sql/types.py", line 1325, in 
> _verify_type
> % (name, dataType, obj, type(obj)))
> TypeError: FloatType can not accept object 1 in type 
> {quote}
> Passing and printing a field name would make debugging easier.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-19507) pyspark.sql.types._verify_type() exceptions too broad to debug collections or nested data

2017-03-09 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-19507:
--

I am reopening this per 
https://github.com/apache/spark/pull/17213#issuecomment-285530248 and resolve 
SPARK-19871 as a duplicate.

> pyspark.sql.types._verify_type() exceptions too broad to debug collections or 
> nested data
> -
>
> Key: SPARK-19507
> URL: https://issues.apache.org/jira/browse/SPARK-19507
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.1.0
> Environment: macOS Sierra 10.12.3
> Spark 2.1.0, installed via Homebrew
>Reporter: David Gingrich
>Priority: Trivial
>
> The private function pyspark.sql.types._verify_type() recursively checks an 
> object against a datatype, raising an exception if the object does not 
> satisfy the type.  These messages are not specific enough to debug a data 
> error in a collection or nested data, for instance:
> {quote}
> >>> import pyspark.sql.types as typ
> >>> schema = typ.StructType([typ.StructField('nest1', 
> >>> typ.MapType(typ.StringType(), typ.ArrayType(typ.FloatType())))])
> >>> typ._verify_type({'nest1': {'nest2': [1]}}, schema)
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/Users/david/src/3p/spark/python/pyspark/sql/types.py", line 1355, in 
> _verify_type
> _verify_type(obj.get(f.name), f.dataType, f.nullable, name=new_name)
>   File "/Users/david/src/3p/spark/python/pyspark/sql/types.py", line 1349, in 
> _verify_type
> _verify_type(v, dataType.valueType, dataType.valueContainsNull, 
> name=new_name)
>   File "/Users/david/src/3p/spark/python/pyspark/sql/types.py", line 1342, in 
> _verify_type
> _verify_type(i, dataType.elementType, dataType.containsNull, 
> name=new_name)
>   File "/Users/david/src/3p/spark/python/pyspark/sql/types.py", line 1325, in 
> _verify_type
> % (name, dataType, obj, type(obj)))
> TypeError: FloatType can not accept object 1 in type 
> {quote}
> Passing and printing a field name would make debugging easier.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19871) Improve error message in verify_type to indicate which field the error is for

2017-03-09 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-19871.
--
Resolution: Duplicate

> Improve error message in verify_type to indicate which field the error is for
> -
>
> Key: SPARK-19871
> URL: https://issues.apache.org/jira/browse/SPARK-19871
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Len Frodgers
>
> The error message for _verify_type is too vague. It should specify for which 
> field the type verification failed.
> e.g. error: "This field is not nullable, but got None" – but what is the 
> field?!
> In a dataframe with many non-nullable fields, it is a nightmare trying to 
> hunt down which field is null, since one has to resort to basic (and very 
> slow) trial and error.
> I have happily created a PR to fix this.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19891) Await Batch Lock not signaled on stream execution exit

2017-03-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19891:


Assignee: Apache Spark

> Await Batch Lock not signaled on stream execution exit
> --
>
> Key: SPARK-19891
> URL: https://issues.apache.org/jira/browse/SPARK-19891
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Tyson Condie
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19891) Await Batch Lock not signaled on stream execution exit

2017-03-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19891:


Assignee: (was: Apache Spark)

> Await Batch Lock not signaled on stream execution exit
> --
>
> Key: SPARK-19891
> URL: https://issues.apache.org/jira/browse/SPARK-19891
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Tyson Condie
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19891) Await Batch Lock not signaled on stream execution exit

2017-03-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15904116#comment-15904116
 ] 

Apache Spark commented on SPARK-19891:
--

User 'tcondie' has created a pull request for this issue:
https://github.com/apache/spark/pull/17231

> Await Batch Lock not signaled on stream execution exit
> --
>
> Key: SPARK-19891
> URL: https://issues.apache.org/jira/browse/SPARK-19891
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Tyson Condie
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19891) Await Batch Lock not signaled on stream execution exit

2017-03-09 Thread Tyson Condie (JIRA)
Tyson Condie created SPARK-19891:


 Summary: Await Batch Lock not signaled on stream execution exit
 Key: SPARK-19891
 URL: https://issues.apache.org/jira/browse/SPARK-19891
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.1.0
Reporter: Tyson Condie






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19890) Make MetastoreRelation statistics estimation more accurate

2017-03-09 Thread Zhan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhan Zhang updated SPARK-19890:
---
Description: 
Currently the MetastoreRelation statistics are retrieved during the analysis phase, 
and the size is computed at table scope. But for a partitioned table, these 
statistics are not useful, as the table size may be 100x+ larger than the input 
partition size. As a result, the join optimization techniques are not applicable.

It would be great if we could postpone gathering the statistics to the optimization 
phase, after partition information is available but before the physical plan 
generation phase, so that JoinSelection can choose a better join method (broadcast, 
shuffled hash join, or sort-merge join).

Although the MetastoreRelation is not associated with partitions, we can get the 
partition info for the table through PhysicalOperation. Multiple plans can use the 
same MetastoreRelation, but the estimate is still much better than the table size. 
This way, retrieving statistics is straightforward.

Another possible way is to have another data structure associating the 
metastore relation and partitions with the plan to get the most accurate estimate.

  was:
Currently the MetastoreRelation statistics are retrieved during the analysis phase, 
and the size is computed at table scope. But for a partitioned table, these 
statistics are not useful, as the table size may be 100x+ larger than the input 
partition size. As a result, the join optimization techniques are not applicable.

It would be great if we could postpone gathering the statistics to the optimization 
phase, after partition information is available but before the physical plan 
generation phase, so that JoinSelection can choose a better join method (broadcast, 
shuffled hash join, or sort-merge join).

Although the MetastoreRelation is not associated with partitions, we can get the 
partition info for the table through PhysicalOperation. Although multiple plans can 
use the same MetastoreRelation, the estimate is still much better than the table 
size. This way, retrieving statistics is straightforward.

Another possible way is to have another data structure associating the 
metastore relation and partitions with the plan to get the most accurate estimate.


> Make MetastoreRelation statistics estimation more accurate
> 
>
> Key: SPARK-19890
> URL: https://issues.apache.org/jira/browse/SPARK-19890
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Zhan Zhang
>Priority: Minor
>
> Currently the MetastoreRelation statistics are retrieved during the analysis 
> phase, and the size is computed at table scope. But for a partitioned table, 
> these statistics are not useful, as the table size may be 100x+ larger than the 
> input partition size. As a result, the join optimization techniques are not 
> applicable.
> It would be great if we could postpone gathering the statistics to the 
> optimization phase, after partition information is available but before the 
> physical plan generation phase, so that JoinSelection can choose a better join 
> method (broadcast, shuffled hash join, or sort-merge join).
> Although the MetastoreRelation is not associated with partitions, we can get the 
> partition info for the table through PhysicalOperation. Multiple plans can use 
> the same MetastoreRelation, but the estimate is still much better than the table 
> size. This way, retrieving statistics is straightforward.
> Another possible way is to have another data structure associating the 
> metastore relation and partitions with the plan to get the most accurate 
> estimate.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19353) Support binary I/O in PipedRDD

2017-03-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15904058#comment-15904058
 ] 

Apache Spark commented on SPARK-19353:
--

User 'superbobry' has created a pull request for this issue:
https://github.com/apache/spark/pull/17230

> Support binary I/O in PipedRDD
> --
>
> Key: SPARK-19353
> URL: https://issues.apache.org/jira/browse/SPARK-19353
> Project: Spark
>  Issue Type: Improvement
>Reporter: Sergei Lebedev
>Priority: Minor
>
> The current design of RDD.pipe is very restrictive. 
> It is line-based, each element of the input RDD [gets 
> serialized|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PipedRDD.scala#L143]
>  into one or more lines. Similarly for the output of the child process, one 
> line corresponds to a single element of the output RDD. 
> It allows customizing the output format via {{printRDDElement}}, but not the 
> input format.
> It is not designed for extensibility. The only way to get a "BinaryPipedRDD" 
> is to copy/paste most of it and change the relevant parts.
> These limitations have been discussed on 
> [SO|http://stackoverflow.com/questions/27986830/how-to-pipe-binary-data-in-apache-spark]
>  and the mailing list, but alas no issue has been created.
> A possible solution to at least the first two limitations is to factor out 
> the format into a separate object (or objects). For instance, {{InputWriter}} 
> and {{OutputReader}}, following the Hadoop streaming API. 
> {code}
> trait InputWriter[T] {
> def write(os: OutputStream, elem: T)
> }
> trait OutputReader[T] {
> def read(is: InputStream): T
> }
> {code}
> The default configuration would be to write and read in line-based format, 
> but the users will also be able to selectively swap those to the appropriate 
> implementations.
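
As a side note, {{InputWriter}} and {{OutputReader}} above are only proposed traits and do not exist in Spark; a minimal sketch of what default line-based implementations could look like under that proposal (the class names and details here are assumptions, not the actual pull request) is:

{code}
import java.io.{BufferedReader, InputStream, InputStreamReader, OutputStream}
import java.nio.charset.StandardCharsets

// Proposed (hypothetical) extension points, copied from the description above.
trait InputWriter[T] {
  def write(os: OutputStream, elem: T): Unit
}

trait OutputReader[T] {
  def read(is: InputStream): T
}

// Default behaviour: one RDD element per line, mirroring today's PipedRDD.
class LineInputWriter[T] extends InputWriter[T] {
  override def write(os: OutputStream, elem: T): Unit =
    os.write((elem.toString + "\n").getBytes(StandardCharsets.UTF_8))
}

// Reads one line per call; a real version would likely expose an iterator instead.
class LineOutputReader extends OutputReader[String] {
  private var reader: BufferedReader = _
  override def read(is: InputStream): String = {
    if (reader == null) {
      reader = new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8))
    }
    reader.readLine()
  }
}
{code}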



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19890) Make MetastoreRelation statistics estimation more accurate

2017-03-09 Thread Zhan Zhang (JIRA)
Zhan Zhang created SPARK-19890:
--

 Summary: Make MetastoreRelation statistics estimation more 
accurate
 Key: SPARK-19890
 URL: https://issues.apache.org/jira/browse/SPARK-19890
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Zhan Zhang
Priority: Minor


Currently the MetastoreRelation statistics are retrieved during the analysis phase, 
and the size is computed at table scope. But for a partitioned table, these 
statistics are not useful, as the table size may be 100x+ larger than the input 
partition size. As a result, the join optimization techniques are not applicable.

It would be great if we could postpone gathering the statistics to the optimization 
phase, after partition information is available but before the physical plan 
generation phase, so that JoinSelection can choose a better join method (broadcast, 
shuffled hash join, or sort-merge join).

Although the MetastoreRelation is not associated with partitions, we can get the 
partition info for the table through PhysicalOperation. Although multiple plans can 
use the same MetastoreRelation, the estimate is still much better than the table 
size. This way, retrieving statistics is straightforward.

Another possible way is to have another data structure associating the 
metastore relation and partitions with the plan to get the most accurate estimate.
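
As a user-level illustration (not the proposed fix) of why partition-aware size estimates matter: JoinSelection only picks a broadcast join when the estimated size of one side is below spark.sql.autoBroadcastJoinThreshold, so an inflated table-level estimate for a partitioned table can push the planner to a shuffle-based join. A minimal sketch, with made-up table names, of working around that today with a broadcast hint:

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("join-size-sketch").enableHiveSupport().getOrCreate()

// Raising the threshold only helps if the size estimate itself is accurate.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024)

val facts    = spark.table("events")       // large partitioned table (hypothetical name)
val smallDim = spark.table("dim_country")  // genuinely small dimension table (hypothetical name)

// The hint tells the planner the dimension side is small, regardless of the
// coarse table-level statistics coming from the metastore.
val joined = facts.join(broadcast(smallDim), Seq("country_code"))
joined.explain()  // expected to show a broadcast join on the hinted side
{code}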



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19889) Make TaskContext synchronized

2017-03-09 Thread Herman van Hovell (JIRA)
Herman van Hovell created SPARK-19889:
-

 Summary: Make TaskContext synchronized
 Key: SPARK-19889
 URL: https://issues.apache.org/jira/browse/SPARK-19889
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.1.0
Reporter: Herman van Hovell


In some cases you want to fork off some part of a task to a different thread. In 
these cases it would be very useful if {{TaskContext}} methods were synchronized 
and if we could synchronize on the task context.
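
For the record, a minimal sketch of the scenario described above (today's workaround, not a change to Spark): the {{TaskContext}} is captured on the task thread via {{TaskContext.get()}} and handed to the forked thread explicitly, with the caller doing its own locking because the context itself currently offers no synchronization guarantees:

{code}
import org.apache.spark.{SparkContext, TaskContext}

def processPartitions(sc: SparkContext): Array[Int] = {
  sc.parallelize(1 to 100, 4).mapPartitions { iter =>
    // TaskContext.get() is thread-local, so capture it on the task thread...
    val ctx = TaskContext.get()
    val data = iter.toArray

    // ...and pass the captured reference to the helper thread explicitly.
    val worker = new Thread(new Runnable {
      override def run(): Unit = ctx.synchronized {
        if (!ctx.isInterrupted()) {
          println(s"partition ${ctx.partitionId()} processed on helper thread")
        }
      }
    })
    worker.start()
    worker.join()

    Iterator.single(data.sum)
  }.collect()
}
{code}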



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19611) Spark 2.1.0 breaks some Hive tables backed by case-sensitive data files

2017-03-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15904033#comment-15904033
 ] 

Apache Spark commented on SPARK-19611:
--

User 'budde' has created a pull request for this issue:
https://github.com/apache/spark/pull/17229

> Spark 2.1.0 breaks some Hive tables backed by case-sensitive data files
> ---
>
> Key: SPARK-19611
> URL: https://issues.apache.org/jira/browse/SPARK-19611
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Adam Budde
>Assignee: Adam Budde
> Fix For: 2.2.0
>
>
> This issue replaces 
> [SPARK-19455|https://issues.apache.org/jira/browse/SPARK-19455] and [PR 
> #16797|https://github.com/apache/spark/pull/16797]
> [SPARK-16980|https://issues.apache.org/jira/browse/SPARK-16980] removed the 
> schema inference from the HiveMetastoreCatalog class when converting a 
> MetastoreRelation to a LogicalRelation (HadoopFsRelation, in this case) in 
> favor of simply using the schema returned by the metastore. This results in 
> an optimization as the underlying file status no longer needs to be resolved 
> until after the partition pruning step, reducing the number of files to be 
> touched significantly in some cases. The downside is that the data schema 
> used may no longer match the underlying file schema for case-sensitive 
> formats such as Parquet.
> [SPARK-17183|https://issues.apache.org/jira/browse/SPARK-17183] added support 
> for saving a case-sensitive copy of the schema in the metastore table 
> properties, which HiveExternalCatalog will read in as the table's schema if 
> it is present. If it is not present, it will fall back to the 
> case-insensitive metastore schema.
> Unfortunately, this silently breaks queries over tables where the underlying 
> data fields are case-sensitive but a case-sensitive schema wasn't written to 
> the table properties by Spark. This situation will occur for any Hive table 
> that wasn't created by Spark or that was created prior to Spark 2.1.0. If a 
> user attempts to run a query over such a table containing a case-sensitive 
> field name in the query projection or in the query filter, the query will 
> return 0 results in every case.
> The change we are proposing is to bring back the schema inference that was 
> used prior to Spark 2.1.0 if a case-sensitive schema can't be read from the 
> table properties.
> - INFER_AND_SAVE: Infer a schema from the data files if no case-sensitive 
> schema can be read from the table properties. Attempt to save the inferred 
> schema in the table properties to avoid future inference.
> - INFER_ONLY: Infer the schema if no case-sensitive schema can be read but 
> don't attempt to save it.
> - NEVER_INFER: Fall back to using the case-insensitive schema returned by the 
> Hive Metastore. Useful if the user knows that none of the underlying data is 
> case-sensitive.
> See the discussion on [PR #16797|https://github.com/apache/spark/pull/16797] 
> for more discussion around this issue and the proposed solution.
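
For anyone trying this out once the fix lands, a minimal sketch, assuming the behaviour is exposed through a SQL configuration named spark.sql.hive.caseSensitiveInferenceMode with the three values listed above (the exact key should be confirmed against the merged PR; the table name below is illustrative only):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("case-sensitive-inference-sketch")
  .enableHiveSupport()
  .getOrCreate()

// Infer the case-sensitive schema from the underlying Parquet files and save it
// back into the table properties so inference only happens once per table.
spark.conf.set("spark.sql.hive.caseSensitiveInferenceMode", "INFER_AND_SAVE")

// Hypothetical pre-2.1.0 Hive table with mixed-case Parquet field names; without
// inference this query could silently return 0 rows.
spark.sql("SELECT mixedCaseColumn FROM legacy_hive_table WHERE mixedCaseColumn > 0").show()
{code}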



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19888) Seeing offsets not resetting even when reset policy is configured explicitly

2017-03-09 Thread Justin Miller (JIRA)
Justin Miller created SPARK-19888:
-

 Summary: Seeing offsets not resetting even when reset policy is 
configured explicitly
 Key: SPARK-19888
 URL: https://issues.apache.org/jira/browse/SPARK-19888
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.0
Reporter: Justin Miller


I was told to post this in a Spark ticket from KAFKA-4396:

I've been seeing a curious error with kafka 0.10 (spark 2.11); these may be two 
separate errors, I'm not sure. What's puzzling is that I'm setting 
auto.offset.reset to latest and it's still throwing an 
OffsetOutOfRangeException, behavior that's contrary to the code. Please help! :)

{code}
val kafkaParams = Map[String, Object](
  "group.id" -> consumerGroup,
  "bootstrap.servers" -> bootstrapServers,
  "key.deserializer" -> classOf[ByteArrayDeserializer],
  "value.deserializer" -> classOf[MessageRowDeserializer],
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean),
  "max.poll.records" -> persisterConfig.maxPollRecords,
  "request.timeout.ms" -> persisterConfig.requestTimeoutMs,
  "session.timeout.ms" -> persisterConfig.sessionTimeoutMs,
  "heartbeat.interval.ms" -> persisterConfig.heartbeatIntervalMs,
  "connections.max.idle.ms"-> persisterConfig.connectionsMaxIdleMs
)
{code}
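
For context, a minimal sketch of how parameters like the map above are usually wired into the 0-10 direct stream (the topic name and deserializer type parameters are placeholders and must match the configured deserializers). As far as I understand, the 0-10 integration adjusts some consumer settings for the executor-side consumers, which may be relevant to the reset behaviour seen here:

{code}
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

def buildStream(ssc: StreamingContext, kafkaParams: Map[String, Object]): Unit = {
  val topics = Seq("my-topic")  // placeholder topic name

  val stream = KafkaUtils.createDirectStream[Array[Byte], Array[Byte]](
    ssc,
    LocationStrategies.PreferConsistent,
    ConsumerStrategies.Subscribe[Array[Byte], Array[Byte]](topics, kafkaParams)
  )

  // Downstream processing goes here; with enable.auto.commit=false, offsets are
  // committed explicitly by the application.
  stream.foreachRDD(rdd => println(s"batch size: ${rdd.count()}"))
}
{code}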

{code}
16/11/09 23:10:17 INFO BlockManagerInfo: Added broadcast_154_piece0 in memory 
on xyz (size: 146.3 KB, free: 8.4 GB)
16/11/09 23:10:23 WARN TaskSetManager: Lost task 15.0 in stage 151.0 (TID 
38837, xyz): org.apache.kafka.clients.consumer.OffsetOutOfRangeException: 
Offsets out of range with no configured reset policy for partitions: 
{topic=231884473}
at 
org.apache.kafka.clients.consumer.internals.Fetcher.parseFetchedData(Fetcher.java:588)
at 
org.apache.kafka.clients.consumer.internals.Fetcher.fetchedRecords(Fetcher.java:354)
at 
org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1000)
at 
org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:938)
at 
org.apache.spark.streaming.kafka010.CachedKafkaConsumer.poll(CachedKafkaConsumer.scala:99)
at 
org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:70)
at 
org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227)
at 
org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438)
at 
org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:397)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

16/11/09 23:10:29 INFO TaskSetManager: Finished task 10.0 in stage 154.0 (TID 
39388) in 12043 ms on xyz (1/16)
16/11/09 23:10:31 INFO TaskSetManager: Finished task 0.0 in stage 154.0 (TID 
39375) in 13444 ms on xyz (2/16)
16/11/09 23:10:44 WARN TaskSetManager: Lost task 1.0 in stage 151.0 (TID 38843, 
xyz): java.util.ConcurrentModificationException: KafkaConsumer is not safe for 
multi-threaded access
at 
org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1431)
at 
org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:929)
at 
org.apache.spark.streaming.kafka010.CachedKafkaConsumer.poll(CachedKafkaConsumer.scala:99)
at 
org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:73)
at 
org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227)
at 
org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193)
at 

[jira] [Assigned] (SPARK-19611) Spark 2.1.0 breaks some Hive tables backed by case-sensitive data files

2017-03-09 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-19611:
---

Assignee: Adam Budde

> Spark 2.1.0 breaks some Hive tables backed by case-sensitive data files
> ---
>
> Key: SPARK-19611
> URL: https://issues.apache.org/jira/browse/SPARK-19611
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Adam Budde
>Assignee: Adam Budde
> Fix For: 2.2.0
>
>
> This issue replaces 
> [SPARK-19455|https://issues.apache.org/jira/browse/SPARK-19455] and [PR 
> #16797|https://github.com/apache/spark/pull/16797]
> [SPARK-16980|https://issues.apache.org/jira/browse/SPARK-16980] removed the 
> schema inference from the HiveMetastoreCatalog class when converting a 
> MetastoreRelation to a LogicalRelation (HadoopFsRelation, in this case) in 
> favor of simply using the schema returned by the metastore. This results in 
> an optimization as the underlying file status no longer needs to be resolved 
> until after the partition pruning step, reducing the number of files to be 
> touched significantly in some cases. The downside is that the data schema 
> used may no longer match the underlying file schema for case-sensitive 
> formats such as Parquet.
> [SPARK-17183|https://issues.apache.org/jira/browse/SPARK-17183] added support 
> for saving a case-sensitive copy of the schema in the metastore table 
> properties, which HiveExternalCatalog will read in as the table's schema if 
> it is present. If it is not present, it will fall back to the 
> case-insensitive metastore schema.
> Unfortunately, this silently breaks queries over tables where the underlying 
> data fields are case-sensitive but a case-sensitive schema wasn't written to 
> the table properties by Spark. This situation will occur for any Hive table 
> that wasn't created by Spark or that was created prior to Spark 2.1.0. If a 
> user attempts to run a query over such a table containing a case-sensitive 
> field name in the query projection or in the query filter, the query will 
> return 0 results in every case.
> The change we are proposing is to bring back the schema inference that was 
> used prior to Spark 2.1.0 if a case-sensitive schema can't be read from the 
> table properties.
> - INFER_AND_SAVE: Infer a schema from the data files if no case-sensitive 
> schema can be read from the table properties. Attempt to save the inferred 
> schema in the table properties to avoid future inference.
> - INFER_ONLY: Infer the schema if no case-sensitive schema can be read but 
> don't attempt to save it.
> - NEVER_INFER: Fall back to using the case-insensitive schema returned by the 
> Hive Metastore. Useful if the user knows that none of the underlying data is 
> case-sensitive.
> See the discussion on [PR #16797|https://github.com/apache/spark/pull/16797] 
> for more discussion around this issue and the proposed solution.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19611) Spark 2.1.0 breaks some Hive tables backed by case-sensitive data files

2017-03-09 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-19611.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16944
[https://github.com/apache/spark/pull/16944]

> Spark 2.1.0 breaks some Hive tables backed by case-sensitive data files
> ---
>
> Key: SPARK-19611
> URL: https://issues.apache.org/jira/browse/SPARK-19611
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Adam Budde
> Fix For: 2.2.0
>
>
> This issue replaces 
> [SPARK-19455|https://issues.apache.org/jira/browse/SPARK-19455] and [PR 
> #16797|https://github.com/apache/spark/pull/16797]
> [SPARK-16980|https://issues.apache.org/jira/browse/SPARK-16980] removed the 
> schema inference from the HiveMetastoreCatalog class when converting a 
> MetastoreRelation to a LogicalRelation (HadoopFsRelation, in this case) in 
> favor of simply using the schema returned by the metastore. This results in 
> an optimization as the underlying file status no longer needs to be resolved 
> until after the partition pruning step, reducing the number of files to be 
> touched significantly in some cases. The downside is that the data schema 
> used may no longer match the underlying file schema for case-sensitive 
> formats such as Parquet.
> [SPARK-17183|https://issues.apache.org/jira/browse/SPARK-17183] added support 
> for saving a case-sensitive copy of the schema in the metastore table 
> properties, which HiveExternalCatalog will read in as the table's schema if 
> it is present. If it is not present, it will fall back to the 
> case-insensitive metastore schema.
> Unfortunately, this silently breaks queries over tables where the underlying 
> data fields are case-sensitive but a case-sensitive schema wasn't written to 
> the table properties by Spark. This situation will occur for any Hive table 
> that wasn't created by Spark or that was created prior to Spark 2.1.0. If a 
> user attempts to run a query over such a table containing a case-sensitive 
> field name in the query projection or in the query filter, the query will 
> return 0 results in every case.
> The change we are proposing is to bring back the schema inference that was 
> used prior to Spark 2.1.0 if a case-sensitive schema can't be read from the 
> table properties.
> - INFER_AND_SAVE: Infer a schema from the data files if no case-sensitive 
> schema can be read from the table properties. Attempt to save the inferred 
> schema in the table properties to avoid future inference.
> - INFER_ONLY: Infer the schema if no case-sensitive schema can be read but 
> don't attempt to save it.
> - NEVER_INFER: Fall back to using the case-insensitive schema returned by the 
> Hive Metastore. Useful if the user knows that none of the underlying data is 
> case-sensitive.
> See the discussion on [PR #16797|https://github.com/apache/spark/pull/16797] 
> for more discussion around this issue and the proposed solution.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19364) Stream Blocks in Storage Persist Forever when Kinesis Checkpoints are enabled and an exception is thrown

2017-03-09 Thread Andrew Milkowski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15903778#comment-15903778
 ] 

Andrew Milkowski edited comment on SPARK-19364 at 3/9/17 8:17 PM:
--

thanks @Takeshi Yamamuro , will try to see if I can make this error consistent 
(we see it in prod non stop and it is consistent) I will see if I can throw the 
exception from in the kinesis receiver (java lib) and see stream blocks grow in 
spark, will provide line change to re-produce problem.. it is tied to kinesis 
java lib faulting on checkpoint throwing exception and spark persisting stream 
blocks and never releasing em from memory till eventual OME


was (Author: amilkowski):
thanks @Takeshi Yamamuro , will try to see if I can make this error consistent 
(we see it in prod non stop and it is consistent) I will see if I can throw the 
exception from in the kinesis receiver (java lib) and see stream blocks grow in 
spark, will provide line change to re-produce problem.. it is tired to kinesis 
java lib faulting on checkpoint throwing exception and spark persisting stream 
blocks and never releasing em from memory till eventual OME

> Stream Blocks in Storage Persist Forever when Kinesis Checkpoints are 
> enabled and an exception is thrown 
> --
>
> Key: SPARK-19364
> URL: https://issues.apache.org/jira/browse/SPARK-19364
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
> Environment: ubuntu unix
> spark 2.0.2
> application is java
>Reporter: Andrew Milkowski
>Priority: Blocker
>
> -- update --- we found that the below situation occurs when we encounter
> "com.amazonaws.services.kinesis.clientlibrary.exceptions.ShutdownException: 
> Can't update checkpoint - instance doesn't hold the lease for this shard"
> https://github.com/awslabs/amazon-kinesis-client/issues/108
> we use an s3 directory (and dynamodb) to store checkpoints, but even if this 
> occurs, blocks should not get stuck; they should continue to be evicted gracefully 
> from memory. Obviously the kinesis library race condition is a problem unto itself...
> -- exception leading to a block not being freed up --
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/mnt/yarn/usercache/hadoop/filecache/24/__spark_libs__7928020266533182031.zip/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> 17/01/26 13:52:00 ERROR KinesisRecordProcessor: ShutdownException:  Caught 
> shutdown exception, skipping checkpoint.
> com.amazonaws.services.kinesis.clientlibrary.exceptions.ShutdownException: 
> Can't update checkpoint - instance doesn't hold the lease for this shard
>   at 
> com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibLeaseCoordinator.setCheckpoint(KinesisClientLibLeaseCoordinator.java:120)
>   at 
> com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.advancePosition(RecordProcessorCheckpointer.java:216)
>   at 
> com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.checkpoint(RecordProcessorCheckpointer.java:137)
>   at 
> com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.checkpoint(RecordProcessorCheckpointer.java:103)
>   at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1$$anonfun$apply$1.apply$mcV$sp(KinesisCheckpointer.scala:81)
>   at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1$$anonfun$apply$1.apply(KinesisCheckpointer.scala:81)
>   at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1$$anonfun$apply$1.apply(KinesisCheckpointer.scala:81)
>   at scala.util.Try$.apply(Try.scala:192)
>   at 
> org.apache.spark.streaming.kinesis.KinesisRecordProcessor$.retryRandom(KinesisRecordProcessor.scala:144)
>   at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1.apply(KinesisCheckpointer.scala:81)
>   at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1.apply(KinesisCheckpointer.scala:75)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer.checkpoint(KinesisCheckpointer.scala:75)
>   at 
> 

[jira] [Commented] (SPARK-19364) Stream Blocks in Storage Persists Forever when Kinesis Checkpoints are enabled and an exception is thrown

2017-03-09 Thread Andrew Milkowski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15903778#comment-15903778
 ] 

Andrew Milkowski commented on SPARK-19364:
--

thanks @Takeshi Yamamuro, I will try to reproduce this error consistently (we see 
it in prod non-stop, and it is consistent). I will see if I can throw the 
exception from inside the Kinesis receiver (Java lib) and watch stream blocks 
grow in Spark, and I will provide the line change needed to reproduce the 
problem. It is tied to the Kinesis Java lib faulting on checkpoint, throwing an 
exception, and Spark persisting stream blocks and never releasing them from 
memory until an eventual OOM.

> Stream Blocks in Storage Persists Forever when Kinesis Checkpoints are 
> enabled and an exception is thrown 
> --
>
> Key: SPARK-19364
> URL: https://issues.apache.org/jira/browse/SPARK-19364
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
> Environment: ubuntu unix
> spark 2.0.2
> application is java
>Reporter: Andrew Milkowski
>Priority: Blocker
>
> -- update --- we found that the situation below occurs when we encounter
> "com.amazonaws.services.kinesis.clientlibrary.exceptions.ShutdownException: 
> Can't update checkpoint - instance doesn't hold the lease for this shard"
> https://github.com/awslabs/amazon-kinesis-client/issues/108
> We use an S3 directory (and DynamoDB) to store checkpoints, but if this occurs, 
> blocks should not get stuck; they should continue to be evicted gracefully from 
> memory. Obviously the Kinesis library race condition is a problem unto itself...
> -- exception leading to a block not being freed up --
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/mnt/yarn/usercache/hadoop/filecache/24/__spark_libs__7928020266533182031.zip/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> 17/01/26 13:52:00 ERROR KinesisRecordProcessor: ShutdownException:  Caught 
> shutdown exception, skipping checkpoint.
> com.amazonaws.services.kinesis.clientlibrary.exceptions.ShutdownException: 
> Can't update checkpoint - instance doesn't hold the lease for this shard
>   at 
> com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibLeaseCoordinator.setCheckpoint(KinesisClientLibLeaseCoordinator.java:120)
>   at 
> com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.advancePosition(RecordProcessorCheckpointer.java:216)
>   at 
> com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.checkpoint(RecordProcessorCheckpointer.java:137)
>   at 
> com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.checkpoint(RecordProcessorCheckpointer.java:103)
>   at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1$$anonfun$apply$1.apply$mcV$sp(KinesisCheckpointer.scala:81)
>   at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1$$anonfun$apply$1.apply(KinesisCheckpointer.scala:81)
>   at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1$$anonfun$apply$1.apply(KinesisCheckpointer.scala:81)
>   at scala.util.Try$.apply(Try.scala:192)
>   at 
> org.apache.spark.streaming.kinesis.KinesisRecordProcessor$.retryRandom(KinesisRecordProcessor.scala:144)
>   at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1.apply(KinesisCheckpointer.scala:81)
>   at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1.apply(KinesisCheckpointer.scala:75)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer.checkpoint(KinesisCheckpointer.scala:75)
>   at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer.org$apache$spark$streaming$kinesis$KinesisCheckpointer$$checkpointAll(KinesisCheckpointer.scala:103)
>   at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$1.apply$mcVJ$sp(KinesisCheckpointer.scala:117)
>   at 
> org.apache.spark.streaming.util.RecurringTimer.triggerActionForNextInterval(RecurringTimer.scala:94)
>   at 
> org.apache.spark.streaming.util.RecurringTimer.org$apache$spark$streaming$util$RecurringTimer$$loop(RecurringTimer.scala:106)
>   at 
> org.apache.spark.streaming.util.RecurringTimer$$anon$1.run(RecurringTimer.scala:29)
> running standard kinesis stream ingestion with a 

[jira] [Resolved] (SPARK-12334) Support read from multiple input paths for orc file in DataFrameReader.orc

2017-03-09 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk resolved SPARK-12334.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 10307
[https://github.com/apache/spark/pull/10307]

> Support read from multiple input paths for orc file in DataFrameReader.orc
> --
>
> Key: SPARK-12334
> URL: https://issues.apache.org/jira/browse/SPARK-12334
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.6.0
>Reporter: Jeff Zhang
>Priority: Minor
> Fix For: 2.2.0
>
>
> DataFrameReader.json/text/parquet support multiple input paths, orc should be 
> consistent with that 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12334) Support read from multiple input paths for orc file in DataFrameReader.orc

2017-03-09 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk reassigned SPARK-12334:
---

Assignee: Jeff Zhang

> Support read from multiple input paths for orc file in DataFrameReader.orc
> --
>
> Key: SPARK-12334
> URL: https://issues.apache.org/jira/browse/SPARK-12334
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.6.0
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
>Priority: Minor
> Fix For: 2.2.0
>
>
> DataFrameReader.json/text/parquet support multiple input paths, orc should be 
> consistent with that 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19887) __HIVE_DEFAULT_PARTITION__ is not interpreted as NULL partition value in partitioned persisted tables

2017-03-09 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-19887:
---
Summary: __HIVE_DEFAULT_PARTITION__ is not interpreted as NULL partition 
value in partitioned persisted tables  (was: __HIVE_DEFAULT_PARTITION__ not 
interpreted as NULL partition value in partitioned persisted tables)

> __HIVE_DEFAULT_PARTITION__ is not interpreted as NULL partition value in 
> partitioned persisted tables
> -
>
> Key: SPARK-19887
> URL: https://issues.apache.org/jira/browse/SPARK-19887
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Cheng Lian
>
> The following Spark shell snippet under Spark 2.1 reproduces this issue:
> {code}
> val data = Seq(
>   ("p1", 1, 1),
>   ("p2", 2, 2),
>   (null, 3, 3)
> )
> // Correct case: Saving partitioned data to file system.
> val path = "/tmp/partitioned"
> data.
>   toDF("a", "b", "c").
>   write.
>   mode("overwrite").
>   partitionBy("a", "b").
>   parquet(path)
> spark.read.parquet(path).filter($"a".isNotNull).show(truncate = false)
> // +---+---+---+
> // |c  |a  |b  |
> // +---+---+---+
> // |2  |p2 |2  |
> // |1  |p1 |1  |
> // +---+---+---+
> // Incorrect case: Saving partitioned data as persisted table.
> data.
>   toDF("a", "b", "c").
>   write.
>   mode("overwrite").
>   partitionBy("a", "b").
>   saveAsTable("test_null")
> spark.table("test_null").filter($"a".isNotNull).show(truncate = false)
> // +---+--+---+
> // |c  |a |b  |
> // +---+--+---+
> // |3  |__HIVE_DEFAULT_PARTITION__|3  | <-- This line should not be here
> // |1  |p1|1  |
> // |2  |p2|2  |
> // +---+--+---+
> {code}
> Hive-style partitioned tables use the magic string 
> {{"__HIVE_DEFAULT_PARTITION__"}} to indicate {{NULL}} partition values in 
> partition directory names. However, in the case of a persisted partitioned 
> table, this magic string is not interpreted as {{NULL}} but as a regular string.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19887) __HIVE_DEFAULT_PARTITION__ not interpreted as NULL partition value in partitioned persisted tables

2017-03-09 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-19887:
--

 Summary: __HIVE_DEFAULT_PARTITION__ not interpreted as NULL 
partition value in partitioned persisted tables
 Key: SPARK-19887
 URL: https://issues.apache.org/jira/browse/SPARK-19887
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Cheng Lian


The following Spark shell snippet under Spark 2.1 reproduces this issue:

{code}
val data = Seq(
  ("p1", 1, 1),
  ("p2", 2, 2),
  (null, 3, 3)
)

// Correct case: Saving partitioned data to file system.

val path = "/tmp/partitioned"

data.
  toDF("a", "b", "c").
  write.
  mode("overwrite").
  partitionBy("a", "b").
  parquet(path)

spark.read.parquet(path).filter($"a".isNotNull).show(truncate = false)
// +---+---+---+
// |c  |a  |b  |
// +---+---+---+
// |2  |p2 |2  |
// |1  |p1 |1  |
// +---+---+---+

// Incorrect case: Saving partitioned data as persisted table.

data.
  toDF("a", "b", "c").
  write.
  mode("overwrite").
  partitionBy("a", "b").
  saveAsTable("test_null")

spark.table("test_null").filter($"a".isNotNull).show(truncate = false)
// +---+--+---+
// |c  |a |b  |
// +---+--+---+
// |3  |__HIVE_DEFAULT_PARTITION__|3  | <-- This line should not be here
// |1  |p1|1  |
// |2  |p2|2  |
// +---+--+---+
{code}

Hive-style partitioned tables use the magic string {{"__HIVE_DEFAULT_PARTITION__"}} 
to indicate {{NULL}} partition values in partition directory names. However, in 
the case of a persisted partitioned table, this magic string is not interpreted as 
{{NULL}} but as a regular string.




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19887) __HIVE_DEFAULT_PARTITION__ not interpreted as NULL partition value in partitioned persisted tables

2017-03-09 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15903749#comment-15903749
 ] 

Cheng Lian commented on SPARK-19887:


cc [~cloud_fan]

> __HIVE_DEFAULT_PARTITION__ not interpreted as NULL partition value in 
> partitioned persisted tables
> --
>
> Key: SPARK-19887
> URL: https://issues.apache.org/jira/browse/SPARK-19887
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Cheng Lian
>
> The following Spark shell snippet under Spark 2.1 reproduces this issue:
> {code}
> val data = Seq(
>   ("p1", 1, 1),
>   ("p2", 2, 2),
>   (null, 3, 3)
> )
> // Correct case: Saving partitioned data to file system.
> val path = "/tmp/partitioned"
> data.
>   toDF("a", "b", "c").
>   write.
>   mode("overwrite").
>   partitionBy("a", "b").
>   parquet(path)
> spark.read.parquet(path).filter($"a".isNotNull).show(truncate = false)
> // +---+---+---+
> // |c  |a  |b  |
> // +---+---+---+
> // |2  |p2 |2  |
> // |1  |p1 |1  |
> // +---+---+---+
> // Incorrect case: Saving partitioned data as persisted table.
> data.
>   toDF("a", "b", "c").
>   write.
>   mode("overwrite").
>   partitionBy("a", "b").
>   saveAsTable("test_null")
> spark.table("test_null").filter($"a".isNotNull).show(truncate = false)
> // +---+--+---+
> // |c  |a |b  |
> // +---+--+---+
> // |3  |__HIVE_DEFAULT_PARTITION__|3  | <-- This line should not be here
> // |1  |p1|1  |
> // |2  |p2|2  |
> // +---+--+---+
> {code}
> Hive-style partitioned tables use the magic string 
> {{"__HIVE_DEFAULT_PARTITION__"}} to indicate {{NULL}} partition values in 
> partition directory names. However, in the case of a persisted partitioned 
> table, this magic string is not interpreted as {{NULL}} but as a regular string.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19886) reportDataLoss cause != null check is wrong for Structured Streaming KafkaSource

2017-03-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19886:


Assignee: Apache Spark  (was: Burak Yavuz)

> reportDataLoss cause != null check is wrong for Structured Streaming 
> KafkaSource
> 
>
> Key: SPARK-19886
> URL: https://issues.apache.org/jira/browse/SPARK-19886
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Burak Yavuz
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19886) reportDataLoss cause != null check is wrong for Structured Streaming KafkaSource

2017-03-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19886:


Assignee: Burak Yavuz  (was: Apache Spark)

> reportDataLoss cause != null check is wrong for Structured Streaming 
> KafkaSource
> 
>
> Key: SPARK-19886
> URL: https://issues.apache.org/jira/browse/SPARK-19886
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19886) reportDataLoss cause != null check is wrong for Structured Streaming KafkaSource

2017-03-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15903733#comment-15903733
 ] 

Apache Spark commented on SPARK-19886:
--

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/17228

> reportDataLoss cause != null check is wrong for Structured Streaming 
> KafkaSource
> 
>
> Key: SPARK-19886
> URL: https://issues.apache.org/jira/browse/SPARK-19886
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19886) reportDataLoss cause != null check is wrong for Structured Streaming KafkaSource

2017-03-09 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-19886:
---

 Summary: reportDataLoss cause != null check is wrong for 
Structured Streaming KafkaSource
 Key: SPARK-19886
 URL: https://issues.apache.org/jira/browse/SPARK-19886
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.1.0
Reporter: Burak Yavuz
Assignee: Burak Yavuz






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19861) watermark should not be a negative time.

2017-03-09 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-19861.
--
   Resolution: Fixed
 Assignee: Genmao Yu
Fix Version/s: 2.2.0
   2.1.1

> watermark should not be a negative time.
> 
>
> Key: SPARK-19861
> URL: https://issues.apache.org/jira/browse/SPARK-19861
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Genmao Yu
>Assignee: Genmao Yu
>Priority: Minor
> Fix For: 2.1.1, 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19507) pyspark.sql.types._verify_type() exceptions too broad to debug collections or nested data

2017-03-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15903680#comment-15903680
 ] 

Apache Spark commented on SPARK-19507:
--

User 'dgingrich' has created a pull request for this issue:
https://github.com/apache/spark/pull/17227

> pyspark.sql.types._verify_type() exceptions too broad to debug collections or 
> nested data
> -
>
> Key: SPARK-19507
> URL: https://issues.apache.org/jira/browse/SPARK-19507
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.1.0
> Environment: macOS Sierra 10.12.3
> Spark 2.1.0, installed via Homebrew
>Reporter: David Gingrich
>Priority: Trivial
>
> The private function pyspark.sql.types._verify_type() recursively checks an 
> object against a datatype, raising an exception if the object does not 
> satisfy the type.  These messages are not specific enough to debug a data 
> error in a collection or nested data, for instance:
> {quote}
> >>> import pyspark.sql.types as typ
> >>> schema = typ.StructType([typ.StructField('nest1', 
> >>> typ.MapType(typ.StringType(), typ.ArrayType(typ.FloatType())))])
> >>> typ._verify_type({'nest1': {'nest2': [1]}}, schema)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Users/david/src/3p/spark/python/pyspark/sql/types.py", line 1355, in 
> _verify_type
> _verify_type(obj.get(f.name), f.dataType, f.nullable, name=new_name)
>   File "/Users/david/src/3p/spark/python/pyspark/sql/types.py", line 1349, in 
> _verify_type
> _verify_type(v, dataType.valueType, dataType.valueContainsNull, 
> name=new_name)
>   File "/Users/david/src/3p/spark/python/pyspark/sql/types.py", line 1342, in 
> _verify_type
> _verify_type(i, dataType.elementType, dataType.containsNull, 
> name=new_name)
>   File "/Users/david/src/3p/spark/python/pyspark/sql/types.py", line 1325, in 
> _verify_type
> % (name, dataType, obj, type(obj)))
> TypeError: FloatType can not accept object 1 in type 
> {quote}
> Passing and printing a field name would make debugging easier.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19715) Option to Strip Paths in FileSource

2017-03-09 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-19715.
--
   Resolution: Fixed
 Assignee: Liwei Lin
Fix Version/s: 2.2.0

> Option to Strip Paths in FileSource
> ---
>
> Key: SPARK-19715
> URL: https://issues.apache.org/jira/browse/SPARK-19715
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Michael Armbrust
>Assignee: Liwei Lin
> Fix For: 2.2.0
>
>
> Today, we compare the whole path when deciding if a file is new in the 
> FileSource for structured streaming.  However, this can cause false 
> negatives in the case where the path has changed in a cosmetic way (e.g. 
> changing s3n to s3a).  We should add an option {{fileNameOnly}} that causes 
> the new-file check to be based only on the filename (but still store the 
> whole path in the log).
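
For illustration, a hedged sketch of how the option described above might be used 
once it exists; the option name comes from the description, while the source format 
and path are illustrative.

{code}
// Compare only file names (not full paths) when deciding whether a file is new.
val input = spark.readStream
  .format("text")
  .option("fileNameOnly", "true")
  .load("s3a://bucket/incoming")
{code}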



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19882) Pivot with null as the pivot value throws NPE

2017-03-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15903643#comment-15903643
 ] 

Apache Spark commented on SPARK-19882:
--

User 'aray' has created a pull request for this issue:
https://github.com/apache/spark/pull/17226

> Pivot with null as the pivot value throws NPE
> -
>
> Key: SPARK-19882
> URL: https://issues.apache.org/jira/browse/SPARK-19882
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>
> This seems a regression.
> - Spark 1.6
> {code}
> Seq(Tuple1(None), 
> Tuple1(Some(1))).toDF("a").groupBy($"a").pivot("a").count().show()
> {code}
> prints
> {code}
> +++---+
> |   a|null|  1|
> +++---+
> |null|   0|  0|
> |   1|   0|  1|
> +++---+
> {code}
> - Current master
> {code}
> Seq(Tuple1(None), 
> Tuple1(Some(1))).toDF("a").groupBy($"a").pivot("a").count().show()
> {code}
> prints
> {code}
> java.lang.NullPointerException was thrown.
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.PivotFirst$$anonfun$4.apply(PivotFirst.scala:145)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.PivotFirst$$anonfun$4.apply(PivotFirst.scala:143)
>   at scala.collection.immutable.List.map(List.scala:273)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.PivotFirst.(PivotFirst.scala:143)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolvePivot$$anonfun$apply$7$$anonfun$24.apply(Analyzer.scala:509)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19793) Use clock.getTimeMillis when mark task as finished in TaskSetManager.

2017-03-09 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-19793.

   Resolution: Fixed
 Assignee: jin xing
Fix Version/s: 2.2.0

> Use clock.getTimeMillis when mark task as finished in TaskSetManager.
> -
>
> Key: SPARK-19793
> URL: https://issues.apache.org/jira/browse/SPARK-19793
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: jin xing
>Assignee: jin xing
>Priority: Minor
> Fix For: 2.2.0
>
>
> TaskSetManager currently uses *System.currentTimeMillis* when marking a task as 
> finished in *handleSuccessfulTask* and *handleFailedTask*. Thus developers 
> cannot set a task's finishing time in unit tests. In *handleSuccessfulTask*, the 
> task's duration = System.currentTimeMillis - launchTime (which can be set via 
> *clock*), so the result is not correct.
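
For illustration, a sketch of why an injectable clock matters for tests. It assumes 
Spark's {{org.apache.spark.util.ManualClock}}, which is package-private to Spark, so 
this only compiles inside Spark's own sources and tests.

{code}
import org.apache.spark.util.ManualClock

val clock = new ManualClock(1000L)       // test-controlled time source
val launchTime = clock.getTimeMillis()
clock.advance(500L)                      // simulate 500 ms of task runtime
// Deterministic duration in a unit test, unlike System.currentTimeMillis():
val duration = clock.getTimeMillis() - launchTime   // == 500
{code}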



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19757) Executor with task scheduled could be killed due to idleness

2017-03-09 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-19757:
---
Reporter: jin xing  (was: Jimmy Xiang)

> Executor with task scheduled could be killed due to idleness
> 
>
> Key: SPARK-19757
> URL: https://issues.apache.org/jira/browse/SPARK-19757
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: jin xing
>Assignee: Jimmy Xiang
>Priority: Minor
> Fix For: 2.2.0
>
>
> With dynamic executor allocation enabled in YARN mode, when another job is 
> submitted a while after a previous job finished, there is a race between killing 
> idle executors and scheduling new tasks on those executors. Sometimes an 
> executor is killed right after a task is scheduled on it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19757) Executor with task scheduled could be killed due to idleness

2017-03-09 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-19757.

   Resolution: Fixed
 Assignee: Jimmy Xiang
Fix Version/s: 2.2.0

> Executor with task scheduled could be killed due to idleness
> 
>
> Key: SPARK-19757
> URL: https://issues.apache.org/jira/browse/SPARK-19757
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Jimmy Xiang
>Assignee: Jimmy Xiang
>Priority: Minor
> Fix For: 2.2.0
>
>
> With dynamic executor allocation enabled in YARN mode, when another job is 
> submitted a while after a previous job finished, there is a race between killing 
> idle executors and scheduling new tasks on those executors. Sometimes an 
> executor is killed right after a task is scheduled on it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19885) The config ignoreCorruptFiles doesn't work for CSV

2017-03-09 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-19885:


 Summary: The config ignoreCorruptFiles doesn't work for CSV
 Key: SPARK-19885
 URL: https://issues.apache.org/jira/browse/SPARK-19885
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Shixiong Zhu


CSVFileFormat.inferSchema doesn't use FileScanRDD so the SQL 
"ignoreCorruptFiles" doesn't work.

{code}
java.io.EOFException: Unexpected end of input stream
at 
org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:145)
at 
org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
at java.io.InputStream.read(InputStream.java:101)
at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
at 
org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at 
org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:248)
at 
org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:48)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:266)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:211)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at 
scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
at 
scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:214)
at scala.collection.AbstractIterator.aggregate(Iterator.scala:1336)
at 
org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1112)
at 
org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1112)
at 
org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1980)
at 
org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1980)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at scala.Option.foreach(Option.scala:257)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at 
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1981)
at org.apache.spark.rdd.RDD$$anonfun$aggregate$1.apply(RDD.scala:1114)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at 

[jira] [Commented] (SPARK-19875) Map->filter on many columns gets stuck in constraint inference optimization code

2017-03-09 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15903584#comment-15903584
 ] 

Kazuaki Ishizaki commented on SPARK-19875:
--

I got the stack trace below. The hang seems to occur in constraint propagation.
When I applied SPARK-19846 to this, I no longer get stuck with a 50-column CSV 
dataset.
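
(For reference, a hedged sketch of that workaround, assuming SPARK-19846's flag to 
disable constraint propagation lands under the conf key below; the key name is an 
assumption. The stack trace follows.)

{code}
// Assumed conf key for the flag proposed in SPARK-19846:
spark.conf.set("spark.sql.constraintPropagation.enabled", "false")
{code}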

{code}
"ScalaTest-run-running-Test" #1 prio=5 os_prio=0 tid=0x02a41000 
nid=0x1bb4c runnable [0x02d5b000]
   java.lang.Thread.State: RUNNABLE
at 
scala.collection.mutable.FlatHashTable$class.addEntry(FlatHashTable.scala:148)
at scala.collection.mutable.HashSet.addEntry(HashSet.scala:40)
at 
scala.collection.mutable.FlatHashTable$class.growTable(FlatHashTable.scala:225)
at 
scala.collection.mutable.FlatHashTable$class.addEntry(FlatHashTable.scala:159)
at scala.collection.mutable.HashSet.addEntry(HashSet.scala:40)
at 
scala.collection.mutable.FlatHashTable$class.addElem(FlatHashTable.scala:139)
at scala.collection.mutable.HashSet.addElem(HashSet.scala:40)
at scala.collection.mutable.HashSet.$plus$eq(HashSet.scala:59)
at scala.collection.mutable.HashSet.$plus$eq(HashSet.scala:40)
at 
scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59)
at 
scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59)
at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.AbstractSet.$plus$plus$eq(Set.scala:46)
at scala.collection.mutable.HashSet.clone(HashSet.scala:83)
at scala.collection.mutable.HashSet.clone(HashSet.scala:40)
at 
org.apache.spark.sql.catalyst.expressions.ExpressionSet.$plus(ExpressionSet.scala:65)
at 
org.apache.spark.sql.catalyst.expressions.ExpressionSet.$plus(ExpressionSet.scala:50)
at 
scala.collection.SetLike$$anonfun$$plus$plus$1.apply(SetLike.scala:141)
at 
scala.collection.SetLike$$anonfun$$plus$plus$1.apply(SetLike.scala:141)
at 
scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
at 
scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
at 
scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:316)
at 
scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:972)
at 
scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:972)
at 
scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:972)
at 
scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
at scala.collection.AbstractTraversable.foldLeft(Traversable.scala:104)
at 
scala.collection.TraversableOnce$class.$div$colon(TraversableOnce.scala:151)
at 
scala.collection.AbstractTraversable.$div$colon(Traversable.scala:104)
at scala.collection.SetLike$class.$plus$plus(SetLike.scala:141)
at 
org.apache.spark.sql.catalyst.expressions.ExpressionSet.$plus$plus(ExpressionSet.scala:50)
at 
org.apache.spark.sql.catalyst.plans.logical.UnaryNode$$anonfun$getAliasedConstraints$1.apply(LogicalPlan.scala:325)
at 
org.apache.spark.sql.catalyst.plans.logical.UnaryNode$$anonfun$getAliasedConstraints$1.apply(LogicalPlan.scala:322)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at 
org.apache.spark.sql.catalyst.plans.logical.UnaryNode.getAliasedConstraints(LogicalPlan.scala:322)
at 
org.apache.spark.sql.catalyst.plans.logical.Project.validConstraints(basicLogicalOperators.scala:57)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.constraints$lzycompute(QueryPlan.scala:187)
- locked <0x000682bd9118> (a 
org.apache.spark.sql.catalyst.plans.logical.Project)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.constraints(QueryPlan.scala:187)
at 
org.apache.spark.sql.catalyst.plans.logical.Filter.validConstraints(basicLogicalOperators.scala:130)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.constraints$lzycompute(QueryPlan.scala:187)
- locked <0x000682bd9188> (a 
org.apache.spark.sql.catalyst.plans.logical.Filter)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.constraints(QueryPlan.scala:187)
at 
org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints$$anonfun$apply$13.applyOrElse(Optimizer.scala:612)
at 
org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints$$anonfun$apply$13.applyOrElse(Optimizer.scala:610)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
at 

[jira] [Comment Edited] (SPARK-19067) mapGroupsWithState - arbitrary stateful operations with Structured Streaming (similar to DStream.mapWithState)

2017-03-09 Thread Amit Sela (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15903532#comment-15903532
 ] 

Amit Sela edited comment on SPARK-19067 at 3/9/17 6:15 PM:
---

[~tdas] I just read the PR, and I'm very excited for Spark to provide such a 
powerful stateful operator!
I added a few comments in the PR based on my first impressions as a user, hope 
you don't mind.
I assume that for event-time timeouts you'd look at the watermark time instead 
of wall time, correct? How would that work? If I get it right, it's all 
represented as a table, so the "Watermark Manager" would constantly write 
updates to the table in the "Watermark Column"?


was (Author: amitsela):
[~tdas] I just read the PR, and I'm very excited for Spark to provide such a 
powerful stateful operator!
I added a few comments in the PR based on my first impressions, hope you don't 
mind.
I assume that for event-time-timeouts you'd look at the Watermark time instead 
of Wall time, correct ? how would that work ? If I get it right it's all 
represented as a table so the "Watermark Manager" would constantly right 
updates to the table in the "Watermark Column" ?

> mapGroupsWithState - arbitrary stateful operations with Structured Streaming 
> (similar to DStream.mapWithState)
> --
>
> Key: SPARK-19067
> URL: https://issues.apache.org/jira/browse/SPARK-19067
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Reporter: Michael Armbrust
>Assignee: Tathagata Das
>Priority: Critical
>
> Right now the only way to do stateful operations is with Aggregator or 
> UDAF.
> state making it hard to implement things like sessionization.  We should add 
> a more general construct (probably similar to {{DStream.mapWithState}}) to 
> structured streaming. Here is the design. 
> *Requirements*
> - Users should be able to specify a function that can do the following
> - Access the input row corresponding to a key
> - Access the previous state corresponding to a key
> - Optionally, update or remove the state
> - Output any number of new rows (or none at all)
> *Proposed API*
> {code}
> //  New methods on KeyValueGroupedDataset 
> class KeyValueGroupedDataset[K, V] {  
>   // Scala friendly
>   def mapGroupsWithState[S: Encoder, U: Encoder](func: (K, Iterator[V], 
> State[S]) => U)
> def flatMapGroupsWithState[S: Encoder, U: Encoder](func: (K, 
> Iterator[V], State[S]) => Iterator[U])
>   // Java friendly
>def mapGroupsWithState[S, U](func: MapGroupsWithStateFunction[K, V, S, 
> R], stateEncoder: Encoder[S], resultEncoder: Encoder[U])
>def flatMapGroupsWithState[S, U](func: 
> FlatMapGroupsWithStateFunction[K, V, S, R], stateEncoder: Encoder[S], 
> resultEncoder: Encoder[U])
> }
> // --- New Java-friendly function classes --- 
> public interface MapGroupsWithStateFunction extends Serializable {
>   R call(K key, Iterator values, state: State) throws Exception;
> }
> public interface FlatMapGroupsWithStateFunction extends 
> Serializable {
>   Iterator call(K key, Iterator values, state: State) throws 
> Exception;
> }
> // -- Wrapper class for state data -- 
> trait KeyedState[S] {
>   def exists(): Boolean   
>   def get(): S// throws Exception if state does not 
> exist
>   def getOption(): Option[S]   
>   def update(newState: S): Unit
>   def remove(): Unit  // exists() will be false after this
> }
> {code}
> Key Semantics of the State class
> - The state can be null.
> - If the state.remove() is called, then state.exists() will return false, and 
> getOption will return None.
> - After state.update(newState) is called, state.exists() will 
> return true, and getOption will return Some(...). 
> - None of the operations are thread-safe. This is to avoid memory barriers.
> *Usage*
> {code}
> val stateFunc = (word: String, words: Iterator[String], runningCount: 
> KeyedState[Long]) => {
> val newCount = words.size + runningCount.getOption.getOrElse(0L)
> runningCount.update(newCount)
>(word, newCount)
> }
> dataset   // type 
> is Dataset[String]
>   .groupByKey[String](w => w) // generates 
> KeyValueGroupedDataset[String, String]
>   .mapGroupsWithState[Long, (String, Long)](stateFunc)// returns 
> Dataset[(String, Long)]
> {code}
> *Future Directions*
> - Timeout based state expiration (that has not received data for a while)
> - 

[jira] [Commented] (SPARK-19067) mapGroupsWithState - arbitrary stateful operations with Structured Streaming (similar to DStream.mapWithState)

2017-03-09 Thread Amit Sela (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15903532#comment-15903532
 ] 

Amit Sela commented on SPARK-19067:
---

[~tdas] I just read the PR, and I'm very excited for Spark to provide such a 
powerful stateful operator!
I added a few comments in the PR based on my first impressions, hope you don't 
mind.
I assume that for event-time timeouts you'd look at the watermark time instead 
of wall time, correct? How would that work? If I get it right, it's all 
represented as a table, so the "Watermark Manager" would constantly write 
updates to the table in the "Watermark Column"?

> mapGroupsWithState - arbitrary stateful operations with Structured Streaming 
> (similar to DStream.mapWithState)
> --
>
> Key: SPARK-19067
> URL: https://issues.apache.org/jira/browse/SPARK-19067
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Reporter: Michael Armbrust
>Assignee: Tathagata Das
>Priority: Critical
>
> Right now the only way to do stateful operations is with Aggregator or 
> UDAF.
> state making it hard to implement things like sessionization.  We should add 
> a more general construct (probably similar to {{DStream.mapWithState}}) to 
> structured streaming. Here is the design. 
> *Requirements*
> - Users should be able to specify a function that can do the following
> - Access the input row corresponding to a key
> - Access the previous state corresponding to a key
> - Optionally, update or remove the state
> - Output any number of new rows (or none at all)
> *Proposed API*
> {code}
> //  New methods on KeyValueGroupedDataset 
> class KeyValueGroupedDataset[K, V] {  
>   // Scala friendly
>   def mapGroupsWithState[S: Encoder, U: Encoder](func: (K, Iterator[V], 
> State[S]) => U)
> def flatMapGroupsWithState[S: Encoder, U: Encoder](func: (K, 
> Iterator[V], State[S]) => Iterator[U])
>   // Java friendly
>def mapGroupsWithState[S, U](func: MapGroupsWithStateFunction[K, V, S, 
> R], stateEncoder: Encoder[S], resultEncoder: Encoder[U])
>def flatMapGroupsWithState[S, U](func: 
> FlatMapGroupsWithStateFunction[K, V, S, R], stateEncoder: Encoder[S], 
> resultEncoder: Encoder[U])
> }
> // --- New Java-friendly function classes --- 
> public interface MapGroupsWithStateFunction extends Serializable {
>   R call(K key, Iterator values, state: State) throws Exception;
> }
> public interface FlatMapGroupsWithStateFunction extends 
> Serializable {
>   Iterator call(K key, Iterator values, state: State) throws 
> Exception;
> }
> // -- Wrapper class for state data -- 
> trait KeyedState[S] {
>   def exists(): Boolean   
>   def get(): S// throws Exception if state does not 
> exist
>   def getOption(): Option[S]   
>   def update(newState: S): Unit
>   def remove(): Unit  // exists() will be false after this
> }
> {code}
> Key Semantics of the State class
> - The state can be null.
> - If the state.remove() is called, then state.exists() will return false, and 
> getOption will return None.
> - After state.update(newState) is called, state.exists() will 
> return true, and getOption will return Some(...). 
> - None of the operations are thread-safe. This is to avoid memory barriers.
> *Usage*
> {code}
> val stateFunc = (word: String, words: Iterator[String], runningCount: 
> KeyedState[Long]) => {
> val newCount = words.size + runningCount.getOption.getOrElse(0L)
> runningCount.update(newCount)
>(word, newCount)
> }
> dataset   // type 
> is Dataset[String]
>   .groupByKey[String](w => w) // generates 
> KeyValueGroupedDataset[String, String]
>   .mapGroupsWithState[Long, (String, Long)](stateFunc)// returns 
> Dataset[(String, Long)]
> {code}
> *Future Directions*
> - Timeout based state expiration (that has not received data for a while)
> - General expression based expiration 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19884) Add the ability to get all registered functions from a SparkSession

2017-03-09 Thread Yael Aharon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yael Aharon resolved SPARK-19884.
-
Resolution: Not A Problem

Thank you so much for your reply

> Add the ability to get all registered functions from a SparkSession
> ---
>
> Key: SPARK-19884
> URL: https://issues.apache.org/jira/browse/SPARK-19884
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Yael Aharon
>
> It would be very useful to get the list of functions that are registered with 
> a SparkSession. Built-in and otherwise.
> This would be useful e.g. for auto-completion support in editors built around 
> Spark SQL.
> thanks!



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19884) Add the ability to get all registered functions from a SparkSession

2017-03-09 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15903257#comment-15903257
 ] 

Herman van Hovell commented on SPARK-19884:
---

You can use the catalog for that by calling 
{{spark.catalog.listFunctions().show()}}, or you can use sql and issue the 
{{show functions}} command.
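
For reference, both approaches in a Scala shell:

{code}
// List functions registered with the SparkSession (built-in and user-defined):
spark.catalog.listFunctions().show(500, false)   // catalog API
spark.sql("SHOW FUNCTIONS").show(500, false)     // plain SQL
{code}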

> Add the ability to get all registered functions from a SparkSession
> ---
>
> Key: SPARK-19884
> URL: https://issues.apache.org/jira/browse/SPARK-19884
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Yael Aharon
>
> It would be very useful to get the list of functions that are registered with 
> a SparkSession. Built-in and otherwise.
> This would be useful e.g. for auto-completion support in editors built around 
> Spark SQL.
> thanks!



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17248) Add native Scala enum support to Dataset Encoders

2017-03-09 Thread Lee Dongjin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15903245#comment-15903245
 ] 

Lee Dongjin commented on SPARK-17248:
-

[~pdxleif] // Although it may be a stale question, let me answer.

There are two ways of implementing Enum types in Scala. For more information, 
see: 
http://alvinalexander.com/scala/how-to-use-scala-enums-enumeration-examples The 
'enumeration object' in the tuning guide seems to point to the second way, that 
is, using sealed traits and case objects.

However, it seems that the sentence "Consider using numeric IDs or enumeration 
objects instead of strings for keys." in the tuning guide does not apply to 
Dataset, as in your case. To my humble knowledge, Dataset supports only case 
classes with primitive-typed fields, not ones containing other case classes or 
objects, in this case the enum objects.

If you want some workaround, please check out this example: 
https://github.com/dongjinleekr/spark-dataset/blob/master/src/main/scala/com/github/dongjinleekr/spark/dataset/Titanic.scala
 It shows an example of DataSet using one of the public datasets. Please pay 
attention to how I matched the Passenger case class and its corresponding 
SchemaType - age, pClass, sex and embarked.
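
A minimal sketch of one such workaround (my own hypothetical names, not the linked 
Titanic example; assumes a Spark shell with {{spark.implicits._}} in scope): keep 
only primitive fields in the Dataset and carry the enum as its id.

{code}
object MyEnum extends Enumeration {
  val EnumVal1, EnumVal2 = Value
}

case class MyRow(colId: Int)   // encodable: primitive field only

val ds = Seq(MyEnum.EnumVal1, MyEnum.EnumVal2).map(v => MyRow(v.id)).toDS()

// Convert back to the enum outside the Dataset, e.g. after collect():
val restored = ds.collect().map(r => MyEnum(r.colId))
{code}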

[~srowen] // It seems like lots of users are experiencing similar problems. How 
about changing this issue into providing more examples and explanations in the 
official documentation? If needed, I would like to take the issue.

> Add native Scala enum support to Dataset Encoders
> -
>
> Key: SPARK-17248
> URL: https://issues.apache.org/jira/browse/SPARK-17248
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Silvio Fiorito
>
> Enable support for Scala enums in Encoders. Ideally, users should be able to 
> use enums as part of case classes automatically.
> Currently, this code...
> {code}
> object MyEnum extends Enumeration {
>   type MyEnum = Value
>   val EnumVal1, EnumVal2 = Value
> }
> case class MyClass(col: MyEnum.MyEnum)
> val data = Seq(MyClass(MyEnum.EnumVal1), MyClass(MyEnum.EnumVal2)).toDS()
> {code}
> ...results in this stacktrace:
> {code}
> java.lang.UnsupportedOperationException: No Encoder found for MyEnum.MyEnum
> - field (class: "scala.Enumeration.Value", name: "col")
> - root class: 
> "line550c9f34c5144aa1a1e76bcac863244717.$read.$iwC.$iwC.$iwC.$iwC.MyClass"
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:598)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:592)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:583)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
>   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:583)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:425)
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:61)
>   at org.apache.spark.sql.Encoders$.product(Encoders.scala:274)
>   at 
> org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:47)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19884) Add the ability to get all registered functions from a SparkSession

2017-03-09 Thread Yael Aharon (JIRA)
Yael Aharon created SPARK-19884:
---

 Summary: Add the ability to get all registered functions from a 
SparkSession
 Key: SPARK-19884
 URL: https://issues.apache.org/jira/browse/SPARK-19884
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.1.0
Reporter: Yael Aharon


It would be very useful to get the list of functions that are registered with a 
SparkSession. Built-in and otherwise.

This would be useful e.g. for auto-completion support in editors built around 
Spark SQL.
thanks!



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15474) ORC data source fails to write and read back empty dataframe

2017-03-09 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902991#comment-15902991
 ] 

Hyukjin Kwon commented on SPARK-15474:
--

I see. Thank you for your advice. Let me maybe give it a shot if no one else 
gets interested in this.

>  ORC data source fails to write and read back empty dataframe
> -
>
> Key: SPARK-15474
> URL: https://issues.apache.org/jira/browse/SPARK-15474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> Currently ORC data source fails to write and read empty data.
> The code below:
> {code}
> val emptyDf = spark.range(10).limit(0)
> emptyDf.write
>   .format("orc")
>   .save(path.getCanonicalPath)
> val copyEmptyDf = spark.read
>   .format("orc")
>   .load(path.getCanonicalPath)
> copyEmptyDf.show()
> {code}
> throws an exception below:
> {code}
> Unable to infer schema for ORC at 
> /private/var/folders/9j/gf_c342d7d150mwrxvkqnc18gn/T/spark-5b7aa45b-a37d-43e9-975e-a15b36b370da.
>  It must be specified manually;
> org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC at 
> /private/var/folders/9j/gf_c342d7d150mwrxvkqnc18gn/T/spark-5b7aa45b-a37d-43e9-975e-a15b36b370da.
>  It must be specified manually;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:352)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:352)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:351)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:130)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:140)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelationTest$$anonfun$32$$anonfun$apply$mcV$sp$47.apply(HadoopFsRelationTest.scala:892)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelationTest$$anonfun$32$$anonfun$apply$mcV$sp$47.apply(HadoopFsRelationTest.scala:884)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$class.withTempPath(SQLTestUtils.scala:114)
> {code}
> Note that this is a different case with the data below
> {code}
> val emptyDf = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
> {code}
> In this case, no writer is initialised or created (no calls to 
> {{WriterContainer.writeRows()}}).
> For Parquet and JSON this works, but for ORC it does not.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-19300) Executor is waiting for lock

2017-03-09 Thread hustfxj (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hustfxj updated SPARK-19300:

Comment: was deleted

(was: [~zsxwing] I also hit this issue; please see 
https://issues.apache.org/jira/browse/SPARK-19883. My netty version is already 
4.0.43.Final.)

> Executor is waiting for lock
> 
>
> Key: SPARK-19300
> URL: https://issues.apache.org/jira/browse/SPARK-19300
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: cen yuhai
>Priority: Critical
> Attachments: stderr.jpg, WAITING.jpg
>
>
> I can see that all the threads in the executor are waiting for a lock.
> {code}
> sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:313)
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54)
> scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
>  Source)
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown
>  Source)
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$2.hasNext(WholeStageCodegenExec.scala:396)
> org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:138)
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:215)
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:957)
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:948)
> org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:888)
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:948)
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:694)
> org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:342)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:293)
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:331)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:295)
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:331)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:295)
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:331)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:295)
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:331)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:295)
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:331)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:295)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
> org.apache.spark.scheduler.Task.run(Task.scala:99)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:330)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (SPARK-19300) Executor is waiting for lock

2017-03-09 Thread hustfxj (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902988#comment-15902988
 ] 

hustfxj commented on SPARK-19300:
-

[~zsxwing] I also hit this issue; please see 
https://issues.apache.org/jira/browse/SPARK-19883. My netty version is already 
4.0.43.Final.

> Executor is waiting for lock
> 
>
> Key: SPARK-19300
> URL: https://issues.apache.org/jira/browse/SPARK-19300
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: cen yuhai
>Priority: Critical
> Attachments: stderr.jpg, WAITING.jpg
>
>
> I can see that all the threads in the executor are waiting for a lock.
> {code}
> sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:313)
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54)
> scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
>  Source)
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown
>  Source)
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$2.hasNext(WholeStageCodegenExec.scala:396)
> org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:138)
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:215)
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:957)
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:948)
> org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:888)
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:948)
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:694)
> org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:342)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:293)
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:331)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:295)
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:331)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:295)
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:331)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:295)
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:331)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:295)
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:331)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:295)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
> org.apache.spark.scheduler.Task.run(Task.scala:99)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:330)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (SPARK-19851) Add support for EVERY and ANY (SOME) aggregates

2017-03-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902990#comment-15902990
 ] 

Apache Spark commented on SPARK-19851:
--

User 'ptkool' has created a pull request for this issue:
https://github.com/apache/spark/pull/17194

> Add support for EVERY and ANY (SOME) aggregates
> ---
>
> Key: SPARK-19851
> URL: https://issues.apache.org/jira/browse/SPARK-19851
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Michael Styles
>
> Add support for EVERY and ANY (SOME) aggregates.
> - EVERY returns true if all input values are true.
> - ANY returns true if at least one input value is true.
> - SOME is equivalent to ANY.
> Both aggregates are part of the SQL standard.
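
Until EVERY/ANY land, the intended semantics can be approximated with existing 
aggregates; a rough sketch (my own illustration, not part of the proposal) over 
a boolean column {{b}} of a DataFrame {{df}}:

{code}
// Sketch: emulate EVERY / ANY over a boolean column "b" with MIN / MAX.
// Assumes a DataFrame `df` with a boolean column "b".
import org.apache.spark.sql.functions.{col, max, min}

val b = col("b").cast("int")            // false -> 0, true -> 1 (nulls stay null)
val agg = df.agg(
  (min(b) === 1).as("every_b"),         // true only if all values are true
  (max(b) === 1).as("any_b")            // true if at least one value is true
)
agg.show()
{code}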



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19851) Add support for EVERY and ANY (SOME) aggregates

2017-03-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19851:


Assignee: Apache Spark

> Add support for EVERY and ANY (SOME) aggregates
> ---
>
> Key: SPARK-19851
> URL: https://issues.apache.org/jira/browse/SPARK-19851
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Michael Styles
>Assignee: Apache Spark
>
> Add support for EVERY and ANY (SOME) aggregates.
> - EVERY returns true if all input values are true.
> - ANY returns true if at least one input value is true.
> - SOME is equivalent to ANY.
> Both aggregates are part of the SQL standard.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19851) Add support for EVERY and ANY (SOME) aggregates

2017-03-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19851:


Assignee: (was: Apache Spark)

> Add support for EVERY and ANY (SOME) aggregates
> ---
>
> Key: SPARK-19851
> URL: https://issues.apache.org/jira/browse/SPARK-19851
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Michael Styles
>
> Add support for EVERY and ANY (SOME) aggregates.
> - EVERY returns true if all input values are true.
> - ANY returns true if at least one input value is true.
> - SOME is equivalent to ANY.
> Both aggregates are part of the SQL standard.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-19829) The log about driver should support rolling like executor

2017-03-09 Thread hustfxj (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hustfxj closed SPARK-19829.
---
Resolution: Invalid

> The log about driver should support rolling like executor
> -
>
> Key: SPARK-19829
> URL: https://issues.apache.org/jira/browse/SPARK-19829
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: hustfxj
>Priority: Minor
>
> We should roll the driver's log; otherwise the log may grow very large.
> {code:title=DriverRunner.java|borderStyle=solid}
> // modify the runDriver
>   private def runDriver(builder: ProcessBuilder, baseDir: File, supervise: 
> Boolean): Int = {
> builder.directory(baseDir)
> def initialize(process: Process): Unit = {
>   // Redirect stdout and stderr to files-- the old code
> //  val stdout = new File(baseDir, "stdout")
> //  CommandUtils.redirectStream(process.getInputStream, stdout)
> //
> //  val stderr = new File(baseDir, "stderr")
> //  val formattedCommand = builder.command.asScala.mkString("\"", "\" 
> \"", "\"")
> //  val header = "Launch Command: %s\n%s\n\n".format(formattedCommand, 
> "=" * 40)
> //  Files.append(header, stderr, StandardCharsets.UTF_8)
> //  CommandUtils.redirectStream(process.getErrorStream, stderr)
>   // Redirect its stdout and stderr to files-support rolling
>   val stdout = new File(baseDir, "stdout")
>   stdoutAppender = FileAppender(process.getInputStream, stdout, conf)
>   val stderr = new File(baseDir, "stderr")
>   val formattedCommand = builder.command.asScala.mkString("\"", "\" \"", 
> "\"")
>   val header = "Launch Command: %s\n%s\n\n".format(formattedCommand, "=" 
> * 40)
>   Files.append(header, stderr, StandardCharsets.UTF_8)
>   stderrAppender = FileAppender(process.getErrorStream, stderr, conf)
> }
> runCommandWithRetry(ProcessBuilderLike(builder), initialize, supervise)
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19829) The log about driver should support rolling like executor

2017-03-09 Thread hustfxj (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902982#comment-15902982
 ] 

hustfxj commented on SPARK-19829:
-

ok.

> The log about driver should support rolling like executor
> -
>
> Key: SPARK-19829
> URL: https://issues.apache.org/jira/browse/SPARK-19829
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: hustfxj
>Priority: Minor
>
> We should roll the driver's log; otherwise the log may grow very large.
> {code:title=DriverRunner.java|borderStyle=solid}
> // modify the runDriver
>   private def runDriver(builder: ProcessBuilder, baseDir: File, supervise: 
> Boolean): Int = {
> builder.directory(baseDir)
> def initialize(process: Process): Unit = {
>   // Redirect stdout and stderr to files-- the old code
> //  val stdout = new File(baseDir, "stdout")
> //  CommandUtils.redirectStream(process.getInputStream, stdout)
> //
> //  val stderr = new File(baseDir, "stderr")
> //  val formattedCommand = builder.command.asScala.mkString("\"", "\" 
> \"", "\"")
> //  val header = "Launch Command: %s\n%s\n\n".format(formattedCommand, 
> "=" * 40)
> //  Files.append(header, stderr, StandardCharsets.UTF_8)
> //  CommandUtils.redirectStream(process.getErrorStream, stderr)
>   // Redirect its stdout and stderr to files-support rolling
>   val stdout = new File(baseDir, "stdout")
>   stdoutAppender = FileAppender(process.getInputStream, stdout, conf)
>   val stderr = new File(baseDir, "stderr")
>   val formattedCommand = builder.command.asScala.mkString("\"", "\" \"", 
> "\"")
>   val header = "Launch Command: %s\n%s\n\n".format(formattedCommand, "=" 
> * 40)
>   Files.append(header, stderr, StandardCharsets.UTF_8)
>   stderrAppender = FileAppender(process.getErrorStream, stderr, conf)
> }
> runCommandWithRetry(ProcessBuilderLike(builder), initialize, supervise)
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19883) Executor is waiting for lock

2017-03-09 Thread hustfxj (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902980#comment-15902980
 ] 

hustfxj commented on SPARK-19883:
-

[~srowen] I saw this issue again today, so Spark may not have fixed it.

> Executor is waiting for lock
> 
>
> Key: SPARK-19883
> URL: https://issues.apache.org/jira/browse/SPARK-19883
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: hustfxj
> Attachments: jstack
>
>
> {code:borderStyle=solid}
> "Executor task launch worker for task 4808985" #5373 daemon prio=5 os_prio=0 
> tid=0x7f54ef437000 nid=0x1aed0 waiting on condition [0x7f53aebfe000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x000498c249c0> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:189)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:58)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:199)
> at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:97)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
> at org.apache.spark.scheduler.Task.run(Task.scala:114)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:323)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1147)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622)
> at java.lang.Thread.run(Thread.java:834)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19883) Executor is waiting for lock

2017-03-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19883.
---
Resolution: Duplicate

This is at best quite incomplete, but you are saying it is a duplicate. 

> Executor is waiting for lock
> 
>
> Key: SPARK-19883
> URL: https://issues.apache.org/jira/browse/SPARK-19883
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: hustfxj
> Attachments: jstack
>
>
> {code:borderStyle=solid}
> "Executor task launch worker for task 4808985" #5373 daemon prio=5 os_prio=0 
> tid=0x7f54ef437000 nid=0x1aed0 waiting on condition [0x7f53aebfe000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x000498c249c0> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:189)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:58)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:199)
> at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:97)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
> at org.apache.spark.scheduler.Task.run(Task.scala:114)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:323)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1147)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622)
> at java.lang.Thread.run(Thread.java:834)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19882) Pivot with null as the pivot value throws NPE

2017-03-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19882:


Assignee: (was: Apache Spark)

> Pivot with null as the pivot value throws NPE
> -
>
> Key: SPARK-19882
> URL: https://issues.apache.org/jira/browse/SPARK-19882
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>
> This seems to be a regression.
> - Spark 1.6
> {code}
> Seq(Tuple1(None), 
> Tuple1(Some(1))).toDF("a").groupBy($"a").pivot("a").count().show()
> {code}
> prints
> {code}
> +++---+
> |   a|null|  1|
> +++---+
> |null|   0|  0|
> |   1|   0|  1|
> +++---+
> {code}
> - Current master
> {code}
> Seq(Tuple1(None), 
> Tuple1(Some(1))).toDF("a").groupBy($"a").pivot("a").count().show()
> {code}
> prints
> {code}
> java.lang.NullPointerException was thrown.
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.PivotFirst$$anonfun$4.apply(PivotFirst.scala:145)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.PivotFirst$$anonfun$4.apply(PivotFirst.scala:143)
>   at scala.collection.immutable.List.map(List.scala:273)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.PivotFirst.(PivotFirst.scala:143)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolvePivot$$anonfun$apply$7$$anonfun$24.apply(Analyzer.scala:509)
> {code}
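
A possible interim workaround (my own sketch, not verified against current 
master): replace nulls with a sentinel before pivoting, so null never appears 
among the pivot values.

{code}
// Sketch: substitute a sentinel for null before pivoting.
// Assumes `import spark.implicits._` is in scope; -1 is an arbitrary sentinel.
import org.apache.spark.sql.functions.{coalesce, lit}

val df = Seq(Tuple1(None: Option[Int]), Tuple1(Some(1))).toDF("a")
df.withColumn("a_nn", coalesce($"a", lit(-1)))
  .groupBy($"a_nn")
  .pivot("a_nn")
  .count()
  .show()
{code}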



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19882) Pivot with null as the pivot value throws NPE

2017-03-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902977#comment-15902977
 ] 

Apache Spark commented on SPARK-19882:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/17224

> Pivot with null as the pivot value throws NPE
> -
>
> Key: SPARK-19882
> URL: https://issues.apache.org/jira/browse/SPARK-19882
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>
> This seems to be a regression.
> - Spark 1.6
> {code}
> Seq(Tuple1(None), 
> Tuple1(Some(1))).toDF("a").groupBy($"a").pivot("a").count().show()
> {code}
> prints
> {code}
> +++---+
> |   a|null|  1|
> +++---+
> |null|   0|  0|
> |   1|   0|  1|
> +++---+
> {code}
> - Current master
> {code}
> Seq(Tuple1(None), 
> Tuple1(Some(1))).toDF("a").groupBy($"a").pivot("a").count().show()
> {code}
> prints
> {code}
> java.lang.NullPointerException was thrown.
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.PivotFirst$$anonfun$4.apply(PivotFirst.scala:145)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.PivotFirst$$anonfun$4.apply(PivotFirst.scala:143)
>   at scala.collection.immutable.List.map(List.scala:273)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.PivotFirst.(PivotFirst.scala:143)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolvePivot$$anonfun$apply$7$$anonfun$24.apply(Analyzer.scala:509)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-19883) Executor is waiting for lock

2017-03-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen closed SPARK-19883.
-

> Executor is waiting for lock
> 
>
> Key: SPARK-19883
> URL: https://issues.apache.org/jira/browse/SPARK-19883
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: hustfxj
> Attachments: jstack
>
>
> {code:borderStyle=solid}
> "Executor task launch worker for task 4808985" #5373 daemon prio=5 os_prio=0 
> tid=0x7f54ef437000 nid=0x1aed0 waiting on condition [0x7f53aebfe000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x000498c249c0> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:189)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:58)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:199)
> at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:97)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
> at org.apache.spark.scheduler.Task.run(Task.scala:114)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:323)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1147)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622)
> at java.lang.Thread.run(Thread.java:834)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19882) Pivot with null as the pivot value throws NPE

2017-03-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19882:


Assignee: Apache Spark

> Pivot with null as the pivot value throws NPE
> -
>
> Key: SPARK-19882
> URL: https://issues.apache.org/jira/browse/SPARK-19882
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>
> This seems to be a regression.
> - Spark 1.6
> {code}
> Seq(Tuple1(None), 
> Tuple1(Some(1))).toDF("a").groupBy($"a").pivot("a").count().show()
> {code}
> prints
> {code}
> +++---+
> |   a|null|  1|
> +++---+
> |null|   0|  0|
> |   1|   0|  1|
> +++---+
> {code}
> - Current master
> {code}
> Seq(Tuple1(None), 
> Tuple1(Some(1))).toDF("a").groupBy($"a").pivot("a").count().show()
> {code}
> prints
> {code}
> java.lang.NullPointerException was thrown.
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.PivotFirst$$anonfun$4.apply(PivotFirst.scala:145)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.PivotFirst$$anonfun$4.apply(PivotFirst.scala:143)
>   at scala.collection.immutable.List.map(List.scala:273)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.PivotFirst.(PivotFirst.scala:143)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolvePivot$$anonfun$apply$7$$anonfun$24.apply(Analyzer.scala:509)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19883) Executor is waiting for lock

2017-03-09 Thread hustfxj (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902972#comment-15902972
 ] 

hustfxj edited comment on SPARK-19883 at 3/9/17 12:25 PM:
--

[~srowen] A stage is still blocked because one of its tasks is still 
waiting like this; the stage has been blocked since 11:20 AM. I do think this is 
a problem. It looks like the same issue as
https://issues.apache.org/jira/browse/SPARK-19300






was (Author: hustfxj):
[~srowen] A stage is still blocked because one of its tasks is still 
waiting like this; the stage has been blocked since 11:20 AM.
The executor's log looks like this:




> Executor is waiting for lock
> 
>
> Key: SPARK-19883
> URL: https://issues.apache.org/jira/browse/SPARK-19883
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: hustfxj
> Attachments: jstack
>
>
> {code:borderStyle=solid}
> "Executor task launch worker for task 4808985" #5373 daemon prio=5 os_prio=0 
> tid=0x7f54ef437000 nid=0x1aed0 waiting on condition [0x7f53aebfe000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x000498c249c0> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:189)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:58)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:199)
> at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:97)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
> at org.apache.spark.scheduler.Task.run(Task.scala:114)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:323)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1147)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622)
> at java.lang.Thread.run(Thread.java:834)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


