[jira] [Assigned] (SPARK-24886) Increase Jenkins build time
[ https://issues.apache.org/jira/browse/SPARK-24886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24886: Assignee: Apache Spark > Increase Jenkins build time > --- > > Key: SPARK-24886 > URL: https://issues.apache.org/jira/browse/SPARK-24886 > Project: Spark > Issue Type: Test > Components: Project Infra > Affects Versions: 2.4.0 > Reporter: Hyukjin Kwon > Assignee: Apache Spark > Priority: Major > > Currently, we seem to hit the time limit from time to time. It would be better to increase > the time a bit. > For instance, please see https://github.com/apache/spark/pull/21822 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24886) Increase Jenkins build time
[ https://issues.apache.org/jira/browse/SPARK-24886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16552347#comment-16552347 ] Apache Spark commented on SPARK-24886: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/21845
[jira] [Assigned] (SPARK-24886) Increase Jenkins build time
[ https://issues.apache.org/jira/browse/SPARK-24886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24886: Assignee: (was: Apache Spark)
[jira] [Created] (SPARK-24886) Increase Jenkins build time
Hyukjin Kwon created SPARK-24886: Summary: Increase Jenkins build time Key: SPARK-24886 URL: https://issues.apache.org/jira/browse/SPARK-24886 Project: Spark Issue Type: Test Components: Project Infra Affects Versions: 2.4.0 Reporter: Hyukjin Kwon Currently, we seem to hit the time limit from time to time. It would be better to increase the time a bit. For instance, please see https://github.com/apache/spark/pull/21822
[jira] [Commented] (SPARK-18492) GeneratedIterator grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-18492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16552257#comment-16552257 ] Takeshi Yamamuro commented on SPARK-18492: -- [~MDS Tang] It would be better to describe more about your environment, e.g., Spark version, a query to reproduce, ... > GeneratedIterator grows beyond 64 KB > > > Key: SPARK-18492 > URL: https://issues.apache.org/jira/browse/SPARK-18492 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.0.1 > Environment: CentOS release 6.7 (Final) > Reporter: Norris Merritt > Priority: Major > Attachments: Screenshot from 2018-03-02 12-57-51.png > > > spark-submit fails with ERROR CodeGenerator: failed to compile: > org.codehaus.janino.JaninoRuntimeException: Code of method > "(I[Lscala/collection/Iterator;)V" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" > grows beyond 64 KB > The error message is followed by a huge dump of generated source code. > The generated code declares 1,454 field sequences like the following: > /* 036 */ private org.apache.spark.sql.catalyst.expressions.ScalaUDF > project_scalaUDF1; > /* 037 */ private scala.Function1 project_catalystConverter1; > /* 038 */ private scala.Function1 project_converter1; > /* 039 */ private scala.Function1 project_converter2; > /* 040 */ private scala.Function2 project_udf1; > (many omitted lines) ... > /* 6089 */ private org.apache.spark.sql.catalyst.expressions.ScalaUDF > project_scalaUDF1454; > /* 6090 */ private scala.Function1 project_catalystConverter1454; > /* 6091 */ private scala.Function1 project_converter1695; > /* 6092 */ private scala.Function1 project_udf1454; > It then proceeds to emit code for several methods (init, processNext), each of > which has totally repetitive sequences of statements pertaining to each of > the sequences of variables declared in the class. 
For example: > /* 6101 */ public void init(int index, scala.collection.Iterator inputs[]) { > The reason that the 64KB JVM limit for code for a method is exceeded is > because the code generator is using an incredibly naive strategy. It emits a > sequence like the one shown below for each of the 1,454 groups of variables > shown above, in > /* 6132 */ this.project_udf = > (scala.Function1)project_scalaUDF.userDefinedFunc(); > /* 6133 */ this.project_scalaUDF1 = > (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[10]; > /* 6134 */ this.project_catalystConverter1 = > (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF1.dataType()); > /* 6135 */ this.project_converter1 = > (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToScalaConverter(((org.apache.spark.sql.catalyst.expressions.Expression)(((org.apache.spark.sql.catalyst.expressions.ScalaUDF)references[10]).getChildren().apply(0))).dataType()); > /* 6136 */ this.project_converter2 = > (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToScalaConverter(((org.apache.spark.sql.catalyst.expressions.Expression)(((org.apache.spark.sql.catalyst.expressions.ScalaUDF)references[10]).getChildren().apply(1))).dataType()); > It blows up after emitting 230 such sequences, while trying to emit the 231st: > /* 7282 */ this.project_udf230 = > (scala.Function2)project_scalaUDF230.userDefinedFunc(); > /* 7283 */ this.project_scalaUDF231 = > (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[240]; > /* 7284 */ this.project_catalystConverter231 = > (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF231.dataType()); > many omitted lines ... 
> Example of repetitive code sequences emitted for processNext method: > /* 12253 */ boolean project_isNull247 = project_result244 == null; > /* 12254 */ MapData project_value247 = null; > /* 12255 */ if (!project_isNull247) { > /* 12256 */ project_value247 = project_result244; > /* 12257 */ } > /* 12258 */ Object project_arg = sort_isNull5 ? null : > project_converter489.apply(sort_value5); > /* 12259 */ > /* 12260 */ ArrayData project_result249 = null; > /* 12261 */ try { > /* 12262 */ project_result249 = > (ArrayData)project_catalystConverter248.apply(project_udf248.apply(project_arg)); > /* 12263 */ } catch (Exception e) { > /* 12264 */ throw new > org.apache.spark.SparkException(project_scalaUDF248.udfErrorMessage(), e); > /* 12265 */ } > /* 12266 */ > /* 12267 */ boolean project_isNull252 = project_result249 == null; > /* 12268 */ ArrayData project_value252 = null;
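The failure mode quoted above (one giant init()/processNext() body) points at the standard remedy for the JVM's 64 KB per-method bytecode limit: chunk the generated statements into many small helper methods and call them in sequence, which is roughly the direction Spark's code generator later took with its code-splitting support. The sketch below is illustrative Python, not Spark's actual CodeGenerator; the chunk size and method names are invented for the example.

```python
def split_into_methods(statements, max_per_method=100):
    """Group generated Java statements into numbered helper methods so that
    no single method body grows past a fixed statement budget."""
    methods = []
    for i in range(0, len(statements), max_per_method):
        chunk = statements[i:i + max_per_method]
        name = f"init_{i // max_per_method}"
        body = "\n".join("  " + s for s in chunk)
        methods.append(f"private void {name}() {{\n{body}\n}}")
    return methods

# The 1,454 field groups would previously all land in one init(); chunked,
# they become 15 helpers of at most 100 statements each.
stmts = [f"this.project_udf{n} = (scala.Function1) references[{n}];"
         for n in range(1454)]
helpers = split_into_methods(stmts)
assert len(helpers) == 15
```

Each helper stays far below the 64 KB limit, and the real init() shrinks to a short run of helper calls.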
[jira] [Commented] (SPARK-18492) GeneratedIterator grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-18492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16552242#comment-16552242 ] Tang Yu Jie commented on SPARK-18492: - Here I also encountered this problem, have you resolved it?
[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark
[ https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16552233#comment-16552233 ] Saisai Shao commented on SPARK-24615: - Hi [~tgraves], what you mentioned above is also what we have been thinking about while trying to figure out a way to solve it (this problem also exists in barrier execution). From the user's point of view, specifying resources through the RDD is currently the only feasible way I can think of, though a resource is bound to a stage/task, not a particular RDD. This means users could specify resources for different RDDs in a single stage, while Spark can only use one resource setting within that stage. This brings out several problems, as you mentioned: *Specify resources on which RDD* For example, {{rddA.withResource.mapPartition \{ xxx \}.collect()}} is no different from {{rddA.mapPartition \{ xxx \}.withResource.collect}}, since all the RDDs are executed in the same stage. So in the current design, no matter whether the resource is specified on {{rddA}} or the mapped RDD, the result is the same. *One-to-one dependency RDDs with different resources* For example, {{rddA.withResource.mapPartition \{ xxx \}.withResource.collect()}}; here, assuming the resource requests for {{rddA}} and the mapped RDD are different, since they run in a single stage we have to resolve such a conflict. *Multiple-dependency RDDs with different resources* For example: {code} val rddA = rdd.withResources.mapPartitions() val rddB = rdd.withResources.mapPartitions() val rddC = rddA.join(rddB) {code} If the resources in {{rddA}} are different from those in {{rddB}}, then we should also resolve such conflicts. Previously I proposed using the largest resource requirement to satisfy all the needs, but that may also waste resources; [~mengxr] mentioned setting/merging resources per partition to avoid the waste. 
Meanwhile, if there were an API exposed to set resources at the stage level, this problem would not exist, but Spark doesn't expose such APIs to users; the only thing a user can specify is at the RDD level. I'm still thinking of a good way to fix it. > Accelerator-aware task scheduling for Spark > --- > > Key: SPARK-24615 > URL: https://issues.apache.org/jira/browse/SPARK-24615 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Affects Versions: 2.4.0 > Reporter: Saisai Shao > Assignee: Saisai Shao > Priority: Major > Labels: Hydrogen, SPIP > > In the machine learning area, accelerator cards (GPU, FPGA, TPU) are > predominant compared to CPUs. To make the current Spark architecture work > with accelerator cards, Spark itself should understand the existence of > accelerators and know how to schedule tasks onto the executors where > accelerators are equipped. > Spark's current scheduler schedules tasks based on the locality of the data > plus the availability of CPUs. This will introduce some problems when scheduling > tasks that require accelerators. > # There are usually more CPU cores than accelerators on one node; using CPU cores > to schedule accelerator-required tasks will introduce a mismatch. > # In one cluster, we always assume that a CPU is equipped in each node, but > this is not true of accelerator cards. > # The existence of heterogeneous tasks (accelerator required or not) > requires the scheduler to schedule tasks in a smart way. > So here we propose to improve the current scheduler to support heterogeneous > tasks (accelerator required or not). This can be part of the work of Project > Hydrogen. > Details are attached in a Google doc. It doesn't cover all the implementation > details, just highlights the parts that should be changed. > > CC [~yanboliang] [~merlintang]
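One of the conflict-resolution policies discussed above, taking the largest requirement when several RDDs in one stage request different resources, can be sketched in a few lines. This is illustrative Python, not a Spark API; the function name and dict shapes are invented for the example.

```python
def merge_stage_resources(requests):
    """Merge per-RDD resource requests into a single per-stage request,
    keeping the largest amount asked for each resource type."""
    merged = {}
    for req in requests:
        for resource, amount in req.items():
            merged[resource] = max(merged.get(resource, 0), amount)
    return merged

# rddA and rddB end up in the same stage but carry different requests:
rdd_a = {"gpu": 2, "cpu": 4}
rdd_b = {"gpu": 1, "fpga": 1}
assert merge_stage_resources([rdd_a, rdd_b]) == {"gpu": 2, "cpu": 4, "fpga": 1}
```

As noted in the comment, the downside of max-merging is waste: the stage holds 2 GPUs even while executing the part of the pipeline that asked for 1, which is why a per-partition merge was floated as an alternative.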
[jira] [Commented] (SPARK-24841) Memory leak in converting spark dataframe to pandas dataframe
[ https://issues.apache.org/jira/browse/SPARK-24841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16552231#comment-16552231 ] Kazuaki Ishizaki commented on SPARK-24841: -- Thank you for reporting an issue with heap profiling. Would it be possible to post a standalone program that can reproduce this problem? > Memory leak in converting spark dataframe to pandas dataframe > - > > Key: SPARK-24841 > URL: https://issues.apache.org/jira/browse/SPARK-24841 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 > Environment: Running PySpark in standalone mode >Reporter: Piyush Seth >Priority: Minor > > I am running a continuous running application using PySpark. In one of the > operations I have to convert PySpark data frame to Pandas data frame using > toPandas API on pyspark driver. After running for a while I am getting > "java.lang.OutOfMemoryError: GC overhead limit exceeded" error. > I tried running this in a loop and could see that the heap memory is > increasing continuously. 
When I ran jmap for the first time I had the following top rows: > num #instances #bytes class name > ---------------------------------------------------------- > 1: 1757 411477568 [J > *2: 124188 266323152 [C* > 3: 167219 46821320 org.apache.spark.status.TaskDataWrapper > 4: 69683 27159536 [B > 5: 359278 8622672 java.lang.Long > 6: 221808 7097856 java.util.concurrent.ConcurrentHashMap$Node > 7: 283771 6810504 scala.collection.immutable.$colon$colon > After running several iterations I had the following: > num #instances #bytes class name > ---------------------------------------------------------- > *1: 110760 3439887928 [C* > 2: 698 411429088 [J > 3: 238096 6880 org.apache.spark.status.TaskDataWrapper > 4: 68819 24050520 [B > 5: 498308 11959392 java.lang.Long > 6: 292741 9367712 java.util.concurrent.ConcurrentHashMap$Node > 7: 282878 6789072 scala.collection.immutable.$colon$colon
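Independent of jmap, growth like this can be confirmed from the driver side by running the suspect operation in a loop and checking that traced allocations climb instead of plateauing. The sketch below uses Python's stdlib tracemalloc; work() is a hypothetical stand-in for the real df.toPandas() call (which needs a live SparkSession) and simply retains memory the way the accumulating char arrays above do.

```python
import tracemalloc

leaked = []  # stands in for driver-side state that is never released

def work():
    # placeholder for: pdf = df.toPandas()
    leaked.append(bytearray(1_000_000))  # retains ~1 MB per iteration

tracemalloc.start()
work()  # warm up so one-time allocations don't skew the baseline
baseline, _ = tracemalloc.get_traced_memory()
for _ in range(10):
    work()
current, _ = tracemalloc.get_traced_memory()
# a leak shows up as near-linear growth; a healthy loop would stay flat
assert current - baseline > 9_000_000
```

The same loop-and-measure pattern, with jmap histograms taken between iterations, is effectively what the report above already does by hand.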
[jira] [Resolved] (SPARK-24859) Predicates pushdown on outer joins
[ https://issues.apache.org/jira/browse/SPARK-24859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24859. -- Resolution: Cannot Reproduce Seems fixed in the master branch. Please let me know if you have a JIRA fixing this. Let me leave this resolved for now. > Predicates pushdown on outer joins > -- > > Key: SPARK-24859 > URL: https://issues.apache.org/jira/browse/SPARK-24859 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL > Affects Versions: 2.2.0 > Environment: Cloudera CDH 5.13.1 > Reporter: Johannes Mayer > Priority: Major > > I have two AVRO tables in Hive called FAct and DIm. Both are partitioned by a > common column called part_col. Now I want to join both tables on their id, but > only for some of the partitions. > If I use an inner join, everything works well: > > {code:java} > select * > from FA f > join DI d > on(f.id = d.id and f.part_col = d.part_col) > where f.part_col = 'xyz' > {code} > > In the SQL explain plan I can see that the predicate part_col = 'xyz' is > also used in the DIm HiveTableScan. > > When I execute the same query using a left join, the full dim table is > scanned. There are some workarounds for this issue, but I wanted to report > this as a bug, since it works on an inner join, and I think the behaviour > should be the same for an outer join. 
> Here is a self contained example (created in Zeppelin): > > {code:java} > val fact = Seq((1, 100), (2, 200), (3,100), (4,200)).toDF("id", "part_col") > val dim = Seq((1, 100), (2, 200)).toDF("id", "part_col") > fact.repartition($"part_col").write.mode("overwrite").partitionBy("part_col").format("com.databricks.spark.avro").save("/tmp/jira/fact") > dim.repartition($"part_col").write.mode("overwrite").partitionBy("part_col").format("com.databricks.spark.avro").save("/tmp/jira/dim") > > spark.sqlContext.sql("create table if not exists fact(id int) partitioned by > (part_col int) stored as avro location '/tmp/jira/fact'") > spark.sqlContext.sql("msck repair table fact") spark.sqlContext.sql("create > table if not exists dim(id int) partitioned by (part_col int) stored as avro > location '/tmp/jira/dim'") > spark.sqlContext.sql("msck repair table dim"){code} > > > > *Inner join example:* > {code:java} > select * from fact f > join dim d > on (f.id = d.id > and f.part_col = d.part_col) > where f.part_col = 100{code} > Excerpt from Spark-SQL physical explain plan: > {code:java} > HiveTableScan [id#411, part_col#412], CatalogRelation `default`.`fact`, > org.apache.hadoop.hive.serde2.avro.AvroSerDe, [id#411], [part_col#412], > [isnotnull(part_col#412), (part_col#412 = 100)] > HiveTableScan [id#413, part_col#414], CatalogRelation `default`.`dim`, > org.apache.hadoop.hive.serde2.avro.AvroSerDe, [id#413], [part_col#414], > [isnotnull(part_col#414), (part_col#414 = 100)]{code} > > *Outer join example:* > {code:java} > select * from fact f > left join dim d > on (f.id = d.id > and f.part_col = d.part_col) > where f.part_col = 100{code} > > Excerpt from Spark-SQL physical explain plan: > > {code:java} > HiveTableScan [id#426, part_col#427], CatalogRelation `default`.`fact`, > org.apache.hadoop.hive.serde2.avro.AvroSerDe, [id#426], [part_col#427], > [isnotnull(part_col#427), (part_col#427 = 100)] > HiveTableScan [id#428, part_col#429], CatalogRelation `default`.`dim`, > 
org.apache.hadoop.hive.serde2.avro.AvroSerDe, [id#428], [part_col#429] {code} > > > As you can see, the predicate is not pushed down to the HiveTableScan of the > dim table on the outer join. >
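The safety of the missing pushdown can be sanity-checked without Spark: in a left join, a WHERE predicate on a join key of the preserved (fact) side makes it safe to filter the dim side by the same value before joining, since no surviving fact row can match a dim row with a different part_col. A minimal pure-Python model of the example above (the join implementation and row shapes are illustrative):

```python
def left_join(fact, dim):
    """Naive left join on (id, part_col); unmatched fact rows pair with None."""
    index = {(d["id"], d["part_col"]): d for d in dim}
    return [(f, index.get((f["id"], f["part_col"]))) for f in fact]

fact = [{"id": i, "part_col": p} for i, p in [(1, 100), (2, 200), (3, 100), (4, 200)]]
dim = [{"id": i, "part_col": p} for i, p in [(1, 100), (2, 200)]]

# the query: left join, then WHERE f.part_col = 100
baseline = [r for r in left_join(fact, dim) if r[0]["part_col"] == 100]
# pushed down: scan only the part_col = 100 partitions of BOTH tables
pushed = left_join([f for f in fact if f["part_col"] == 100],
                   [d for d in dim if d["part_col"] == 100])
assert baseline == pushed  # identical results, far less dim data scanned
```

This is exactly the manual workaround hinted at in the report: filter (or sub-select) the dim side on the partition value yourself until the optimizer does it for you.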
[jira] [Resolved] (SPARK-24847) ScalaReflection#schemaFor occasionally fails to detect schema for Seq of type alias
[ https://issues.apache.org/jira/browse/SPARK-24847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24847. -- Resolution: Cannot Reproduce > ScalaReflection#schemaFor occasionally fails to detect schema for Seq of type > alias > --- > > Key: SPARK-24847 > URL: https://issues.apache.org/jira/browse/SPARK-24847 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ahmed Mahran >Priority: Major > > org.apache.spark.sql.catalyst.ScalaReflection#schemaFor occasionally fails to > detect schema for Seq of type alias (and it occasionally succeeds). > > {code:java} > object Types { > type Alias1 = Long > type Alias2 = Int > type Alias3 = Int > } > case class B(b1: Alias1, b2: Seq[Alias2], b3: Option[Alias3]) > case class A(a1: B, a2: Int) > {code} > > {code} > import sparkSession.implicits._ > val seq = Seq( > A(B(2L, Seq(3), Some(1)), 1), > A(B(3L, Seq(2), Some(2)), 2) > ) > val ds = sparkSession.createDataset(seq) > {code} > > {code:java} > java.lang.UnsupportedOperationException: Schema for type Seq[Types.Alias2] is > not supported at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:780) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:715) > at > org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:824) > at > org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39) > at > org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:714) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1$$anonfun$7.apply(ScalaReflection.scala:381) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1$$anonfun$7.apply(ScalaReflection.scala:380) > at > 
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1.apply(ScalaReflection.scala:380) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1.apply(ScalaReflection.scala:150) > at > org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:824) > at > org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39) > at > org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor(ScalaReflection.scala:150) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1$$anonfun$7.apply(ScalaReflection.scala:391) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1$$anonfun$7.apply(ScalaReflection.scala:380) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1.apply(ScalaReflection.scala:380) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1.apply(ScalaReflection.scala:150) > at > org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:824) > at > org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39) > at > org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor(ScalaReflection.scala:150) > at > org.apache.spark.sql.catalyst.ScalaReflection$.deserializerFor(ScalaReflection.scala:138) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:72) > at org.apache.spark.sql.Encoders$.product(Encoders.scala:275) at > 
org.apache.spark.sql.LowPrioritySQLImplicits$class.newProductEncoder(SQLImplicits.scala:248) > at org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:34) > {code}
[jira] [Resolved] (SPARK-24853) Support Column type for withColumn and withColumnRenamed apis
[ https://issues.apache.org/jira/browse/SPARK-24853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24853. -- Resolution: Won't Fix > Support Column type for withColumn and withColumnRenamed apis > - > > Key: SPARK-24853 > URL: https://issues.apache.org/jira/browse/SPARK-24853 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 2.2.2 > Reporter: nirav patel > Priority: Major > > Can we add an overloaded version of withColumn or withColumnRenamed that accepts > a Column type instead of a String? That way I can specify the fully qualified name > in cases where there are duplicate column names, e.g., if I have 2 columns with the > same name as a result of a join and I want to rename one of the fields, I can do it > with this new API. > > This would be similar to the drop API, which supports both String and Column types. > > def > withColumn(colName: Column, col: Column): DataFrame > Returns a new Dataset by adding a column or replacing the existing column > that has the same name. > > def > withColumnRenamed(existingName: Column, newName: Column): DataFrame > Returns a new Dataset with a column renamed. >
[jira] [Updated] (SPARK-24884) Implement regexp_extract_all
[ https://issues.apache.org/jira/browse/SPARK-24884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-24884: - Component/s: (was: Spark Core) SQL > Implement regexp_extract_all > > > Key: SPARK-24884 > URL: https://issues.apache.org/jira/browse/SPARK-24884 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 2.3.1 > Reporter: Nick Nicolini > Priority: Major > > I've recently hit many cases of regexp parsing where we need to match on > something that is always arbitrary in length; for example, a text block that > looks something like: > {code:java} > AAA:WORDS| > BBB:TEXT| > MSG:ASDF| > MSG:QWER| > ... > MSG:ZXCV|{code} > Where I need to pull out all values between "MSG:" and "|", which can occur > in each instance between 1 and n times. I cannot reliably use the existing > {{regexp_extract}} method since the number of occurrences is always > arbitrary, and while I can write a UDF to handle this it'd be great if this > was supported natively in Spark. > Perhaps we can implement something like {{regexp_extract_all}} as > [Presto|https://prestodb.io/docs/current/functions/regexp.html] and > [Pig|https://pig.apache.org/docs/latest/api/org/apache/pig/builtin/REGEX_EXTRACT_ALL.html] > have? >
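For readers hitting the same need before a native function lands, the requested semantics (every match of a capture group, as in Presto's regexp_extract_all) are what Python's re.findall already provides, which is also what a PySpark UDF for this would wrap. A small standalone illustration using the text block from the description:

```python
import re

# the sample block from the report, flattened; values sit between "MSG:" and "|"
text = "AAA:WORDS|BBB:TEXT|MSG:ASDF|MSG:QWER|MSG:ZXCV|"

# regexp_extract returns only one occurrence; findall returns every capture
msgs = re.findall(r"MSG:([^|]*)\|", text)
assert msgs == ["ASDF", "QWER", "ZXCV"]
```

Wrapping this in a UDF is the interim workaround the reporter mentions; the ask here is simply to get the same behavior as a built-in SQL function.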
[jira] [Resolved] (SPARK-24885) Initialize random seeds for Rand and Randn expression during analysis
[ https://issues.apache.org/jira/browse/SPARK-24885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh resolved SPARK-24885. - Resolution: Won't Fix > Initialize random seeds for Rand and Randn expression during analysis > - > > Key: SPARK-24885 > URL: https://issues.apache.org/jira/browse/SPARK-24885 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 2.4.0 > Reporter: Liang-Chi Hsieh > Priority: Major > > Random expressions such as Rand and Randn should have the same behavior as > Uuid, in that their random seeds should be initialized at analysis.
[jira] [Updated] (SPARK-24885) Initialize random seeds for Rand and Randn expression during analysis
[ https://issues.apache.org/jira/browse/SPARK-24885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-24885: Description: Random expressions such as Rand and Randn should have the same behavior as Uuid, in that their random seeds should be initialized at analysis. was: Random expressions such as Rand and Randn should have the same behavior as Uuid that > Initialize random seeds for Rand and Randn expression during analysis > - > > Key: SPARK-24885 > URL: https://issues.apache.org/jira/browse/SPARK-24885 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 2.4.0 > Reporter: Liang-Chi Hsieh > Priority: Major > > Random expressions such as Rand and Randn should have the same behavior as > Uuid, in that their random seeds should be initialized at analysis.
[jira] [Updated] (SPARK-24885) Initialize random seeds for Rand and Randn expression during analysis
[ https://issues.apache.org/jira/browse/SPARK-24885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-24885: Summary: Initialize random seeds for Rand and Randn expression during analysis (was: Rand and Randn expression should produce same result at DataFrame on retries) > Initialize random seeds for Rand and Randn expression during analysis > - > > Key: SPARK-24885 > URL: https://issues.apache.org/jira/browse/SPARK-24885 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Liang-Chi Hsieh >Priority: Major > > Random expressions such as Rand and Randn should have the same behavior as > Uuid that -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24885) Rand and Randn expression should produce same result at DataFrame on retries
[ https://issues.apache.org/jira/browse/SPARK-24885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-24885: Description: Random expressions such as Rand and Randn should have the same behavior as Uuid that was: Random expressions such as Rand and Randn should have the same behavior as Uuid, which produces the same results for the same DataFrame on retries. > Rand and Randn expression should produce same result at DataFrame on retries > > > Key: SPARK-24885 > URL: https://issues.apache.org/jira/browse/SPARK-24885 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Liang-Chi Hsieh >Priority: Major > > Random expressions such as Rand and Randn should have the same behavior as > Uuid that -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24885) Rand and Randn expression should produce same result at DataFrame on retries
Liang-Chi Hsieh created SPARK-24885: --- Summary: Rand and Randn expression should produce same result at DataFrame on retries Key: SPARK-24885 URL: https://issues.apache.org/jira/browse/SPARK-24885 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Liang-Chi Hsieh Random expressions such as Rand and Randn should have the same behavior as Uuid, which produces the same results for the same DataFrame on retries. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
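The behavior the ticket asks for can be sketched outside Spark: an expression carries an optional seed, the analyzer assigns a concrete seed once and freezes it in the plan, and every (re)execution replays the same sequence. All class and method names below are illustrative stand-ins, not Spark internals.

```python
import random

class Rand:
    """Illustrative stand-in for Spark's Rand expression (not Spark code)."""

    def __init__(self, seed=None):
        self.seed = seed  # None until analysis assigns one

    def initialize_seed(self):
        # Analysis-time step: pick the seed once and freeze it in the plan.
        if self.seed is None:
            self.seed = random.randrange(2**31)
        return self

    def evaluate(self, num_rows):
        # Execution-time step: every (re)execution restarts from the frozen
        # seed, so a retried task over the same plan sees the same values.
        rng = random.Random(self.seed)
        return [rng.random() for _ in range(num_rows)]

expr = Rand().initialize_seed()
first = expr.evaluate(3)
second = expr.evaluate(3)  # simulated retry of the same plan
assert first == second
```

If the seed were instead drawn at execution time, the retried evaluation would diverge from the first, which is exactly the inconsistency the ticket describes.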
[jira] [Resolved] (SPARK-22228) Add support for Array so from_json can parse
[ https://issues.apache.org/jira/browse/SPARK-22228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-22228. -- Resolution: Duplicate > Add support for Array so from_json can parse > > > Key: SPARK-22228 > URL: https://issues.apache.org/jira/browse/SPARK-22228 > Project: Spark > Issue Type: Improvement > Components: Java API >Affects Versions: 2.2.0 >Reporter: kant kodali >Priority: Major > > {code:java} > val inputDS = Seq("""["foo", "bar"]""").toDF > {code} > {code:java} > inputDS.printSchema() > root > |-- value: string (nullable = true) > {code} > Input Dataset inputDS > {code:java} > inputDS.show(false) > value > - > ["foo", "bar"] > {code} > Expected output dataset outputDS > {code:java} > value > --- > "foo" | > "bar" | > {code} > Tried explode function like below but it doesn't quite work > {code:java} > inputDS.select(explode(from_json(col("value"), ArrayType(StringType > {code} > and got the following error > {code:java} > org.apache.spark.sql.AnalysisException: cannot resolve > 'jsontostructs(`value`)' due to data type mismatch: Input schema string must > be a struct or an array of structs > {code} > Also tried the following > {code:java} > inputDS.select(explode(col("value"))) > {code} > And got the following error > {code:java} > org.apache.spark.sql.AnalysisException: cannot resolve 'explode(`value`)' due > to data type mismatch: input to function explode should be array or map type, > not StringType > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
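What the reporter wants — parse a JSON array out of a single string column, then flatten it to one row per element — can be expressed as a small pure-Python analogue. The helper names (`from_json_array`, `explode_rows`) are hypothetical, chosen only to mirror the Spark functions being discussed.

```python
import json

def from_json_array(value, element_cast=str):
    """Pure-Python analogue of from_json(col, ArrayType(StringType())):
    parse one string cell into a list of typed elements."""
    parsed = json.loads(value)
    if not isinstance(parsed, list):
        raise ValueError("schema mismatch: expected a JSON array")
    return [element_cast(x) for x in parsed]

def explode_rows(rows):
    """Analogue of explode(): emit one output row per array element."""
    return [elem for arr in rows for elem in arr]

input_rows = ['["foo", "bar"]']
output = explode_rows([from_json_array(v) for v in input_rows])
assert output == ["foo", "bar"]
```

This is the two-step shape the reporter attempted (`from_json` to get an array, `explode` to flatten it); the ticket was resolved as a duplicate of the work that adds this capability to `from_json` itself.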
[jira] [Commented] (SPARK-16203) regexp_extract to return an ArrayType(StringType())
[ https://issues.apache.org/jira/browse/SPARK-16203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16552155#comment-16552155 ] Nick Nicolini commented on SPARK-16203: --- Cool, added ticket here: https://issues.apache.org/jira/browse/SPARK-24884 I think the above is the same feature that [~mmoroz] was asking for, so IMO we should close this ticket in favor of the newer one. > regexp_extract to return an ArrayType(StringType()) > --- > > Key: SPARK-16203 > URL: https://issues.apache.org/jira/browse/SPARK-16203 > Project: Spark > Issue Type: Improvement >Affects Versions: 2.0.0 >Reporter: Max Moroz >Priority: Minor > > regexp_extract only returns a single matched group. If (as is often the case > - e.g., web log parsing) we need to parse the entire line and get all the > groups, we'll need to call it as many times as there are groups. > It's only a minor annoyance syntactically. > But unless I misunderstand something, it would be very inefficient. (How > would Spark know not to do multiple pattern matching operations, when only > one is needed? Or does the optimizer actually check whether the patterns are > identical, and if they are, avoid the repeated regex matching operations??) > Would it be possible to have it return an array when the index is not > specified (defaulting to None)? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24884) Implement regexp_extract_all
[ https://issues.apache.org/jira/browse/SPARK-24884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Nicolini updated SPARK-24884: -- Description: I've recently hit many cases of regexp parsing where we need to match on something that is always arbitrary in length; for example, a text block that looks something like: {code:java} AAA:WORDS| BBB:TEXT| MSG:ASDF| MSG:QWER| ... MSG:ZXCV|{code} Where I need to pull out all values between "MSG:" and "|", which can occur in each instance between 1 and n times. I cannot reliably use the existing {{regexp_extract}} method since the number of occurrences is always arbitrary, and while I can write a UDF to handle this it'd be great if this was supported natively in Spark. Perhaps we can implement something like {{regexp_extract_all}} as [Presto|https://prestodb.io/docs/current/functions/regexp.html] and [Pig|https://pig.apache.org/docs/latest/api/org/apache/pig/builtin/REGEX_EXTRACT_ALL.html] have? was: I've recently hit many cases of regexp parsing where we need to match on something that is always arbitrary in length; for example, a text block that looks something like: {code:java} AAA:WORDS| BBB:TEXT| MSG:ASDF| MSG:QWER| ... MSG:ZXCV|{code} Where I need to pull out all values between "MSG:" and "|", which can occur in each instance between 1 and n times. I cannot reliably use the existing _regexp_extract_ method since the number of occurrences is always arbitrary, and while I can write a UDF to handle this it'd be great if this was supported natively in Spark. Perhaps we can implement something like _regexp_extract_all_ as [Presto|https://prestodb.io/docs/current/functions/regexp.html] and [Pig|https://pig.apache.org/docs/latest/api/org/apache/pig/builtin/REGEX_EXTRACT_ALL.html] have? 
{noformat} *no* further _formatting_ is done here{noformat} > Implement regexp_extract_all > > > Key: SPARK-24884 > URL: https://issues.apache.org/jira/browse/SPARK-24884 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: Nick Nicolini >Priority: Major > > I've recently hit many cases of regexp parsing where we need to match on > something that is always arbitrary in length; for example, a text block that > looks something like: > {code:java} > AAA:WORDS| > BBB:TEXT| > MSG:ASDF| > MSG:QWER| > ... > MSG:ZXCV|{code} > Where I need to pull out all values between "MSG:" and "|", which can occur > in each instance between 1 and n times. I cannot reliably use the existing > {{regexp_extract}} method since the number of occurrences is always > arbitrary, and while I can write a UDF to handle this it'd be great if this > was supported natively in Spark. > Perhaps we can implement something like {{regexp_extract_all}} as > [Presto|https://prestodb.io/docs/current/functions/regexp.html] and > [Pig|https://pig.apache.org/docs/latest/api/org/apache/pig/builtin/REGEX_EXTRACT_ALL.html] > have? > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24884) Implement regexp_extract_all
Nick Nicolini created SPARK-24884: - Summary: Implement regexp_extract_all Key: SPARK-24884 URL: https://issues.apache.org/jira/browse/SPARK-24884 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.3.1 Reporter: Nick Nicolini I've recently hit many cases of regexp parsing where we need to match on something that is always arbitrary in length; for example, a text block that looks something like: {code:java} AAA:WORDS| BBB:TEXT| MSG:ASDF| MSG:QWER| ... MSG:ZXCV|{code} Where I need to pull out all values between "MSG:" and "|", which can occur in each instance between 1 and n times. I cannot reliably use the existing _regexp_extract_ method since the number of occurrences is always arbitrary, and while I can write a UDF to handle this it'd be great if this was supported natively in Spark. Perhaps we can implement something like _regexp_extract_all_ as [Presto|https://prestodb.io/docs/current/functions/regexp.html] and [Pig|https://pig.apache.org/docs/latest/api/org/apache/pig/builtin/REGEX_EXTRACT_ALL.html] have? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24884) Implement regexp_extract_all
[ https://issues.apache.org/jira/browse/SPARK-24884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Nicolini updated SPARK-24884: -- Description: I've recently hit many cases of regexp parsing where we need to match on something that is always arbitrary in length; for example, a text block that looks something like: {code:java} AAA:WORDS| BBB:TEXT| MSG:ASDF| MSG:QWER| ... MSG:ZXCV|{code} Where I need to pull out all values between "MSG:" and "|", which can occur in each instance between 1 and n times. I cannot reliably use the existing _regexp_extract_ method since the number of occurrences is always arbitrary, and while I can write a UDF to handle this it'd be great if this was supported natively in Spark. Perhaps we can implement something like _regexp_extract_all_ as [Presto|https://prestodb.io/docs/current/functions/regexp.html] and [Pig|https://pig.apache.org/docs/latest/api/org/apache/pig/builtin/REGEX_EXTRACT_ALL.html] have? {noformat} *no* further _formatting_ is done here{noformat} was: I've recently hit many cases of regexp parsing where we need to match on something that is always arbitrary in length; for example, a text block that looks something like: {code:java} AAA:WORDS| BBB:TEXT| MSG:ASDF| MSG:QWER| ... MSG:ZXCV|{code} Where I need to pull out all values between "MSG:" and "|", which can occur in each instance between 1 and n times. I cannot reliably use the existing _regexp_extract_ method since the number of occurrences is always arbitrary, and while I can write a UDF to handle this it'd be great if this was supported natively in Spark. Perhaps we can implement something like _regexp_extract_all_ as [Presto|https://prestodb.io/docs/current/functions/regexp.html] and [Pig|https://pig.apache.org/docs/latest/api/org/apache/pig/builtin/REGEX_EXTRACT_ALL.html] have? 
> Implement regexp_extract_all > > > Key: SPARK-24884 > URL: https://issues.apache.org/jira/browse/SPARK-24884 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: Nick Nicolini >Priority: Major > > I've recently hit many cases of regexp parsing where we need to match on > something that is always arbitrary in length; for example, a text block that > looks something like: > {code:java} > AAA:WORDS| > BBB:TEXT| > MSG:ASDF| > MSG:QWER| > ... > MSG:ZXCV|{code} > Where I need to pull out all values between "MSG:" and "|", which can occur > in each instance between 1 and n times. I cannot reliably use the existing > _regexp_extract_ method since the number of occurrences is always arbitrary, > and while I can write a UDF to handle this it'd be great if this was > supported natively in Spark. > Perhaps we can implement something like _regexp_extract_all_ as > [Presto|https://prestodb.io/docs/current/functions/regexp.html] and > [Pig|https://pig.apache.org/docs/latest/api/org/apache/pig/builtin/REGEX_EXTRACT_ALL.html] > have? > > {noformat} > *no* further _formatting_ is done here{noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
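The proposed semantics — return every match of a capture group rather than just the first, as Presto's and Pig's `REGEX_EXTRACT_ALL` do — can be sketched in a few lines of pure Python. This is only an illustration of the requested behavior, not Spark's eventual implementation.

```python
import re

def regexp_extract_all(value, pattern, idx=1):
    """Sketch of the proposed regexp_extract_all: return capture group `idx`
    from every match, where regexp_extract would return only the first."""
    return [m.group(idx) for m in re.finditer(pattern, value)]

# The reporter's text block: "MSG:" values occur an arbitrary number of times.
text = "AAA:WORDS|BBB:TEXT|MSG:ASDF|MSG:QWER|MSG:ZXCV|"
assert regexp_extract_all(text, r"MSG:([^|]*)\|") == ["ASDF", "QWER", "ZXCV"]
```

A single `regexp_extract` call cannot express this, because the number of occurrences is unknown ahead of time, which is why the reporter currently falls back to a UDF.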
[jira] [Updated] (SPARK-16483) Unifying struct fields and columns
[ https://issues.apache.org/jira/browse/SPARK-16483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simeon Simeonov updated SPARK-16483: Description: This issue comes as a result of an exchange with Michael Armbrust outside of the usual JIRA/dev list channels. DataFrame provides a full set of manipulation operations for top-level columns. They have be added, removed, modified and renamed. The same is not true about fields inside structs yet, from a logical standpoint, Spark users may very well want to perform the same operations on struct fields, especially since automatic schema discovery from JSON input tends to create deeply nested structs. Common use-cases include: - Remove and/or rename struct field(s) to adjust the schema - Fix a data quality issue with a struct field (update/rewrite) To do this with the existing API by hand requires manually calling {{named_struct}} and listing all fields, including ones we don't want to manipulate. This leads to complex, fragile code that cannot survive schema evolution. It would be far better if the various APIs that can now manipulate top-level columns were extended to handle struct fields at arbitrary locations or, alternatively, if we introduced new APIs for modifying any field in a dataframe, whether it is a top-level one or one nested inside a struct. Purely for discussion purposes (overloaded methods are not shown): {code:java} class Column(val expr: Expression) extends Logging { // ... // matches Dataset.schema semantics def schema: StructType // matches Dataset.select() semantics // '* support allows multiple new fields to be added easily, saving cumbersome repeated withColumn() calls def select(cols: Column*): Column // matches Dataset.withColumn() semantics of add or replace def withColumn(colName: String, col: Column): Column // matches Dataset.drop() semantics def drop(colName: String): Column } class Dataset[T] ... { // ... 
// Equivalent to sparkSession.createDataset(toDF.rdd, newSchema) def cast(newShema: StructType): DataFrame } {code} The benefit of the above API is that it unifies manipulating top-level & nested columns. The addition of {{schema}} and {{select()}} to {{Column}} allows for nested field reordering, casting, etc., which is important in data exchange scenarios where field position matters. That's also the reason to add {{cast}} to {{Dataset}}: it improves consistency and readability (with method chaining). Another way to think of {{Dataset.cast}} is as the Spark schema equivalent of {{Dataset.as}}. {{as}} is to {{cast}} as a Scala encodable type is to a {{StructType}} instance. was: This issue comes as a result of an exchange with Michael Armbrust outside of the usual JIRA/dev list channels. DataFrame provides a full set of manipulation operations for top-level columns. They have be added, removed, modified and renamed. The same is not true about fields inside structs yet, from a logical standpoint, Spark users may very well want to perform the same operations on struct fields, especially since automatic schema discovery from JSON input tends to create deeply nested structs. Common use-cases include: - Remove and/or rename struct field(s) to adjust the schema - Fix a data quality issue with a struct field (update/rewrite) To do this with the existing API by hand requires manually calling {{named_struct}} and listing all fields, including ones we don't want to manipulate. This leads to complex, fragile code that cannot survive schema evolution. It would be far better if the various APIs that can now manipulate top-level columns were extended to handle struct fields at arbitrary locations or, alternatively, if we introduced new APIs for modifying any field in a dataframe, whether it is a top-level one or one nested inside a struct. Purely for discussion purposes (overloaded methods are not shown): {code:java} class Column(val expr: Expression) extends Logging { // ... 
// matches Dataset.schema semantics def schema: StructType // matches Dataset.select() semantics // '* support allows multiple new fields to be added easily, saving cumbersome repeated withColumn() calls def select(cols: Column*): Column // matches Dataset.withColumn() semantics of add or replace def withColumn(colName: String, col: Column): Column // matches Dataset.drop() semantics def drop(colName: String): Column } class Dataset[T] ... { // ... // Equivalent to sparkSession.createDataset(toDF.rdd, newSchema) def cast(newSchema: StructType): DataFrame } {code} The benefit of the above API is that it unifies manipulating top-level & nested columns. The addition of {{schema}} and {{select()}} to {{Column}} allows for nested field reordering, casting, etc., which is important in data exchange scenarios where field position matters. That's also the reason to add {{cast}} to {{Data
[jira] [Commented] (SPARK-24768) Have a built-in AVRO data source implementation
[ https://issues.apache.org/jira/browse/SPARK-24768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16552125#comment-16552125 ] Apache Spark commented on SPARK-24768: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/21841 > Have a built-in AVRO data source implementation > --- > > Key: SPARK-24768 > URL: https://issues.apache.org/jira/browse/SPARK-24768 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Priority: Major > Attachments: Built-in AVRO Data Source In Spark 2.4.pdf > > > Apache Avro (https://avro.apache.org) is a popular data serialization format. > It is widely used in the Spark and Hadoop ecosystem, especially for > Kafka-based data pipelines. Using the external package > [https://github.com/databricks/spark-avro], Spark SQL can read and write the > avro data. Making spark-Avro built-in can provide a better experience for > first-time users of Spark SQL and structured streaming. We expect the > built-in Avro data source can further improve the adoption of structured > streaming. The proposal is to inline code from spark-avro package > ([https://github.com/databricks/spark-avro]). The target release is Spark > 2.4. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24883) Remove implicit class AvroDataFrameWriter/AvroDataFrameReader
Gengliang Wang created SPARK-24883: -- Summary: Remove implicit class AvroDataFrameWriter/AvroDataFrameReader Key: SPARK-24883 URL: https://issues.apache.org/jira/browse/SPARK-24883 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.4.0 Reporter: Gengliang Wang As per Reynold's comment: [https://github.com/apache/spark/pull/21742#discussion_r203496489] It makes sense to remove the implicit class AvroDataFrameWriter/AvroDataFrameReader, since the Avro package is an external module. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24869) SaveIntoDataSourceCommand's input Dataset does not use Cached Data
[ https://issues.apache.org/jira/browse/SPARK-24869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16552103#comment-16552103 ] Xiao Li commented on SPARK-24869: - [~maropu] Just updated the test case in the JIRA. Forgot to cache it when I copy and paste the example. I think the fix in your branch is not in a right direction. Do you know why the cached data is not used? > SaveIntoDataSourceCommand's input Dataset does not use Cached Data > -- > > Key: SPARK-24869 > URL: https://issues.apache.org/jira/browse/SPARK-24869 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Xiao Li >Priority: Major > > {code} > withTable("t") { > withTempPath { path => > var numTotalCachedHit = 0 > val listener = new QueryExecutionListener { > override def onFailure(f: String, qe: QueryExecution, e: > Exception):Unit = {} > override def onSuccess(funcName: String, qe: QueryExecution, > duration: Long): Unit = { > qe.withCachedData match { > case c: SaveIntoDataSourceCommand > if c.query.isInstanceOf[InMemoryRelation] => > numTotalCachedHit += 1 > case _ => > println(qe.withCachedData) > } > } > } > spark.listenerManager.register(listener) > val udf1 = udf({ (x: Int, y: Int) => x + y }) > val df = spark.range(0, 3).toDF("a") > .withColumn("b", udf1(col("a"), lit(10))) > df.cache() > df.write.mode(SaveMode.Overwrite).jdbc(url1, "TEST.DROPTEST", > properties) > assert(numTotalCachedHit == 1, "expected to be cached in jdbc") > } > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24869) SaveIntoDataSourceCommand's input Dataset does not use Cached Data
[ https://issues.apache.org/jira/browse/SPARK-24869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-24869: Description: {code} withTable("t") { withTempPath { path => var numTotalCachedHit = 0 val listener = new QueryExecutionListener { override def onFailure(f: String, qe: QueryExecution, e: Exception):Unit = {} override def onSuccess(funcName: String, qe: QueryExecution, duration: Long): Unit = { qe.withCachedData match { case c: SaveIntoDataSourceCommand if c.query.isInstanceOf[InMemoryRelation] => numTotalCachedHit += 1 case _ => println(qe.withCachedData) } } } spark.listenerManager.register(listener) val udf1 = udf({ (x: Int, y: Int) => x + y }) val df = spark.range(0, 3).toDF("a") .withColumn("b", udf1(col("a"), lit(10))) df.cache() df.write.mode(SaveMode.Overwrite).jdbc(url1, "TEST.DROPTEST", properties) assert(numTotalCachedHit == 1, "expected to be cached in jdbc") } } {code} was: {code} withTable("t") { withTempPath { path => var numTotalCachedHit = 0 val listener = new QueryExecutionListener { override def onFailure(f: String, qe: QueryExecution, e: Exception):Unit = {} override def onSuccess(funcName: String, qe: QueryExecution, duration: Long): Unit = { qe.withCachedData match { case c: SaveIntoDataSourceCommand if c.query.isInstanceOf[InMemoryRelation] => numTotalCachedHit += 1 case _ => println(qe.withCachedData) } } } spark.listenerManager.register(listener) val udf1 = udf({ (x: Int, y: Int) => x + y }) val df = spark.range(0, 3).toDF("a") .withColumn("b", udf1(col("a"), lit(10))) df.write.mode(SaveMode.Overwrite).jdbc(url1, "TEST.DROPTEST", properties) assert(numTotalCachedHit == 1, "expected to be cached in jdbc") } } {code} > SaveIntoDataSourceCommand's input Dataset does not use Cached Data > -- > > Key: SPARK-24869 > URL: https://issues.apache.org/jira/browse/SPARK-24869 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Xiao Li >Priority: Major > > {code} > 
withTable("t") { > withTempPath { path => > var numTotalCachedHit = 0 > val listener = new QueryExecutionListener { > override def onFailure(f: String, qe: QueryExecution, e: > Exception):Unit = {} > override def onSuccess(funcName: String, qe: QueryExecution, > duration: Long): Unit = { > qe.withCachedData match { > case c: SaveIntoDataSourceCommand > if c.query.isInstanceOf[InMemoryRelation] => > numTotalCachedHit += 1 > case _ => > println(qe.withCachedData) > } > } > } > spark.listenerManager.register(listener) > val udf1 = udf({ (x: Int, y: Int) => x + y }) > val df = spark.range(0, 3).toDF("a") > .withColumn("b", udf1(col("a"), lit(10))) > df.cache() > df.write.mode(SaveMode.Overwrite).jdbc(url1, "TEST.DROPTEST", > properties) > assert(numTotalCachedHit == 1, "expected to be cached in jdbc") > } > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23206) Additional Memory Tuning Metrics
[ https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16552087#comment-16552087 ] James commented on SPARK-23206: --- Hi [~elu] If I want to know the CPU metrics of executor level, what kind of API could I use? Recently I am doing a project which needs the CPU metrics. Thx > Additional Memory Tuning Metrics > > > Key: SPARK-23206 > URL: https://issues.apache.org/jira/browse/SPARK-23206 > Project: Spark > Issue Type: Umbrella > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Edwina Lu >Priority: Major > Attachments: ExecutorsTab.png, ExecutorsTab2.png, > MemoryTuningMetricsDesignDoc.pdf, SPARK-23206 Design Doc.pdf, StageTab.png > > > At LinkedIn, we have multiple clusters, running thousands of Spark > applications, and these numbers are growing rapidly. We need to ensure that > these Spark applications are well tuned – cluster resources, including > memory, should be used efficiently so that the cluster can support running > more applications concurrently, and applications should run quickly and > reliably. > Currently there is limited visibility into how much memory executors are > using, and users are guessing numbers for executor and driver memory sizing. > These estimates are often much larger than needed, leading to memory wastage. > Examining the metrics for one cluster for a month, the average percentage of > used executor memory (max JVM used memory across executors / > spark.executor.memory) is 35%, leading to an average of 591GB unused memory > per application (number of executors * (spark.executor.memory - max JVM used > memory)). Spark has multiple memory regions (user memory, execution memory, > storage memory, and overhead memory), and to understand how memory is being > used and fine-tune allocation between regions, it would be useful to have > information about how much memory is being used for the different regions. 
> To improve visibility into memory usage for the driver and executors and > different memory regions, the following additional memory metrics can be be > tracked for each executor and driver: > * JVM used memory: the JVM heap size for the executor/driver. > * Execution memory: memory used for computation in shuffles, joins, sorts > and aggregations. > * Storage memory: memory used caching and propagating internal data across > the cluster. > * Unified memory: sum of execution and storage memory. > The peak values for each memory metric can be tracked for each executor, and > also per stage. This information can be shown in the Spark UI and the REST > APIs. Information for peak JVM used memory can help with determining > appropriate values for spark.executor.memory and spark.driver.memory, and > information about the unified memory region can help with determining > appropriate values for spark.memory.fraction and > spark.memory.storageFraction. Stage memory information can help identify > which stages are most memory intensive, and users can look into the relevant > code to determine if it can be optimized. > The memory metrics can be gathered by adding the current JVM used memory, > execution memory and storage memory to the heartbeat. SparkListeners are > modified to collect the new metrics for the executors, stages and Spark > history log. Only interesting values (peak values per stage per executor) are > recorded in the Spark history log, to minimize the amount of additional > logging. > We have attached our design documentation with this ticket and would like to > receive feedback from the community for this proposal. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
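The wastage figure in the description follows directly from the stated formula: number of executors * (spark.executor.memory - max JVM used memory). A small sketch with hypothetical inputs shows the arithmetic; the real inputs come from cluster telemetry and are not part of this ticket.

```python
def unused_memory_gb(num_executors, executor_memory_gb, peak_jvm_used_gb):
    """Per-application unused memory, per the formula in the description:
    number of executors * (spark.executor.memory - max JVM used memory)."""
    return num_executors * (executor_memory_gb - peak_jvm_used_gb)

# Hypothetical application: 100 executors of 9 GB each, peaking at the
# ~35% average utilization the description reports for one cluster.
waste = unused_memory_gb(100, 9.0, 9.0 * 0.35)
assert round(waste, 1) == 585.0  # same order as the 591 GB average cited
```

The proposed peak-per-stage-per-executor metrics would let users plug their own measured peaks into this calculation instead of guessing.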
[jira] [Updated] (SPARK-16483) Unifying struct fields and columns
[ https://issues.apache.org/jira/browse/SPARK-16483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simeon Simeonov updated SPARK-16483: Affects Version/s: 2.3.1 Description: This issue comes as a result of an exchange with Michael Armbrust outside of the usual JIRA/dev list channels. DataFrame provides a full set of manipulation operations for top-level columns. They have be added, removed, modified and renamed. The same is not true about fields inside structs yet, from a logical standpoint, Spark users may very well want to perform the same operations on struct fields, especially since automatic schema discovery from JSON input tends to create deeply nested structs. Common use-cases include: - Remove and/or rename struct field(s) to adjust the schema - Fix a data quality issue with a struct field (update/rewrite) To do this with the existing API by hand requires manually calling {{named_struct}} and listing all fields, including ones we don't want to manipulate. This leads to complex, fragile code that cannot survive schema evolution. It would be far better if the various APIs that can now manipulate top-level columns were extended to handle struct fields at arbitrary locations or, alternatively, if we introduced new APIs for modifying any field in a dataframe, whether it is a top-level one or one nested inside a struct. Purely for discussion purposes (overloaded methods are not shown): {code:java} class Column(val expr: Expression) extends Logging { // ... // matches Dataset.schema semantics def schema: StructType // matches Dataset.select() semantics // '* support allows multiple new fields to be added easily, saving cumbersome repeated withColumn() calls def select(cols: Column*): Column // matches Dataset.withColumn() semantics of add or replace def withColumn(colName: String, col: Column): Column // matches Dataset.drop() semantics def drop(colName: String): Column } class Dataset[T] ... { // ... 
// Equivalent to sparkSession.createDataset(toDF.rdd, newSchema) def cast(newShema: StructType): DataFrame } {code} The benefit of the above API is that it unifies manipulating top-level & nested columns. The addition of {{schema}} and {{select()}} to {{Column}} allows for nested field reordering, casting, etc., which is important in data exchange scenarios where field position matters. That's also the reason to add {{cast}} to {{Dataset}}: it improves consistency and readability (with method chaining). was: This issue comes as a result of an exchange with Michael Armbrust outside of the usual JIRA/dev list channels. DataFrame provides a full set of manipulation operations for top-level columns. They have be added, removed, modified and renamed. The same is not true about fields inside structs yet, from a logical standpoint, Spark users may very well want to perform the same operations on struct fields, especially since automatic schema discovery from JSON input tends to create deeply nested structs. Common use-cases include: - Remove and/or rename struct field(s) to adjust the schema - Fix a data quality issue with a struct field (update/rewrite) To do this with the existing API by hand requires manually calling {{named_struct}} and listing all fields, including ones we don't want to manipulate. This leads to complex, fragile code that cannot survive schema evolution. It would be far better if the various APIs that can now manipulate top-level columns were extended to handle struct fields at arbitrary locations or, alternatively, if we introduced new APIs for modifying any field in a dataframe, whether it is a top-level one or one nested inside a struct. Purely for discussion purposes, here is the skeleton implementation of an update() implicit that we've use to modify any existing field in a dataframe. (Note that it depends on various other utilities and implicits that are not included). 
https://gist.github.com/ssimeonov/f98dcfa03cd067157fa08aaa688b0f66 > Unifying struct fields and columns > -- > > Key: SPARK-16483 > URL: https://issues.apache.org/jira/browse/SPARK-16483 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.1 >Reporter: Simeon Simeonov >Priority: Major > Labels: sql > > This issue comes as a result of an exchange with Michael Armbrust outside of > the usual JIRA/dev list channels. > DataFrame provides a full set of manipulation operations for top-level > columns. They have be added, removed, modified and renamed. The same is not > true about fields inside structs yet, from a logical standpoint, Spark users > may very well want to perform the same operations on struct fields, > especially since automatic schema discovery from JSON input tends to create > d
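The fragility of the hand-rolled {{named_struct}} approach, and the generic field update the proposed API would provide instead, can be sketched outside Spark with plain nested records. This is a minimal illustrative analogue in Python, not Spark code; the with_field helper and the sample record are hypothetical:

```python
def with_field(row, path, value):
    # Return a copy of a nested record with the field at dotted `path`
    # replaced, leaving every sibling field untouched. This is the generic
    # operation the proposal asks for: callers never enumerate the fields
    # they do not want to change, so the code survives schema evolution.
    head, *rest = path.split(".")
    updated = dict(row)  # shallow copy; untouched siblings carry over
    updated[head] = with_field(row[head], ".".join(rest), value) if rest else value
    return updated

event = {"user": {"id": 1, "name": "ann"}, "ts": 100}
fixed = with_field(event, "user.name", "anne")
assert fixed == {"user": {"id": 1, "name": "anne"}, "ts": 100}
assert event["user"]["name"] == "ann"  # the original record is unchanged
```

A named_struct-style rewrite of the same record would instead have to list id, name, and ts explicitly, and would silently drop any field added to the schema later.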
[jira] [Resolved] (SPARK-22562) CachedKafkaConsumer unsafe eviction from cache
[ https://issues.apache.org/jira/browse/SPARK-22562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-22562. --- Resolution: Not A Problem > CachedKafkaConsumer unsafe eviction from cache > -- > > Key: SPARK-22562 > URL: https://issues.apache.org/jira/browse/SPARK-22562 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.2.0 >Reporter: Dariusz Szablinski >Priority: Major > > From time to time a job fails because of > "java.util.ConcurrentModificationException: KafkaConsumer is not safe for > multi-threaded access" (full stacktrace below). I think it happens when one > thread wants to add a new consumer into fully packed cache and another one > still uses an instance of cached consumer which is marked for eviction. > java.util.ConcurrentModificationException: KafkaConsumer is not safe for > multi-threaded access > at > org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1431) > at > org.apache.kafka.clients.consumer.KafkaConsumer.close(KafkaConsumer.java:1361) > at > org.apache.spark.streaming.kafka010.CachedKafkaConsumer$$anon$1.removeEldestEntry(CachedKafkaConsumer.scala:128) > at java.util.LinkedHashMap.afterNodeInsertion(LinkedHashMap.java:299) > at java.util.HashMap.putVal(HashMap.java:663) > at java.util.HashMap.put(HashMap.java:611) > at > org.apache.spark.streaming.kafka010.CachedKafkaConsumer$.get(CachedKafkaConsumer.scala:158) > at > org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.(KafkaRDD.scala:206) > at org.apache.spark.streaming.kafka010.KafkaRDD.compute(KafkaRDD.scala:181) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
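The stack trace above shows the race: removeEldestEntry closes the eldest consumer while another thread may still be polling it. One possible mitigation, sketched here in Python with hypothetical names (ConsumerCache, FakeConsumer) rather than the actual Spark classes, is to defer closing an evicted consumer until its current borrower releases it:

```python
import threading
from collections import OrderedDict

class ConsumerCache:
    """Bounded cache that never closes an entry still held by another
    thread; eviction of an in-use entry is deferred until release()."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.lock = threading.Lock()
        self.entries = OrderedDict()   # key -> consumer, oldest first
        self.in_use = set()            # keys currently borrowed
        self.pending_close = {}        # key -> consumer evicted while borrowed

    def acquire(self, key, factory):
        with self.lock:
            consumer = self.entries.pop(key, None) or factory()
            self.entries[key] = consumer      # re-insert as most recent
            self.in_use.add(key)
            while len(self.entries) > self.capacity:
                eldest = next(iter(self.entries))
                if eldest == key:
                    break                     # never evict what we just took
                victim = self.entries.pop(eldest)
                if eldest in self.in_use:
                    self.pending_close[eldest] = victim  # defer the close
                else:
                    victim.close()            # safe: nobody holds it
            return consumer

    def release(self, key):
        with self.lock:
            self.in_use.discard(key)
            victim = self.pending_close.pop(key, None)
            if victim is not None:
                victim.close()                # now safe to close

class FakeConsumer:
    def __init__(self):
        self.closed = False
    def close(self):
        self.closed = True

cache = ConsumerCache(capacity=1)
a = cache.acquire("topic-0", FakeConsumer)
b = cache.acquire("topic-1", FakeConsumer)  # evicts topic-0, still borrowed
assert a.closed is False                    # not closed under the borrower
cache.release("topic-0")
assert a.closed is True                     # closed once released
```

This is only a sketch of the idea; the actual fix in Spark may look quite different.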
[jira] [Commented] (SPARK-24882) separate responsibilities of the data source v2 read API
[ https://issues.apache.org/jira/browse/SPARK-24882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16552041#comment-16552041 ] Wenchen Fan commented on SPARK-24882: - cc people we may be interested: [~rxin] [~rdblue] [~marmbrus] [~LI,Xiao] [~Gengliang.Wang] > separate responsibilities of the data source v2 read API > > > Key: SPARK-24882 > URL: https://issues.apache.org/jira/browse/SPARK-24882 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > > Data source V2 is out for a while, see the SPIP > [here|https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit?usp=sharing]. > We have already migrated most of the built-in streaming data sources to the > V2 API, and the file source migration is in progress. During the migration, > we found several problems and want to address them before we stabilize the V2 > API. > To solve these problems, we need to separate responsibilities in the data > source v2 read API. Details please see the attached google doc: > https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?usp=sharing -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24882) separate responsibilities of the data source v2 read API
[ https://issues.apache.org/jira/browse/SPARK-24882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16552041#comment-16552041 ] Wenchen Fan edited comment on SPARK-24882 at 7/22/18 2:49 PM: -- cc people who may be interested: [~rxin] [~rdblue] [~marmbrus] [~LI,Xiao] [~Gengliang.Wang] was (Author: cloud_fan): cc people we may be interested: [~rxin] [~rdblue] [~marmbrus] [~LI,Xiao] [~Gengliang.Wang] > separate responsibilities of the data source v2 read API > > > Key: SPARK-24882 > URL: https://issues.apache.org/jira/browse/SPARK-24882 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > > Data source V2 is out for a while, see the SPIP > [here|https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit?usp=sharing]. > We have already migrated most of the built-in streaming data sources to the > V2 API, and the file source migration is in progress. During the migration, > we found several problems and want to address them before we stabilize the V2 > API. > To solve these problems, we need to separate responsibilities in the data > source v2 read API. Details please see the attached google doc: > https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?usp=sharing -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24339) spark sql can not prune column in transform/map/reduce query
[ https://issues.apache.org/jira/browse/SPARK-24339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16552040#comment-16552040 ] Apache Spark commented on SPARK-24339: -- User 'xuanyuanking' has created a pull request for this issue: https://github.com/apache/spark/pull/21839 > spark sql can not prune column in transform/map/reduce query > > > Key: SPARK-24339 > URL: https://issues.apache.org/jira/browse/SPARK-24339 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: xdcjie >Priority: Minor > Labels: map, reduce, sql, transform > > I was using {{TRANSFORM USING}} with branch-2.1/2.2, and noticed that it will > scan all column of data, query like: > {code:java} > SELECT TRANSFORM(usid, cch) USING 'python test.py' AS (u1, c1, u2, c2) FROM > test_table;{code} > it's physical plan like: > {code:java} > == Physical Plan == > ScriptTransformation [usid#17, cch#9], python test.py, [u1#784, c1#785, > u2#786, c2#787], > HiveScriptIOSchema(List(),List(),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),List((field.delim, > )),List((field.delim, > )),Some(org.apache.hadoop.hive.ql.exec.TextRecordReader),Some(org.apache.hadoop.hive.ql.exec.TextRecordWriter),false) > +- FileScan parquet > [sh#0L,clk#1L,chg#2L,qey#3,ship#4,chgname#5,sid#6,bid#7,dis#8L,cch#9,wch#10,wid#11L,arank#12L,rtag#13,iid#14,uid#15,pid#16,usid#17,wdid#18,bid#19,oqleft#20,oqright#21,poqvalue#22,tm#23,... > 367 more fields] Batched: false, Format: Parquet, Location: > InMemoryFileIndex[file:/Users/Downloads/part-r-00093-0ef5d59f-2e08-4085-9b46-458a1652932a.g..., > PartitionFilters: [], PushedFilters: [], ReadSchema: > struct {code} > In our scenario, parquet has 400 column, this query will take more time. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24882) separate responsibilities of the data source v2 read API
[ https://issues.apache.org/jira/browse/SPARK-24882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-24882: External issue URL: (was: https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?usp=sharing) > separate responsibilities of the data source v2 read API > > > Key: SPARK-24882 > URL: https://issues.apache.org/jira/browse/SPARK-24882 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > > Data source V2 is out for a while, see the SPIP > [here|https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit?usp=sharing]. > We have already migrated most of the built-in streaming data sources to the > V2 API, and the file source migration is in progress. During the migration, > we found several problems and want to address them before we stabilize the V2 > API. > To solve these problems, we need to separate responsibilities in the data > source v2 read API. Details please see the attached google doc. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24882) separate responsibilities of the data source v2 read API
[ https://issues.apache.org/jira/browse/SPARK-24882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-24882: Description: Data source V2 is out for a while, see the SPIP [here|https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit?usp=sharing]. We have already migrated most of the built-in streaming data sources to the V2 API, and the file source migration is in progress. During the migration, we found several problems and want to address them before we stabilize the V2 API. To solve these problems, we need to separate responsibilities in the data source v2 read API. Details please see the attached google doc: https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?usp=sharing was: Data source V2 is out for a while, see the SPIP [here|https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit?usp=sharing]. We have already migrated most of the built-in streaming data sources to the V2 API, and the file source migration is in progress. During the migration, we found several problems and want to address them before we stabilize the V2 API. To solve these problems, we need to separate responsibilities in the data source v2 read API. Details please see the attached google doc. > separate responsibilities of the data source v2 read API > > > Key: SPARK-24882 > URL: https://issues.apache.org/jira/browse/SPARK-24882 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > > Data source V2 is out for a while, see the SPIP > [here|https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit?usp=sharing]. > We have already migrated most of the built-in streaming data sources to the > V2 API, and the file source migration is in progress. 
During the migration, > we found several problems and want to address them before we stabilize the V2 > API. > To solve these problems, we need to separate responsibilities in the data > source v2 read API. Details please see the attached google doc: > https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?usp=sharing -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24882) separate responsibilities of the data source v2 read API
[ https://issues.apache.org/jira/browse/SPARK-24882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-24882: External issue URL: https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?usp=sharing > separate responsibilities of the data source v2 read API > > > Key: SPARK-24882 > URL: https://issues.apache.org/jira/browse/SPARK-24882 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > > Data source V2 is out for a while, see the SPIP > [here|https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit?usp=sharing]. > We have already migrated most of the built-in streaming data sources to the > V2 API, and the file source migration is in progress. During the migration, > we found several problems and want to address them before we stabilize the V2 > API. > To solve these problems, we need to separate responsibilities in the data > source v2 read API. Details please see the attached google doc. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24882) separate responsibilities of the data source v2 read API
Wenchen Fan created SPARK-24882: --- Summary: separate responsibilities of the data source v2 read API Key: SPARK-24882 URL: https://issues.apache.org/jira/browse/SPARK-24882 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Wenchen Fan Assignee: Wenchen Fan Data source V2 has been out for a while; see the SPIP [here|https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit?usp=sharing]. We have already migrated most of the built-in streaming data sources to the V2 API, and the file source migration is in progress. During the migration, we found several problems and want to address them before we stabilize the V2 API. To solve these problems, we need to separate responsibilities in the data source v2 read API. For details, please see the attached Google doc. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24811) Add function `from_avro` and `to_avro`
[ https://issues.apache.org/jira/browse/SPARK-24811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16551997#comment-16551997 ] Apache Spark commented on SPARK-24811: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/21838 > Add function `from_avro` and `to_avro` > -- > > Key: SPARK-24811 > URL: https://issues.apache.org/jira/browse/SPARK-24811 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 2.4.0 > > > Add a new function from_avro for parsing a binary column of avro format and > converting it into its corresponding catalyst value. > Add a new function to_avro for converting a column into binary of avro format > with the specified schema. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL
[ https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16551968#comment-16551968 ] Dilip Biswal commented on SPARK-21274: -- [~maropu] Thanks a lot for the info. > Implement EXCEPT ALL and INTERSECT ALL > -- > > Key: SPARK-21274 > URL: https://issues.apache.org/jira/browse/SPARK-21274 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: Ruslan Dautkhanov >Priority: Major > > 1) *EXCEPT ALL* / MINUS ALL : > {code} > SELECT a,b,c FROM tab1 > EXCEPT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following outer join: > {code} > SELECT a,b,c > FROMtab1 t1 > LEFT OUTER JOIN > tab2 t2 > ON ( > (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c) > ) > WHERE > COALESCE(t2.a, t2.b, t2.c) IS NULL > {code} > (register as a temp.view this second query under "*t1_except_t2_df*" name > that can be also used to find INTERSECT ALL below): > 2) *INTERSECT ALL*: > {code} > SELECT a,b,c FROM tab1 > INTERSECT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following anti-join using t1_except_t2_df we defined > above: > {code} > SELECT a,b,c > FROMtab1 t1 > WHERE >NOT EXISTS >(SELECT 1 > FROMt1_except_t2_df e > WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c) >) > {code} > So the suggestion is just to use above query rewrites to implement both > EXCEPT ALL and INTERSECT ALL sql set operations. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
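For reference, the multiset semantics that EXCEPT ALL and INTERSECT ALL are meant to implement (each row kept max(l - r, 0) or min(l, r) times, where l and r are the row's counts on each side) can be modeled with counters. This is an illustrative Python model of the semantics, not the proposed Spark implementation:

```python
from collections import Counter

def except_all(left, right):
    # Multiset difference: a row appearing l times on the left and r times
    # on the right survives max(l - r, 0) times.
    counts = Counter(left)
    counts.subtract(Counter(right))
    return sorted(row for row, n in counts.items() for _ in range(max(n, 0)))

def intersect_all(left, right):
    # Multiset intersection: each row kept min(l, r) times.
    return sorted((Counter(left) & Counter(right)).elements())

t1 = [("a", 1), ("a", 1), ("b", 2)]
t2 = [("a", 1)]
assert except_all(t1, t2) == [("a", 1), ("b", 2)]   # one ("a", 1) survives
assert intersect_all(t1, t2) == [("a", 1)]
```

Note that this is exactly where a plain join rewrite must be careful: it has to preserve duplicate counts, not just membership.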
[jira] [Commented] (SPARK-16203) regexp_extract to return an ArrayType(StringType())
[ https://issues.apache.org/jira/browse/SPARK-16203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16551966#comment-16551966 ] Herman van Hovell commented on SPARK-16203: --- [~nnicolini] adding {{regexp_extract_all}} makes sense. Can you file a new ticket for this? BTW there might already be one. > regexp_extract to return an ArrayType(StringType()) > --- > > Key: SPARK-16203 > URL: https://issues.apache.org/jira/browse/SPARK-16203 > Project: Spark > Issue Type: Improvement >Affects Versions: 2.0.0 >Reporter: Max Moroz >Priority: Minor > > regexp_extract only returns a single matched group. If (as if often the case > - e.g., web log parsing) we need to parse the entire line and get all the > groups, we'll need to call it as many times as there are groups. > It's only a minor annoyance syntactically. > But unless I misunderstand something, it would be very inefficient. (How > would Spark know not to do multiple pattern matching operations, when only > one is needed? Or does the optimizer actually check whether the patterns are > identical, and if they are, avoid the repeated regex matching operations??) > Would it be possible to have it return an array when the index is not > specified (defaulting to None)? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
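A minimal sketch of the behavior being requested (a single pass that returns every match instead of repeated regexp_extract calls), written in Python with the standard re module; the function name and signature are hypothetical, not an existing Spark API:

```python
import re

def regexp_extract_all(s, pattern, idx=0):
    # Scan the string once and return the chosen capture group from every
    # match; idx mirrors regexp_extract's group index (0 = whole match).
    return [m.group(idx) for m in re.finditer(pattern, s)]

log = "GET /a 200 GET /b 404"
assert regexp_extract_all(log, r"GET (\S+) (\d+)", 1) == ["/a", "/b"]
assert regexp_extract_all(log, r"\d+") == ["200", "404"]
```

Because the pattern is scanned once, this also sidesteps the inefficiency the reporter worries about: no repeated matching of identical patterns.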
[jira] [Comment Edited] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL
[ https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16551965#comment-16551965 ] Takeshi Yamamuro edited comment on SPARK-21274 at 7/22/18 9:35 AM: --- ok, thanks! Since the feature freeze will come soon, I think this feature will appear in v2.5 or v3.0. So, we don't need to be in a hurry for this. was (Author: maropu): ok, thanks! > Implement EXCEPT ALL and INTERSECT ALL > -- > > Key: SPARK-21274 > URL: https://issues.apache.org/jira/browse/SPARK-21274 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: Ruslan Dautkhanov >Priority: Major > > 1) *EXCEPT ALL* / MINUS ALL : > {code} > SELECT a,b,c FROM tab1 > EXCEPT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following outer join: > {code} > SELECT a,b,c > FROMtab1 t1 > LEFT OUTER JOIN > tab2 t2 > ON ( > (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c) > ) > WHERE > COALESCE(t2.a, t2.b, t2.c) IS NULL > {code} > (register as a temp.view this second query under "*t1_except_t2_df*" name > that can be also used to find INTERSECT ALL below): > 2) *INTERSECT ALL*: > {code} > SELECT a,b,c FROM tab1 > INTERSECT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following anti-join using t1_except_t2_df we defined > above: > {code} > SELECT a,b,c > FROMtab1 t1 > WHERE >NOT EXISTS >(SELECT 1 > FROMt1_except_t2_df e > WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c) >) > {code} > So the suggestion is just to use above query rewrites to implement both > EXCEPT ALL and INTERSECT ALL sql set operations. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL
[ https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16551965#comment-16551965 ] Takeshi Yamamuro commented on SPARK-21274: -- ok, thanks! > Implement EXCEPT ALL and INTERSECT ALL > -- > > Key: SPARK-21274 > URL: https://issues.apache.org/jira/browse/SPARK-21274 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: Ruslan Dautkhanov >Priority: Major > > 1) *EXCEPT ALL* / MINUS ALL : > {code} > SELECT a,b,c FROM tab1 > EXCEPT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following outer join: > {code} > SELECT a,b,c > FROMtab1 t1 > LEFT OUTER JOIN > tab2 t2 > ON ( > (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c) > ) > WHERE > COALESCE(t2.a, t2.b, t2.c) IS NULL > {code} > (register as a temp.view this second query under "*t1_except_t2_df*" name > that can be also used to find INTERSECT ALL below): > 2) *INTERSECT ALL*: > {code} > SELECT a,b,c FROM tab1 > INTERSECT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following anti-join using t1_except_t2_df we defined > above: > {code} > SELECT a,b,c > FROMtab1 t1 > WHERE >NOT EXISTS >(SELECT 1 > FROMt1_except_t2_df e > WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c) >) > {code} > So the suggestion is just to use above query rewrites to implement both > EXCEPT ALL and INTERSECT ALL sql set operations. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL
[ https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16551954#comment-16551954 ] Dilip Biswal commented on SPARK-21274: -- [~maropu] Hi Takeshi, yeah.. So the code that does the rewrite is already in place. I am looking into a alternate way to ReplicateRows and will update in a few days. > Implement EXCEPT ALL and INTERSECT ALL > -- > > Key: SPARK-21274 > URL: https://issues.apache.org/jira/browse/SPARK-21274 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: Ruslan Dautkhanov >Priority: Major > > 1) *EXCEPT ALL* / MINUS ALL : > {code} > SELECT a,b,c FROM tab1 > EXCEPT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following outer join: > {code} > SELECT a,b,c > FROMtab1 t1 > LEFT OUTER JOIN > tab2 t2 > ON ( > (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c) > ) > WHERE > COALESCE(t2.a, t2.b, t2.c) IS NULL > {code} > (register as a temp.view this second query under "*t1_except_t2_df*" name > that can be also used to find INTERSECT ALL below): > 2) *INTERSECT ALL*: > {code} > SELECT a,b,c FROM tab1 > INTERSECT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following anti-join using t1_except_t2_df we defined > above: > {code} > SELECT a,b,c > FROMtab1 t1 > WHERE >NOT EXISTS >(SELECT 1 > FROMt1_except_t2_df e > WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c) >) > {code} > So the suggestion is just to use above query rewrites to implement both > EXCEPT ALL and INTERSECT ALL sql set operations. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24881) New options - compression and compressionLevel
[ https://issues.apache.org/jira/browse/SPARK-24881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24881: Assignee: Apache Spark > New options - compression and compressionLevel > -- > > Key: SPARK-24881 > URL: https://issues.apache.org/jira/browse/SPARK-24881 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Minor > > Currently Avro datasource takes the compression codec name from SQL config > (config key is hard coded in AvroFileFormat): > https://github.com/apache/spark/blob/106880edcd67bc20e8610a16f8ce6aa250268eeb/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala#L121-L125 > . The obvious cons of it is modification of the global config can impact of > multiple writes. > A purpose of the ticket is to add new Avro option - "compression" the same as > we already have for other datasource like JSON, CSV and etc. If new option is > not set by an user, we take settings from SQL config > spark.sql.avro.compression.codec. If the former one is not set too, default > compression codec will be snappy (this is current behavior in the master). > Besides of the compression option, need to add another option - > compressionLevel which should reflect another SQL config in Avro: > https://github.com/apache/spark/blob/106880edcd67bc20e8610a16f8ce6aa250268eeb/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala#L122 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24881) New options - compression and compressionLevel
[ https://issues.apache.org/jira/browse/SPARK-24881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16551952#comment-16551952 ] Apache Spark commented on SPARK-24881: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/21837 > New options - compression and compressionLevel > -- > > Key: SPARK-24881 > URL: https://issues.apache.org/jira/browse/SPARK-24881 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Priority: Minor > > Currently Avro datasource takes the compression codec name from SQL config > (config key is hard coded in AvroFileFormat): > https://github.com/apache/spark/blob/106880edcd67bc20e8610a16f8ce6aa250268eeb/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala#L121-L125 > . The obvious cons of it is modification of the global config can impact of > multiple writes. > A purpose of the ticket is to add new Avro option - "compression" the same as > we already have for other datasource like JSON, CSV and etc. If new option is > not set by an user, we take settings from SQL config > spark.sql.avro.compression.codec. If the former one is not set too, default > compression codec will be snappy (this is current behavior in the master). > Besides of the compression option, need to add another option - > compressionLevel which should reflect another SQL config in Avro: > https://github.com/apache/spark/blob/106880edcd67bc20e8610a16f8ce6aa250268eeb/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala#L122 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24881) New options - compression and compressionLevel
[ https://issues.apache.org/jira/browse/SPARK-24881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24881: Assignee: (was: Apache Spark) > New options - compression and compressionLevel > -- > > Key: SPARK-24881 > URL: https://issues.apache.org/jira/browse/SPARK-24881 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Priority: Minor > > Currently Avro datasource takes the compression codec name from SQL config > (config key is hard coded in AvroFileFormat): > https://github.com/apache/spark/blob/106880edcd67bc20e8610a16f8ce6aa250268eeb/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala#L121-L125 > . The obvious cons of it is modification of the global config can impact of > multiple writes. > A purpose of the ticket is to add new Avro option - "compression" the same as > we already have for other datasource like JSON, CSV and etc. If new option is > not set by an user, we take settings from SQL config > spark.sql.avro.compression.codec. If the former one is not set too, default > compression codec will be snappy (this is current behavior in the master). > Besides of the compression option, need to add another option - > compressionLevel which should reflect another SQL config in Avro: > https://github.com/apache/spark/blob/106880edcd67bc20e8610a16f8ce6aa250268eeb/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala#L122 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24881) New options - compression and compressionLevel
Maxim Gekk created SPARK-24881: -- Summary: New options - compression and compressionLevel Key: SPARK-24881 URL: https://issues.apache.org/jira/browse/SPARK-24881 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.3.1 Reporter: Maxim Gekk Currently the Avro datasource takes the compression codec name from a SQL config (the config key is hard coded in AvroFileFormat): https://github.com/apache/spark/blob/106880edcd67bc20e8610a16f8ce6aa250268eeb/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala#L121-L125 . The obvious con is that modifying the global config can impact multiple writes. The purpose of this ticket is to add a new Avro option, "compression", the same as we already have for other datasources like JSON, CSV, etc. If the new option is not set by the user, we take the setting from the SQL config spark.sql.avro.compression.codec. If that config is not set either, the default compression codec will be snappy (the current behavior in master). Besides the compression option, we need to add another option, compressionLevel, which should reflect another SQL config in Avro: https://github.com/apache/spark/blob/106880edcd67bc20e8610a16f8ce6aa250268eeb/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala#L122 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
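The proposed precedence (per-write option first, then the session config, then the snappy default) amounts to a simple fallback chain. A hedged sketch in Python with a hypothetical resolve_compression helper, not the actual Spark code:

```python
def resolve_compression(write_options, sql_conf):
    # Per-write "compression" option wins; otherwise fall back to the
    # session-level spark.sql.avro.compression.codec; otherwise snappy.
    return (write_options.get("compression")
            or sql_conf.get("spark.sql.avro.compression.codec")
            or "snappy")

conf = {"spark.sql.avro.compression.codec": "bzip2"}
assert resolve_compression({"compression": "deflate"}, conf) == "deflate"
assert resolve_compression({}, conf) == "bzip2"
assert resolve_compression({}, {}) == "snappy"
```

With this precedence, one writer asking for deflate no longer requires mutating a global config that other concurrent writes also read.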
[jira] [Commented] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL
[ https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16551921#comment-16551921 ] Takeshi Yamamuro commented on SPARK-21274: -- Anybody still working on this? > Implement EXCEPT ALL and INTERSECT ALL > -- > > Key: SPARK-21274 > URL: https://issues.apache.org/jira/browse/SPARK-21274 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: Ruslan Dautkhanov >Priority: Major > > 1) *EXCEPT ALL* / MINUS ALL : > {code} > SELECT a,b,c FROM tab1 > EXCEPT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following outer join: > {code} > SELECT a,b,c > FROMtab1 t1 > LEFT OUTER JOIN > tab2 t2 > ON ( > (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c) > ) > WHERE > COALESCE(t2.a, t2.b, t2.c) IS NULL > {code} > (register as a temp.view this second query under "*t1_except_t2_df*" name > that can be also used to find INTERSECT ALL below): > 2) *INTERSECT ALL*: > {code} > SELECT a,b,c FROM tab1 > INTERSECT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following anti-join using t1_except_t2_df we defined > above: > {code} > SELECT a,b,c > FROMtab1 t1 > WHERE >NOT EXISTS >(SELECT 1 > FROMt1_except_t2_df e > WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c) >) > {code} > So the suggestion is just to use above query rewrites to implement both > EXCEPT ALL and INTERSECT ALL sql set operations. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org