[jira] [Commented] (SPARK-21136) Misleading error message for typo in SQL

2017-12-27 Thread Denys Zadorozhnyi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16304378#comment-16304378
 ] 

Denys Zadorozhnyi commented on SPARK-21136:
---

I dug into this issue in Spark 2.2.1 (ANTLR 4.7) and here are my findings:
# The offending token is taken in {{InputMismatchException}} from 
{{recognizer.getCurrentToken()}} (which is "from" in the examples above).
# In ANTLR 4.7.1, {{InputMismatchException.offendingState}} and 
{{InputMismatchException.ctx}} are additionally set in these cases, which 
should give some clues (see [https://github.com/antlr/antlr4/pull/1969] and 
[https://github.com/antlr/antlr4/issues/1922] for details). However, the error 
message is generated in ANTLR's {{DefaultErrorStrategy.reportError()}}. 
I considered passing an error handler (a {{DefaultErrorStrategy}} subclass) to 
the parser, overriding {{reportError()}}, building a new 
{{InputMismatchException}} with the "correct" {{offendingToken}}, and passing 
it up the chain, but that did not feel right. 
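
For reference, a rough sketch of that idea (hedged: the class name and the 
wiring into Spark's {{ParseDriver}} are assumed, and this is not a working fix, 
only the shape of the override):
{code}
import org.antlr.v4.runtime.{DefaultErrorStrategy, InputMismatchException, Parser, RecognitionException}

class SqlParseErrorStrategy extends DefaultErrorStrategy {
  override def reportError(recognizer: Parser, e: RecognitionException): Unit = e match {
    case ime: InputMismatchException =>
      // ime.getOffendingState and ime.getCtx (populated since ANTLR 4.7.1) could be
      // used here to point at the token that actually broke the parse ("joinn")
      // instead of recognizer.getCurrentToken() ("from").
      super.reportError(recognizer, ime)
    case _ =>
      super.reportError(recognizer, e)
  }
}

// Hypothetical wiring: parser.setErrorHandler(new SqlParseErrorStrategy())
{code}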

> Misleading error message for typo in SQL
> 
>
> Key: SPARK-21136
> URL: https://issues.apache.org/jira/browse/SPARK-21136
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Daniel Darabos
>Priority: Minor
>
> {code}
> scala> spark.sql("select * from a left joinn b on a.id = b.id").show
> org.apache.spark.sql.catalyst.parser.ParseException:
> mismatched input 'from' expecting {, 'WHERE', 'GROUP', 'ORDER', 
> 'HAVING', 'LIMIT', 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'MINUS', 
> 'INTERSECT', 'SORT', 'CLUSTER', 'DISTRIBUTE'}(line 1, pos 9)
> == SQL ==
> select * from a left joinn b on a.id = b.id
> -^^^
> {code}
> The issue is that {{^^^}} points at {{from}}, not at {{joinn}}. The text of 
> the error makes no sense either. If {{*}}, {{a}}, and {{b}} are complex in 
> themselves, a misleading error like this can hinder debugging substantially.
> I tried to see if maybe I could fix this. Am I correct to deduce that the 
> error message originates in ANTLR4, which parses the query based on the 
> syntax defined in {{SqlBase.g4}}? If so, I guess I would have to figure out 
> how that syntax definition works, and why it misattributes the error.






[jira] [Updated] (SPARK-22906) External shuffle IP different from Host ip

2017-12-27 Thread Unai Sarasola (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Unai Sarasola updated SPARK-22906:
--
Priority: Minor  (was: Major)

> External shuffle IP different from Host ip
> --
>
> Key: SPARK-22906
> URL: https://issues.apache.org/jira/browse/SPARK-22906
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 2.2.1
>Reporter: Unai Sarasola
>Priority: Minor
>
> It is now possible to configure spark.shuffle.service.port, but there is no 
> equivalent setting for the host. Imagine that you are running an external 
> shuffle service deployed in Docker. 
> If the container is not using host networking, you may want to use the 
> container's internal IP address to connect to the external shuffle service.
> You could also be using Calico, or the shuffle service could simply be 
> attached to a different IP than the one used by the Spark executor (for 
> example, hosts with multiple network interfaces).
> This is why I implemented the spark.shuffle.service.host configuration.






[jira] [Created] (SPARK-22906) External shuffle IP different from Host ip

2017-12-27 Thread Unai Sarasola (JIRA)
Unai Sarasola created SPARK-22906:
-

 Summary: External shuffle IP different from Host ip
 Key: SPARK-22906
 URL: https://issues.apache.org/jira/browse/SPARK-22906
 Project: Spark
  Issue Type: Improvement
  Components: Block Manager
Affects Versions: 2.2.1
Reporter: Unai Sarasola


It is now possible to configure spark.shuffle.service.port, but there is no 
equivalent setting for the host. Imagine that you are running an external 
shuffle service deployed in Docker.

If the container is not using host networking, you may want to use the 
container's internal IP address to connect to the external shuffle service.

You could also be using Calico, or the shuffle service could simply be 
attached to a different IP than the one used by the Spark executor (for 
example, hosts with multiple network interfaces).

This is why I implemented the spark.shuffle.service.host configuration.
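
A hedged example of how the proposed setting could sit next to the existing 
shuffle service options (spark.shuffle.service.host is the new key proposed 
here; the IP value is purely illustrative):
{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.shuffle.service.port", "7337")          // already configurable today
  .set("spark.shuffle.service.host", "10.233.64.12")  // proposed: e.g. the container/Calico IP
{code}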






[jira] [Commented] (SPARK-22906) External shuffle IP different from Host ip

2017-12-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16304382#comment-16304382
 ] 

Apache Spark commented on SPARK-22906:
--

User 'pianista215' has created a pull request for this issue:
https://github.com/apache/spark/pull/20083

> External shuffle IP different from Host ip
> --
>
> Key: SPARK-22906
> URL: https://issues.apache.org/jira/browse/SPARK-22906
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 2.2.1
>Reporter: Unai Sarasola
>Priority: Minor
>
> It is now possible to configure spark.shuffle.service.port, but there is no 
> equivalent setting for the host. Imagine that you are running an external 
> shuffle service deployed in Docker. 
> If the container is not using host networking, you may want to use the 
> container's internal IP address to connect to the external shuffle service.
> You could also be using Calico, or the shuffle service could simply be 
> attached to a different IP than the one used by the Spark executor (for 
> example, hosts with multiple network interfaces).
> This is why I implemented the spark.shuffle.service.host configuration.






[jira] [Assigned] (SPARK-22906) External shuffle IP different from Host ip

2017-12-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22906:


Assignee: Apache Spark

> External shuffle IP different from Host ip
> --
>
> Key: SPARK-22906
> URL: https://issues.apache.org/jira/browse/SPARK-22906
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 2.2.1
>Reporter: Unai Sarasola
>Assignee: Apache Spark
>Priority: Minor
>
> It is now possible to configure spark.shuffle.service.port, but there is no 
> equivalent setting for the host. Imagine that you are running an external 
> shuffle service deployed in Docker. 
> If the container is not using host networking, you may want to use the 
> container's internal IP address to connect to the external shuffle service.
> You could also be using Calico, or the shuffle service could simply be 
> attached to a different IP than the one used by the Spark executor (for 
> example, hosts with multiple network interfaces).
> This is why I implemented the spark.shuffle.service.host configuration.






[jira] [Assigned] (SPARK-22906) External shuffle IP different from Host ip

2017-12-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22906:


Assignee: (was: Apache Spark)

> External shuffle IP different from Host ip
> --
>
> Key: SPARK-22906
> URL: https://issues.apache.org/jira/browse/SPARK-22906
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 2.2.1
>Reporter: Unai Sarasola
>Priority: Minor
>
> It is now possible to configure spark.shuffle.service.port, but there is no 
> equivalent setting for the host. Imagine that you are running an external 
> shuffle service deployed in Docker. 
> If the container is not using host networking, you may want to use the 
> container's internal IP address to connect to the external shuffle service.
> You could also be using Calico, or the shuffle service could simply be 
> attached to a different IP than the one used by the Spark executor (for 
> example, hosts with multiple network interfaces).
> This is why I implemented the spark.shuffle.service.host configuration.






[jira] [Created] (SPARK-22907) MetadataFetchFailedException broadcast is already present

2017-12-27 Thread liupengcheng (JIRA)
liupengcheng created SPARK-22907:


 Summary: MetadataFetchFailedException broadcast is already present
 Key: SPARK-22907
 URL: https://issues.apache.org/jira/browse/SPARK-22907
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.0, 2.3.0
 Environment: Spark2.1.0 + yarn
Reporter: liupengcheng


Currently, an IOException (possibly caused by the physical environment) can 
lead to a MetadataFetchFailedException, and the stage will be retried. However, 
during the stage retry, if a task is scheduled on the same executor where the 
first MetadataFetchFailedException happened, a 'MetadataFetchFailedException: 
broadcast is already present' is thrown.


{noformat}
org.apache.spark.shuffle.MetadataFetchFailedException: java.io.IOException: 
java.lang.IllegalArgumentException: requirement failed: Block broadcast_22 is 
already present in the MemoryStore
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1224)
at 
org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:167)
at 
org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
at 
org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
at 
org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:88)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
at 
org.apache.spark.MapOutputTracker$$anonfun$deserializeMapStatuses$1.apply(MapOutputTracker.scala:675)
at 
org.apache.spark.MapOutputTracker$$anonfun$deserializeMapStatuses$1.apply(MapOutputTracker.scala:675)
at org.apache.spark.Logging$class.logInfo(Logging.scala:58)
at 
org.apache.spark.MapOutputTracker$.logInfo(MapOutputTracker.scala:612)
at 
org.apache.spark.MapOutputTracker$.deserializeMapStatuses(MapOutputTracker.scala:674)
at 
org.apache.spark.MapOutputTracker.getStatuses(MapOutputTracker.scala:202)
at 
org.apache.spark.MapOutputTracker.getMapSizesByExecutorId(MapOutputTracker.scala:141)
at 
org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:55)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:140)
at 
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:136)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:136)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:96)
at 
org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:95)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply$mcV$sp(PairRDDFunctions.scala:1112)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1112)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1112)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1252)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1120)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1091)

[jira] [Commented] (SPARK-22907) MetadataFetchFailedException broadcast is already present

2017-12-27 Thread liupengcheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16304389#comment-16304389
 ] 

liupengcheng commented on SPARK-22907:
--

I think we should clean up the garbage broadcast block when an IOException 
occurs while reading the map statuses broadcast. That way, tasks in the retried 
stage can avoid throwing the 'MetadataFetchFailedException: broadcast is 
already present' exception.
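
A minimal sketch of that idea (illustrative only, not Spark's actual internals; 
the helper name and its parameters are hypothetical):
{code}
// Wrap broadcast deserialization so that a failed read also drops the
// partially cached block before the exception is rethrown. A retried task
// on the same executor then no longer hits "broadcast is already present".
def readMapStatusesWithCleanup[T](
    broadcastId: Long,
    readBlock: () => T,
    removeBlock: Long => Unit): T = {
  try {
    readBlock()
  } catch {
    case e: java.io.IOException =>
      removeBlock(broadcastId)  // clean up the garbage broadcast block
      throw e
  }
}
{code}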

> MetadataFetchFailedException broadcast is already present
> -
>
> Key: SPARK-22907
> URL: https://issues.apache.org/jira/browse/SPARK-22907
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.3.0
> Environment: Spark2.1.0 + yarn
>Reporter: liupengcheng
>
> Currently, an IOException (possibly caused by the physical environment) can 
> lead to a MetadataFetchFailedException, and the stage will be retried. 
> However, during the stage retry, if a task is scheduled on the same executor 
> where the first MetadataFetchFailedException happened, a 
> 'MetadataFetchFailedException: broadcast is already present' is thrown.
> {noformat}
> org.apache.spark.shuffle.MetadataFetchFailedException: java.io.IOException: 
> java.lang.IllegalArgumentException: requirement failed: Block broadcast_22 is 
> already present in the MemoryStore
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1224)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:167)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:88)
> at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$deserializeMapStatuses$1.apply(MapOutputTracker.scala:675)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$deserializeMapStatuses$1.apply(MapOutputTracker.scala:675)
> at org.apache.spark.Logging$class.logInfo(Logging.scala:58)
> at 
> org.apache.spark.MapOutputTracker$.logInfo(MapOutputTracker.scala:612)
> at 
> org.apache.spark.MapOutputTracker$.deserializeMapStatuses(MapOutputTracker.scala:674)
> at 
> org.apache.spark.MapOutputTracker.getStatuses(MapOutputTracker.scala:202)
> at 
> org.apache.spark.MapOutputTracker.getMapSizesByExecutorId(MapOutputTracker.scala:141)
> at 
> org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:55)
> at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:140)
> at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:136)
> at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
> at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:136)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:96)
> at 
> org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:95)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply$mcV$sp(PairRDDFunctions.scala:1112)
> at 
> org.apache.spark.

[jira] [Updated] (SPARK-22903) AlreadyBeingCreatedException in stage retry caused by wrong attemptNumber

2017-12-27 Thread liupengcheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liupengcheng updated SPARK-22903:
-
Flags: Important

> AlreadyBeingCreatedException in stage retry caused by wrong attemptNumber
> -
>
> Key: SPARK-22903
> URL: https://issues.apache.org/jira/browse/SPARK-22903
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.3.0
> Environment: Spark2.1.0 + yarn
>Reporter: liupengcheng
>  Labels: core
>
> We submitted a Spark 2.1.0 job; when a MetadataFetchFailedException occurred, 
> the stage was retried, but an AlreadyBeingCreatedException was thrown and 
> ultimately caused the job to fail.
> {noformat}
> 2017-12-21,21:30:58,406 WARN org.apache.spark.scheduler.TaskSetManager: Lost 
> task 13.0 in stage 7.1 (TID 18990, , executor 326): 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException):
>  Failed to create file 
> [//_temporary/0/_temporary/attempt_201712211720_0026_r_14_0/part-r-00014.snappy.parquet]
>  for [DFSClient_NONMAPREDUCE_-1477691024_103] for client [10.136.42.10], 
> because this file is already being created by 
> [DFSClient_NONMAPREDUCE_940892524_103] on [10.118.21.26]
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2672)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:2388)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2317)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2270)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:604)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:374)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1806)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
> at org.apache.hadoop.ipc.Client.call(Client.java:1477)
> at org.apache.hadoop.ipc.Client.call(Client.java:1408)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy21.create(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:301)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy22.create(Unknown Source)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1779)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1773)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1698)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:433)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:429)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:444)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:373)
> at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:928)
> at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:909)
> at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:806)
> at 
> org.apache.parque

[jira] [Assigned] (SPARK-22907) MetadataFetchFailedException broadcast is already present

2017-12-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22907:


Assignee: Apache Spark

> MetadataFetchFailedException broadcast is already present
> -
>
> Key: SPARK-22907
> URL: https://issues.apache.org/jira/browse/SPARK-22907
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.3.0
> Environment: Spark2.1.0 + yarn
>Reporter: liupengcheng
>Assignee: Apache Spark
>
> Currently, an IOException (possibly caused by the physical environment) can 
> lead to a MetadataFetchFailedException, and the stage will be retried. 
> However, during the stage retry, if a task is scheduled on the same executor 
> where the first MetadataFetchFailedException happened, a 
> 'MetadataFetchFailedException: broadcast is already present' is thrown.
> {noformat}
> org.apache.spark.shuffle.MetadataFetchFailedException: java.io.IOException: 
> java.lang.IllegalArgumentException: requirement failed: Block broadcast_22 is 
> already present in the MemoryStore
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1224)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:167)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:88)
> at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$deserializeMapStatuses$1.apply(MapOutputTracker.scala:675)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$deserializeMapStatuses$1.apply(MapOutputTracker.scala:675)
> at org.apache.spark.Logging$class.logInfo(Logging.scala:58)
> at 
> org.apache.spark.MapOutputTracker$.logInfo(MapOutputTracker.scala:612)
> at 
> org.apache.spark.MapOutputTracker$.deserializeMapStatuses(MapOutputTracker.scala:674)
> at 
> org.apache.spark.MapOutputTracker.getStatuses(MapOutputTracker.scala:202)
> at 
> org.apache.spark.MapOutputTracker.getMapSizesByExecutorId(MapOutputTracker.scala:141)
> at 
> org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:55)
> at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:140)
> at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:136)
> at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
> at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:136)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:96)
> at 
> org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:95)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply$mcV$sp(PairRDDFunctions.scala:1112)
> at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1112)
> at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfu

[jira] [Commented] (SPARK-22907) MetadataFetchFailedException broadcast is already present

2017-12-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16304396#comment-16304396
 ] 

Apache Spark commented on SPARK-22907:
--

User 'liupc' has created a pull request for this issue:
https://github.com/apache/spark/pull/20090

> MetadataFetchFailedException broadcast is already present
> -
>
> Key: SPARK-22907
> URL: https://issues.apache.org/jira/browse/SPARK-22907
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.3.0
> Environment: Spark2.1.0 + yarn
>Reporter: liupengcheng
>
> Currently, an IOException (possibly caused by the physical environment) can 
> lead to a MetadataFetchFailedException, and the stage will be retried. 
> However, during the stage retry, if a task is scheduled on the same executor 
> where the first MetadataFetchFailedException happened, a 
> 'MetadataFetchFailedException: broadcast is already present' is thrown.
> {noformat}
> org.apache.spark.shuffle.MetadataFetchFailedException: java.io.IOException: 
> java.lang.IllegalArgumentException: requirement failed: Block broadcast_22 is 
> already present in the MemoryStore
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1224)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:167)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:88)
> at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$deserializeMapStatuses$1.apply(MapOutputTracker.scala:675)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$deserializeMapStatuses$1.apply(MapOutputTracker.scala:675)
> at org.apache.spark.Logging$class.logInfo(Logging.scala:58)
> at 
> org.apache.spark.MapOutputTracker$.logInfo(MapOutputTracker.scala:612)
> at 
> org.apache.spark.MapOutputTracker$.deserializeMapStatuses(MapOutputTracker.scala:674)
> at 
> org.apache.spark.MapOutputTracker.getStatuses(MapOutputTracker.scala:202)
> at 
> org.apache.spark.MapOutputTracker.getMapSizesByExecutorId(MapOutputTracker.scala:141)
> at 
> org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:55)
> at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:140)
> at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:136)
> at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
> at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:136)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:96)
> at 
> org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:95)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply$mcV$sp(PairRDDFunctions.scala:1112)
> at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1112)
> at 
> org.apa

[jira] [Assigned] (SPARK-22907) MetadataFetchFailedException broadcast is already present

2017-12-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22907:


Assignee: (was: Apache Spark)

> MetadataFetchFailedException broadcast is already present
> -
>
> Key: SPARK-22907
> URL: https://issues.apache.org/jira/browse/SPARK-22907
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.3.0
> Environment: Spark2.1.0 + yarn
>Reporter: liupengcheng
>
> Currently, an IOException (possibly caused by the physical environment) can 
> lead to a MetadataFetchFailedException, and the stage will be retried. 
> However, during the stage retry, if a task is scheduled on the same executor 
> where the first MetadataFetchFailedException happened, a 
> 'MetadataFetchFailedException: broadcast is already present' is thrown.
> {noformat}
> org.apache.spark.shuffle.MetadataFetchFailedException: java.io.IOException: 
> java.lang.IllegalArgumentException: requirement failed: Block broadcast_22 is 
> already present in the MemoryStore
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1224)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:167)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:88)
> at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$deserializeMapStatuses$1.apply(MapOutputTracker.scala:675)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$deserializeMapStatuses$1.apply(MapOutputTracker.scala:675)
> at org.apache.spark.Logging$class.logInfo(Logging.scala:58)
> at 
> org.apache.spark.MapOutputTracker$.logInfo(MapOutputTracker.scala:612)
> at 
> org.apache.spark.MapOutputTracker$.deserializeMapStatuses(MapOutputTracker.scala:674)
> at 
> org.apache.spark.MapOutputTracker.getStatuses(MapOutputTracker.scala:202)
> at 
> org.apache.spark.MapOutputTracker.getMapSizesByExecutorId(MapOutputTracker.scala:141)
> at 
> org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:55)
> at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:140)
> at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:136)
> at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
> at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:136)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:96)
> at 
> org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:95)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply$mcV$sp(PairRDDFunctions.scala:1112)
> at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1112)
> at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFu

[jira] [Commented] (SPARK-22465) Cogroup of two disproportionate RDDs could lead into 2G limit BUG

2017-12-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16304569#comment-16304569
 ] 

Apache Spark commented on SPARK-22465:
--

User 'jiangxb1987' has created a pull request for this issue:
https://github.com/apache/spark/pull/20091

> Cogroup of two disproportionate RDDs could lead into 2G limit BUG
> -
>
> Key: SPARK-22465
> URL: https://issues.apache.org/jira/browse/SPARK-22465
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0, 1.0.1, 1.0.2, 1.1.0, 1.1.1, 1.2.0, 1.2.1, 1.2.2, 
> 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 1.6.3, 
> 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.2.0
>Reporter: Amit Kumar
>Priority: Critical
> Fix For: 2.3.0
>
>
> While running my Spark pipeline, it failed with the following exception:
> {noformat}
> 2017-11-03 04:49:09,776 [Executor task launch worker for task 58670] ERROR 
> org.apache.spark.executor.Executor  - Exception in task 630.0 in stage 28.0 
> (TID 58670)
> java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
>   at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:869)
>   at 
> org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:103)
>   at 
> org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:91)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1303)
>   at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:105)
>   at 
> org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:469)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:705)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:324)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> After debugging, I found that the issue lies in how Spark handles the 
> cogroup of two RDDs.
> Here is the relevant code from Apache Spark:
> {noformat}
>  /**
>* For each key k in `this` or `other`, return a resulting RDD that 
> contains a tuple with the
>* list of values for that key in `this` as well as `other`.
>*/
>   def cogroup[W](other: RDD[(K, W)]): RDD[(K, (Iterable[V], Iterable[W]))] = 
> self.withScope {
> cogroup(other, defaultPartitioner(self, other))
>   }
> /**
>* Choose a partitioner to use for a cogroup-like operation between a 
> number of RDDs.
>*
>* If any of the RDDs already has a partitioner, choose that one.
>*
>* Otherwise, we use a default HashPartitioner. For the number of 
> partitions, if
>* spark.default.parallelism is set, then we'll use the value from 
> SparkContext
>* defaultParallelism, otherwise we'll use the max number of upstream 
> partitions.
>*
>* Unless spark.default.parallelism is set, the number of partitions will 
> be the
>* same as the number of partitions in the largest upstream RDD, as this 
> should
>* be least likely to cause out-of-memory errors.
>*
>* We use two method parameters (rdd, others) to enforce callers passing at 
> least 1 RDD.
>*/
>   def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
> val rdds = (Seq(rdd) ++ others)
> val hasPartitioner = rdds.filter(_.partitioner.exists(_.numPartitions > 
> 0))
> if (hasPartitioner.nonEmpty) {
>   hasPartitioner.maxBy(_.partitions.length).partitioner.get
> } else {
>   if (rdd.context.conf.contains("spark.default.parallelism")) {
> new HashPartitioner(rdd.context.defaultParallelism)
>   } else {
> new HashPartitioner(rdds.map(_.partitions.length).max)
>   }
> }
>   }
> {noformat}
> Given this, suppose we have two pair RDDs:
> RDD1: a small RDD with little data and few partitions
> RDD2: a huge RDD with lots of data and many partitions
> Now, if we were to cogroup them,
> {noformat}
> val RDD3 = RDD1.cogroup(RDD2)
> {noformat}
> there is a case where this can trigger the SPARK-6235 bug: if RDD1 has a 
> partitioner when it is passed into the cogroup, the cogroup's partitions are 
> decided by that partitioner, which can lead to the huge RDD2 being shuffled 
> into a small number of partitions.
> One way is probably to add a safety check here that woul
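
For reference, a minimal sketch of the scenario described above (the object 
name, data sizes, and partition counts are illustrative, not taken from the 
failing job):
{code}
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object CogroupSkewSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("cogroup-skew").setMaster("local[*]"))

    // Small RDD that already carries a partitioner with very few partitions.
    val small = sc.parallelize(1 to 100)
      .map(k => (k, "s"))
      .partitionBy(new HashPartitioner(4))

    // Huge RDD with many partitions and no partitioner.
    val large = sc.parallelize(1 to 10000000, 2000).map(k => (k % 1000, "l"))

    // defaultPartitioner picks small's HashPartitioner(4), so the cogroup
    // output has only 4 partitions, and with enough data a single shuffled
    // block can grow past the 2G limit.
    val grouped = small.cogroup(large)
    println(s"output partitions: ${grouped.getNumPartitions}")  // prints 4

    sc.stop()
  }
}
{code}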

[jira] [Created] (SPARK-22908) add basic continuous kafka source

2017-12-27 Thread Jose Torres (JIRA)
Jose Torres created SPARK-22908:
---

 Summary: add basic continuous kafka source
 Key: SPARK-22908
 URL: https://issues.apache.org/jira/browse/SPARK-22908
 Project: Spark
  Issue Type: Sub-task
  Components: Structured Streaming
Affects Versions: 2.3.0
Reporter: Jose Torres









[jira] [Created] (SPARK-22909) Move Structured Streaming v2 APIs to streaming package

2017-12-27 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-22909:


 Summary: Move Structured Streaming v2 APIs to streaming package
 Key: SPARK-22909
 URL: https://issues.apache.org/jira/browse/SPARK-22909
 Project: Spark
  Issue Type: Sub-task
  Components: Structured Streaming
Affects Versions: 2.3.0
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu









[jira] [Assigned] (SPARK-22909) Move Structured Streaming v2 APIs to streaming package

2017-12-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22909:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Move Structured Streaming v2 APIs to streaming package
> --
>
> Key: SPARK-22909
> URL: https://issues.apache.org/jira/browse/SPARK-22909
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>







[jira] [Commented] (SPARK-22909) Move Structured Streaming v2 APIs to streaming package

2017-12-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16304729#comment-16304729
 ] 

Apache Spark commented on SPARK-22909:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/20093

> Move Structured Streaming v2 APIs to streaming package
> --
>
> Key: SPARK-22909
> URL: https://issues.apache.org/jira/browse/SPARK-22909
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>







[jira] [Assigned] (SPARK-22909) Move Structured Streaming v2 APIs to streaming package

2017-12-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22909:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Move Structured Streaming v2 APIs to streaming package
> --
>
> Key: SPARK-22909
> URL: https://issues.apache.org/jira/browse/SPARK-22909
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>







[jira] [Commented] (SPARK-20392) Slow performance when calling fit on ML pipeline for dataset with many columns but few rows

2017-12-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16304734#comment-16304734
 ] 

Apache Spark commented on SPARK-20392:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/20094

> Slow performance when calling fit on ML pipeline for dataset with many 
> columns but few rows
> ---
>
> Key: SPARK-20392
> URL: https://issues.apache.org/jira/browse/SPARK-20392
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Barry Becker
>Assignee: Liang-Chi Hsieh
> Fix For: 2.3.0
>
> Attachments: blockbuster.csv, blockbuster_fewCols.csv, 
> giant_query_plan_for_fitting_pipeline.txt, model_9754.zip, model_9756.zip
>
>
> This started as a [question on stack 
> overflow|http://stackoverflow.com/questions/43484006/why-is-it-slow-to-apply-a-spark-pipeline-to-dataset-with-many-columns-but-few-ro],
>  but it seems like a bug.
> I am testing Spark pipelines using a simple dataset (attached) with 312 
> (mostly numeric) columns, but only 421 rows. It is small, but it takes 3 
> minutes to apply my ML pipeline to it on a 24-core server with 60G of memory. 
> This seems much too long for such a tiny dataset. Similar pipelines run 
> quickly on datasets that have fewer columns and more rows. It's something 
> about the number of columns that is causing the slow performance.
> Here is a list of the stages in my pipeline:
> {code}
> 000_strIdx_5708525b2b6c
> 001_strIdx_ec2296082913
> 002_bucketizer_3cbc8811877b
> 003_bucketizer_5a01d5d78436
> 004_bucketizer_bf290d11364d
> 005_bucketizer_c3296dfe94b2
> 006_bucketizer_7071ca50eb85
> 007_bucketizer_27738213c2a1
> 008_bucketizer_bd728fd89ba1
> 009_bucketizer_e1e716f51796
> 010_bucketizer_38be665993ba
> 011_bucketizer_5a0e41e5e94f
> 012_bucketizer_b5a3d5743aaa
> 013_bucketizer_4420f98ff7ff
> 014_bucketizer_777cc4fe6d12
> 015_bucketizer_f0f3a3e5530e
> 016_bucketizer_218ecca3b5c1
> 017_bucketizer_0b083439a192
> 018_bucketizer_4520203aec27
> 019_bucketizer_462c2c346079
> 020_bucketizer_47435822e04c
> 021_bucketizer_eb9dccb5e6e8
> 022_bucketizer_b5f63dd7451d
> 023_bucketizer_e0fd5041c841
> 024_bucketizer_ffb3b9737100
> 025_bucketizer_e06c0d29273c
> 026_bucketizer_36ee535a425f
> 027_bucketizer_ee3a330269f1
> 028_bucketizer_094b58ea01c0
> 029_bucketizer_e93ea86c08e2
> 030_bucketizer_4728a718bc4b
> 031_bucketizer_08f6189c7fcc
> 032_bucketizer_11feb74901e6
> 033_bucketizer_ab4add4966c7
> 034_bucketizer_4474f7f1b8ce
> 035_bucketizer_90cfa5918d71
> 036_bucketizer_1a9ff5e4eccb
> 037_bucketizer_38085415a4f4
> 038_bucketizer_9b5e5a8d12eb
> 039_bucketizer_082bb650ecc3
> 040_bucketizer_57e1e363c483
> 041_bucketizer_337583fbfd65
> 042_bucketizer_73e8f6673262
> 043_bucketizer_0f9394ed30b8
> 044_bucketizer_8530f3570019
> 045_bucketizer_c53614f1e507
> 046_bucketizer_8fd99e6ec27b
> 047_bucketizer_6a8610496d8a
> 048_bucketizer_888b0055c1ad
> 049_bucketizer_974e0a1433a6
> 050_bucketizer_e848c0937cb9
> 051_bucketizer_95611095a4ac
> 052_bucketizer_660a6031acd9
> 053_bucketizer_aaffe5a3140d
> 054_bucketizer_8dc569be285f
> 055_bucketizer_83d1bffa07bc
> 056_bucketizer_0c6180ba75e6
> 057_bucketizer_452f265a000d
> 058_bucketizer_38e02ddfb447
> 059_bucketizer_6fa4ad5d3ebd
> 060_bucketizer_91044ee766ce
> 061_bucketizer_9a9ef04a173d
> 062_bucketizer_3d98eb15f206
> 063_bucketizer_c4915bb4d4ed
> 064_bucketizer_8ca2b6550c38
> 065_bucketizer_417ee9b760bc
> 066_bucketizer_67f3556bebe8
> 067_bucketizer_0556deb652c6
> 068_bucketizer_067b4b3d234c
> 069_bucketizer_30ba55321538
> 070_bucketizer_ad826cc5d746
> 071_bucketizer_77676a898055
> 072_bucketizer_05c37a38ce30
> 073_bucketizer_6d9ae54163ed
> 074_bucketizer_8cd668b2855d
> 075_bucketizer_d50ea1732021
> 076_bucketizer_c68f467c9559
> 077_bucketizer_ee1dfc840db1
> 078_bucketizer_83ec06a32519
> 079_bucketizer_741d08c1b69e
> 080_bucketizer_b7402e4829c7
> 081_bucketizer_8adc590dc447
> 082_bucketizer_673be99bdace
> 083_bucketizer_77693b45f94c
> 084_bucketizer_53529c6b1ac4
> 085_bucketizer_6a3ca776a81e
> 086_bucketizer_6679d9588ac1
> 087_bucketizer_6c73af456f65
> 088_bucketizer_2291b2c5ab51
> 089_bucketizer_cb3d0fe669d8
> 090_bucketizer_e71f913c1512
> 091_bucketizer_156528f65ce7
> 092_bucketizer_f3ec5dae079b
> 093_bucketizer_809fab77eee1
> 094_bucketizer_6925831511e6
> 095_bucketizer_c5d853b95707
> 096_bucketizer_e677659ca253
> 097_bucketizer_396e35548c72
> 098_bucketizer_78a6410d7a84
> 099_bucketizer_e3ae6e54bca1
> 100_bucketizer_9fed5923fe8a
> 101_bucketizer_8925ba4c3ee2
> 102_bucketizer_95750b6942b8
> 103_bucketizer_6e8b50a1918b
> 104_bucketizer_36cfcc13d4ba
> 105_bucketizer_2716d0455512
> 106_bucketizer_9bcf2891652f
> 107_bucketizer_8c3d352915f7
> 108_bucketizer_0786c

[jira] [Updated] (SPARK-22909) Move Structured Streaming v2 APIs to streaming package

2017-12-27 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-22909:
-
Priority: Blocker  (was: Major)

> Move Structured Streaming v2 APIs to streaming package
> --
>
> Key: SPARK-22909
> URL: https://issues.apache.org/jira/browse/SPARK-22909
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Blocker
>







[jira] [Updated] (SPARK-22909) Move Structured Streaming v2 APIs to streaming package

2017-12-27 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-22909:
-
Target Version/s: 2.3.0

> Move Structured Streaming v2 APIs to streaming package
> --
>
> Key: SPARK-22909
> URL: https://issues.apache.org/jira/browse/SPARK-22909
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Blocker
>







[jira] [Created] (SPARK-22910) Wrong results in Spark Job because failed to move to Trash

2017-12-27 Thread Ohad Raviv (JIRA)
Ohad Raviv created SPARK-22910:
--

 Summary: Wrong results in Spark Job because failed to move to Trash
 Key: SPARK-22910
 URL: https://issues.apache.org/jira/browse/SPARK-22910
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0, 2.1.0
Reporter: Ohad Raviv


Our Spark job completed with a successful status although the saved data was 
corrupted.

We have a monthly job, and each run overwrites the output of the previous run. 
We happened to change the sql.shuffle.partitions number between runs from 2000 
to 1000, and the new run hit a WARN-level failure while moving the old data to 
the user's .Trash because it was full. Because it was only a warning, the 
process continued and wrote the new 1000 files, while leaving most of the 
remaining old 1000 files in place. As a result, the final output was a folder 
with a mix of old and new data, which corrupted the downstream process.

The post mortem is relatively easy to understand.
{code}
hadoop fs -ls /the/folder
-rwxr-xr-x 3 spark_user spark_user 209012005 2017-12-10 14:20 
/the/folder/part-0.gz 
.
.
-rwxr-xr-x 3 spark_user spark_user 34899 2017-11-17 06:39 
/the/folder/part-01990.gz 
{code}
and in the driver's log:
{code}
17/12/10 15:10:00 WARN Hive: Directory hdfs:///the/folder cannot be removed: 
java.io.IOException: Failed to move to trash: hdfs:///the/folder/part-0.gz
java.io.IOException: Failed to move to trash: hdfs:///the/folder/part-0.gz
at 
org.apache.hadoop.fs.TrashPolicyDefault.moveToTrash(TrashPolicyDefault.java:160)
at org.apache.hadoop.fs.Trash.moveToTrash(Trash.java:109)
at org.apache.hadoop.fs.Trash.moveToAppropriateTrash(Trash.java:90)
at 
org.apache.hadoop.hive.shims.Hadoop23Shims.moveToAppropriateTrash(Hadoop23Shims.java:272)
at 
org.apache.hadoop.hive.common.FileUtils.moveToTrash(FileUtils.java:603)
at 
org.apache.hadoop.hive.common.FileUtils.trashFilesUnderDir(FileUtils.java:586)
at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:2851)
at org.apache.hadoop.hive.ql.metadata.Hive.loadTable(Hive.java:1640)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.spark.sql.hive.client.Shim_v0_14.loadTable(HiveShim.scala:716)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply$mcV$sp(HiveClientImpl.scala:672)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply(HiveClientImpl.scala:672)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply(HiveClientImpl.scala:672)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:283)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:230)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:229)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:272)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.loadTable(HiveClientImpl.scala:671)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply$mcV$sp(HiveExternalCatalog.scala:741)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply(HiveExternalCatalog.scala:739)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply(HiveExternalCatalog.scala:739)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:95)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:739)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:323)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:170)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:347)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
at org.apache.spark.sql.execution.Sp

[jira] [Commented] (SPARK-22126) Fix model-specific optimization support for ML tuning

2017-12-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16304765#comment-16304765
 ] 

Apache Spark commented on SPARK-22126:
--

User 'MrBago' has created a pull request for this issue:
https://github.com/apache/spark/pull/20095

> Fix model-specific optimization support for ML tuning
> -
>
> Key: SPARK-22126
> URL: https://issues.apache.org/jira/browse/SPARK-22126
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>
> Fix model-specific optimization support for ML tuning. This is discussed in 
> SPARK-19357; further discussion is here:
>  https://gist.github.com/MrBago/f501b9e7712dc6a67dc9fea24e309bf0
> Anyone who's following might want to scan the design doc (in the links 
> above); the latest API proposal is:
> {code}
> def fitMultiple(
> dataset: Dataset[_],
> paramMaps: Array[ParamMap]
>   ): java.util.Iterator[scala.Tuple2[java.lang.Integer, Model]]
> {code}
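> As a minimal, illustrative sketch (not the final API), such a default {{fitMultiple}} could 
> simply wrap the existing one-ParamMap {{fit}} and yield (index, model) pairs lazily; 
> {{fitOne}} below is a hypothetical stand-in for {{Estimator.fit(dataset, paramMap)}}:
> {code}
> import java.util.{Iterator => JIterator}
> import scala.collection.JavaConverters._
>
> // Sketch only: a default fitMultiple layered on top of a per-ParamMap fit.
> // Estimators with model-specific optimizations can override it to share work.
> def fitMultipleDefault[M](paramMapCount: Int, fitOne: Int => M): JIterator[(java.lang.Integer, M)] = {
>   (0 until paramMapCount).iterator
>     .map(i => (Int.box(i), fitOne(i)))
>     .asJava
> }
> {code}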
> Old discussion:
> I copied the discussion from the gist to here:
> I propose to design the API as:
> {code}
> def fitCallables(dataset: Dataset[_], paramMaps: Array[ParamMap]): 
> Array[Callable[Map[Int, M]]]
> {code}
> Let me use an example to explain the API:
> {quote}
>  It could be possible to still use the current parallelism and still allow 
> for model-specific optimizations. For example, if we are doing cross validation 
> and have a param map with regParam = (0.1, 0.3) and maxIter = (5, 10). Let's 
> say that the cross validator could know that maxIter is optimized for the 
> model being evaluated (e.g. a new method in Estimator that returns such 
> params). It would then be straightforward for the cross validator to remove 
> maxIter from the param map that will be parallelized over and use it to 
> create 2 arrays of paramMaps: ((regParam=0.1, maxIter=5), (regParam=0.1, 
> maxIter=10)) and ((regParam=0.3, maxIter=5), (regParam=0.3, maxIter=10)).
> {quote}
> In this example, we can see that the models for ((regParam=0.1, maxIter=5), 
> (regParam=0.1, maxIter=10)) can only be computed in one thread, and the models 
> for ((regParam=0.3, maxIter=5), (regParam=0.3, maxIter=10)) in another thread. 
> So although there are 4 paramMaps here, we can generate at most two threads to 
> compute the models for them.
> The API above allows "callable.call()" to return multiple models; the return 
> type is {code}Map[Int, M]{code}, where the integer key marks the paramMap index 
> of the corresponding model. Using the example above, there are 4 paramMaps but 
> only 2 callable objects are returned: one callable for ((regParam=0.1, 
> maxIter=5), (regParam=0.1, maxIter=10)) and another for ((regParam=0.3, 
> maxIter=5), (regParam=0.3, maxIter=10)).
> The default "fitCallables/fit with paramMaps" can be implemented as 
> follows:
> {code}
> def fitCallables(dataset: Dataset[_], paramMaps: Array[ParamMap]):
>     Array[Callable[Map[Int, M]]] = {
>   paramMaps.zipWithIndex.map { case (paramMap: ParamMap, index: Int) =>
>     new Callable[Map[Int, M]] {
>       override def call(): Map[Int, M] = Map(index -> fit(dataset, paramMap))
>     }
>   }
> }
> def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[M] = {
>   // Run every callable, flatten the (index -> model) pairs, and restore paramMap order.
>   fitCallables(dataset, paramMaps).toSeq
>     .flatMap(_.call().toSeq).sortBy(_._1).map(_._2)
> }
> {code}
> If we use the API proposed above, the code in 
> [CrossValidation|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala#L149-L159]
> can be changed to:
> {code}
>   val trainingDataset = sparkSession.createDataFrame(training, 
> schema).cache()
>   val validationDataset = sparkSession.createDataFrame(validation, 
> schema).cache()
>   // Fit models in a Future for training in parallel
>   val modelMapFutures = fitCallables(trainingDataset, paramMaps).map { 
> callable =>
>  Future[Map[Int, Model[_]]] {
> val modelMap = callable.call()
> if (collectSubModelsParam) {
>...
> }
> modelMap
>  } (executionContext)
>   }
>   // Unpersist training data only when all models have trained
>   Future.sequence[Model[_], Iterable](modelMapFutures)(implicitly, 
> executionContext)
> .onComplete { _ => trainingDataset.unpersist() } (executionContext)
>   // Evaluate models in a Future that will calculate a metric and allow 
> model to be cleaned up
>   val foldMetricMapFutures = modelMapFutures.map { modelMapFuture =>
> modelMapFuture.map { modelMap =>
>   modelMap.map { case (index: Int, model: Model[_]) =>
> val metric = eval.evaluate(model.transform(validationDataset, 
> paramMaps(index)))
> (index, metric)

[jira] [Commented] (SPARK-22908) add basic continuous kafka source

2017-12-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16304811#comment-16304811
 ] 

Apache Spark commented on SPARK-22908:
--

User 'jose-torres' has created a pull request for this issue:
https://github.com/apache/spark/pull/20096

> add basic continuous kafka source
> -
>
> Key: SPARK-22908
> URL: https://issues.apache.org/jira/browse/SPARK-22908
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Jose Torres
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22908) add basic continuous kafka source

2017-12-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22908:


Assignee: Apache Spark

> add basic continuous kafka source
> -
>
> Key: SPARK-22908
> URL: https://issues.apache.org/jira/browse/SPARK-22908
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Jose Torres
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22908) add basic continuous kafka source

2017-12-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22908:


Assignee: (was: Apache Spark)

> add basic continuous kafka source
> -
>
> Key: SPARK-22908
> URL: https://issues.apache.org/jira/browse/SPARK-22908
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Jose Torres
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22911) Migrate structured streaming sources to new DataSourceV2 APIs

2017-12-27 Thread Jose Torres (JIRA)
Jose Torres created SPARK-22911:
---

 Summary: Migrate structured streaming sources to new DataSourceV2 
APIs
 Key: SPARK-22911
 URL: https://issues.apache.org/jira/browse/SPARK-22911
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.3.0
Reporter: Jose Torres






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22912) Support v2 streaming sources and sinks in MicroBatchExecution

2017-12-27 Thread Jose Torres (JIRA)
Jose Torres created SPARK-22912:
---

 Summary: Support v2 streaming sources and sinks in 
MicroBatchExecution
 Key: SPARK-22912
 URL: https://issues.apache.org/jira/browse/SPARK-22912
 Project: Spark
  Issue Type: Sub-task
  Components: Structured Streaming
Affects Versions: 2.3.0
Reporter: Jose Torres






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22912) Support v2 streaming sources and sinks in MicroBatchExecution

2017-12-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16304817#comment-16304817
 ] 

Apache Spark commented on SPARK-22912:
--

User 'jose-torres' has created a pull request for this issue:
https://github.com/apache/spark/pull/20097

> Support v2 streaming sources and sinks in MicroBatchExecution
> -
>
> Key: SPARK-22912
> URL: https://issues.apache.org/jira/browse/SPARK-22912
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Jose Torres
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22912) Support v2 streaming sources and sinks in MicroBatchExecution

2017-12-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22912:


Assignee: (was: Apache Spark)

> Support v2 streaming sources and sinks in MicroBatchExecution
> -
>
> Key: SPARK-22912
> URL: https://issues.apache.org/jira/browse/SPARK-22912
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Jose Torres
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22912) Support v2 streaming sources and sinks in MicroBatchExecution

2017-12-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22912:


Assignee: Apache Spark

> Support v2 streaming sources and sinks in MicroBatchExecution
> -
>
> Key: SPARK-22912
> URL: https://issues.apache.org/jira/browse/SPARK-22912
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Jose Torres
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22913) Hive Partition Pruning, Fractional and Timestamp types

2017-12-27 Thread Ameen Tayyebi (JIRA)
Ameen Tayyebi created SPARK-22913:
-

 Summary: Hive Partition Pruning, Fractional and Timestamp types
 Key: SPARK-22913
 URL: https://issues.apache.org/jira/browse/SPARK-22913
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Ameen Tayyebi
 Fix For: 2.3.0


Spark currently pushes the predicates it has in the SQL query to Hive 
Metastore. This only applies to predicates that are placed on top of 
partitioning columns. As more and more hive metastore implementations come 
around, this is an important optimization to allow data to be prefiltered to 
only relevant partitions. Consider the following example:

Table:
create external table data (key string, quantity long)
partitioned by (processing-date timestamp)

Query:
select * from data where processing-date = '2017-10-23 00:00:00'

Currently, no filters will be pushed to the Hive metastore for the above query. 
The reason is that the code that computes the predicates to be sent to the Hive 
metastore only deals with integral and string column types; it doesn't know 
how to handle fractional and timestamp columns.

I have tables in my metastore (AWS Glue) with millions of partitions of type 
timestamp and double. In my specific case, it takes Spark's master node about 
6.5 minutes to download all partitions for the table, and then filter the 
partitions client-side. The actual processing time of my query is only 6 
seconds. In other words, without partition pruning, I'm looking at 6.5 minutes 
of processing and with partition pruning, I'm looking at 6 seconds only.

I have a fix for this developed locally that I'll provide shortly as a pull 
request.
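
Until then, here is a rough, hypothetical sketch (not that patch) of the kind of 
conversion needed: rendering typed partition values, including timestamps and doubles, 
into a metastore filter string instead of falling back to client-side filtering.

{code}
// Hypothetical sketch only -- not Spark's actual HiveShim code.
import java.sql.Timestamp

def quoted(v: Any): String = "\"" + v + "\""

def renderEquals(column: String, value: Any): Option[String] = value match {
  case s: String    => Some(s"$column = ${quoted(s)}")
  case i: Int       => Some(s"$column = $i")
  case l: Long      => Some(s"$column = $l")
  case d: Double    => Some(s"$column = $d")           // fractional types: previously unsupported
  case t: Timestamp => Some(s"$column = ${quoted(t)}") // timestamp types: previously unsupported
  case _            => None                            // unknown type: fall back to client-side pruning
}

// For the example query above this yields something like:
//   `processing-date` = "2017-10-23 00:00:00.0"
renderEquals("`processing-date`", Timestamp.valueOf("2017-10-23 00:00:00")).foreach(println)
{code}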



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22914) Subbing for spark.history.ui.port does not resolve by default

2017-12-27 Thread Gera Shegalov (JIRA)
Gera Shegalov created SPARK-22914:
-

 Summary: Subbing for spark.history.ui.port does not resolve by 
default
 Key: SPARK-22914
 URL: https://issues.apache.org/jira/browse/SPARK-22914
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 2.2.1
Reporter: Gera Shegalov


In order not to hardcode the SHS web UI port and not duplicate information that is 
already configured, we might be inclined to define 
{{spark.yarn.historyServer.address}} as 
{code}http://${hadoopconf-yarn.resourcemanager.hostname}:${spark.history.ui.port}{code}

However, since spark.history.ui.port is not registered, its resolution fails 
when it is not explicitly set in the deployed Spark conf. 
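
One possible direction, sketched here only as an illustration (the actual pull request 
may do this differently), is to register the port as a known config entry inside Spark, 
e.g. in the org.apache.spark.internal.config package, so that ${spark.history.ui.port} 
resolves to its default even when it is not set explicitly:

{code}
// Sketch only; assumes it lives inside Spark's org.apache.spark.internal.config
// package, where ConfigBuilder is visible. 18080 is the History Server's default UI port.
private[spark] val HISTORY_UI_PORT = ConfigBuilder("spark.history.ui.port")
  .doc("Web UI port to bind the Spark History Server")
  .intConf
  .createWithDefault(18080)
{code}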



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22914) Subbing for spark.history.ui.port does not resolve by default

2017-12-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22914:


Assignee: (was: Apache Spark)

> Subbing for spark.history.ui.port does not resolve by default
> -
>
> Key: SPARK-22914
> URL: https://issues.apache.org/jira/browse/SPARK-22914
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.2.1
>Reporter: Gera Shegalov
>
> In order not to hardcode SHS web ui port and not duplicate information that 
> is already configured we might be inclined to define 
> {{spark.yarn.historyServer.address}} as 
> {code}http://${hadoopconf-yarn.resourcemanager.hostname}:${spark.history.ui.port}{code}
> However, since spark.history.ui.port is not registered its resolution fails 
> when it's not explicitly set in the deployed spark conf. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22914) Subbing for spark.history.ui.port does not resolve by default

2017-12-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16304863#comment-16304863
 ] 

Apache Spark commented on SPARK-22914:
--

User 'gerashegalov' has created a pull request for this issue:
https://github.com/apache/spark/pull/20098

> Subbing for spark.history.ui.port does not resolve by default
> -
>
> Key: SPARK-22914
> URL: https://issues.apache.org/jira/browse/SPARK-22914
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.2.1
>Reporter: Gera Shegalov
>
> In order not to hardcode SHS web ui port and not duplicate information that 
> is already configured we might be inclined to define 
> {{spark.yarn.historyServer.address}} as 
> {code}http://${hadoopconf-yarn.resourcemanager.hostname}:${spark.history.ui.port}{code}
> However, since spark.history.ui.port is not registered its resolution fails 
> when it's not explicitly set in the deployed spark conf. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22914) Subbing for spark.history.ui.port does not resolve by default

2017-12-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22914:


Assignee: Apache Spark

> Subbing for spark.history.ui.port does not resolve by default
> -
>
> Key: SPARK-22914
> URL: https://issues.apache.org/jira/browse/SPARK-22914
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.2.1
>Reporter: Gera Shegalov
>Assignee: Apache Spark
>
> In order not to hardcode SHS web ui port and not duplicate information that 
> is already configured we might be inclined to define 
> {{spark.yarn.historyServer.address}} as 
> {code}http://${hadoopconf-yarn.resourcemanager.hostname}:${spark.history.ui.port}{code}
> However, since spark.history.ui.port is not registered its resolution fails 
> when it's not explicitly set in the deployed spark conf. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22883) ML test for StructuredStreaming: spark.ml.feature, A-M

2017-12-27 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-22883:
--
Description: 
*For featurizers with names from A - M*

Task for adding Structured Streaming tests for all Models/Transformers in a 
sub-module in spark.ml

For an example, see LinearRegressionSuite.scala in 
https://github.com/apache/spark/pull/19843

  was:
Task for adding Structured Streaming tests for all Models/Transformers in a 
sub-module in spark.ml

For an example, see LinearRegressionSuite.scala in 
https://github.com/apache/spark/pull/19843


> ML test for StructuredStreaming: spark.ml.feature, A-M
> --
>
> Key: SPARK-22883
> URL: https://issues.apache.org/jira/browse/SPARK-22883
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>
> *For featurizers with names from A - M*
> Task for adding Structured Streaming tests for all Models/Transformers in a 
> sub-module in spark.ml
> For an example, see LinearRegressionSuite.scala in 
> https://github.com/apache/spark/pull/19843



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22915) ML test for StructuredStreaming: spark.ml.feature, N-Z

2017-12-27 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-22915:
-

 Summary: ML test for StructuredStreaming: spark.ml.feature, N-Z
 Key: SPARK-22915
 URL: https://issues.apache.org/jira/browse/SPARK-22915
 Project: Spark
  Issue Type: Test
  Components: ML, Tests
Affects Versions: 2.3.0
Reporter: Joseph K. Bradley


*For featurizers with names from A - M*

Task for adding Structured Streaming tests for all Models/Transformers in a 
sub-module in spark.ml

For an example, see LinearRegressionSuite.scala in 
https://github.com/apache/spark/pull/19843



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22883) ML test for StructuredStreaming: spark.ml.feature, A-M

2017-12-27 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-22883:
--
Summary: ML test for StructuredStreaming: spark.ml.feature, A-M  (was: ML 
test for StructuredStreaming: spark.ml.feature)

> ML test for StructuredStreaming: spark.ml.feature, A-M
> --
>
> Key: SPARK-22883
> URL: https://issues.apache.org/jira/browse/SPARK-22883
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>
> Task for adding Structured Streaming tests for all Models/Transformers in a 
> sub-module in spark.ml
> For an example, see LinearRegressionSuite.scala in 
> https://github.com/apache/spark/pull/19843



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22915) ML test for StructuredStreaming: spark.ml.feature, N-Z

2017-12-27 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-22915:
--
Description: 
*For featurizers with names from N - Z*

Task for adding Structured Streaming tests for all Models/Transformers in a 
sub-module in spark.ml

For an example, see LinearRegressionSuite.scala in 
https://github.com/apache/spark/pull/19843

  was:
*For featurizers with names from A - M*

Task for adding Structured Streaming tests for all Models/Transformers in a 
sub-module in spark.ml

For an example, see LinearRegressionSuite.scala in 
https://github.com/apache/spark/pull/19843


> ML test for StructuredStreaming: spark.ml.feature, N-Z
> --
>
> Key: SPARK-22915
> URL: https://issues.apache.org/jira/browse/SPARK-22915
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>
> *For featurizers with names from N - Z*
> Task for adding Structured Streaming tests for all Models/Transformers in a 
> sub-module in spark.ml
> For an example, see LinearRegressionSuite.scala in 
> https://github.com/apache/spark/pull/19843



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22916) shouldn't bias towards build right if user does not specify

2017-12-27 Thread Feng Liu (JIRA)
Feng Liu created SPARK-22916:


 Summary: shouldn't bias towards build right if user does not 
specify
 Key: SPARK-22916
 URL: https://issues.apache.org/jira/browse/SPARK-22916
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Feng Liu


This is an issue very similar to SPARK-22489. When there are no broadcast 
hints, the current spark strategies will prefer to build right, without 
considering the sizes of the two sides. To reproduce:

{code:java}
import org.apache.spark.sql.execution.joins.BroadcastHashJoinExec

spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", 
"value").createTempView("table1")
spark.createDataFrame(Seq((1, "1"), (2, "2"), (3, "3"))).toDF("key", 
"value").createTempView("table2")

val bl = sql(s"SELECT * FROM table1 t1 JOIN table2 t2 ON t1.key = 
t2.key").queryExecution.executedPlan
{code}

The plan is going to broadcast the right side (`t2`), even though it is larger.
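
For anyone reproducing this, one quick way to confirm which side gets built (reusing 
{{bl}} and the import from the snippet above) is to inspect 
{{BroadcastHashJoinExec.buildSide}}:

{code}
// Expected to print BuildRight with the behavior described above, even though
// table2 (t2) is the larger side.
bl.collect { case j: BroadcastHashJoinExec => j.buildSide }.foreach(println)
{code}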




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22916) shouldn't bias towards build right if user does not specify

2017-12-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22916:


Assignee: Apache Spark

> shouldn't bias towards build right if user does not specify
> ---
>
> Key: SPARK-22916
> URL: https://issues.apache.org/jira/browse/SPARK-22916
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Feng Liu
>Assignee: Apache Spark
>
> This is an issue very similar to SPARK-22489. When there are no broadcast 
> hints, the current spark strategies will prefer to build right, without 
> considering the sizes of the two sides. To reproduce:
> {code:java}
> import org.apache.spark.sql.execution.joins.BroadcastHashJoinExec
> spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", 
> "value").createTempView("table1")
> spark.createDataFrame(Seq((1, "1"), (2, "2"), (3, "3"))).toDF("key", 
> "value").createTempView("table2")
> val bl = sql(s"SELECT * FROM table1 t1 JOIN table2 t2 ON t1.key = 
> t2.key").queryExecution.executedPlan
> {code}
> The plan is going to broadcast right side (`t2`), even though it is larger.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22916) shouldn't bias towards build right if user does not specify

2017-12-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22916:


Assignee: (was: Apache Spark)

> shouldn't bias towards build right if user does not specify
> ---
>
> Key: SPARK-22916
> URL: https://issues.apache.org/jira/browse/SPARK-22916
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Feng Liu
>
> This is an issue very similar to SPARK-22489. When there are no broadcast 
> hints, the current spark strategies will prefer to build right, without 
> considering the sizes of the two sides. To reproduce:
> {code:java}
> import org.apache.spark.sql.execution.joins.BroadcastHashJoinExec
> spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", 
> "value").createTempView("table1")
> spark.createDataFrame(Seq((1, "1"), (2, "2"), (3, "3"))).toDF("key", 
> "value").createTempView("table2")
> val bl = sql(s"SELECT * FROM table1 t1 JOIN table2 t2 ON t1.key = 
> t2.key").queryExecution.executedPlan
> {code}
> The plan is going to broadcast right side (`t2`), even though it is larger.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22916) shouldn't bias towards build right if user does not specify

2017-12-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16304963#comment-16304963
 ] 

Apache Spark commented on SPARK-22916:
--

User 'liufengdb' has created a pull request for this issue:
https://github.com/apache/spark/pull/20099

> shouldn't bias towards build right if user does not specify
> ---
>
> Key: SPARK-22916
> URL: https://issues.apache.org/jira/browse/SPARK-22916
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Feng Liu
>
> This is an issue very similar to SPARK-22489. When there are no broadcast 
> hints, the current spark strategies will prefer to build right, without 
> considering the sizes of the two sides. To reproduce:
> {code:java}
> import org.apache.spark.sql.execution.joins.BroadcastHashJoinExec
> spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", 
> "value").createTempView("table1")
> spark.createDataFrame(Seq((1, "1"), (2, "2"), (3, "3"))).toDF("key", 
> "value").createTempView("table2")
> val bl = sql(s"SELECT * FROM table1 t1 JOIN table2 t2 ON t1.key = 
> t2.key").queryExecution.executedPlan
> {code}
> The plan is going to broadcast right side (`t2`), even though it is larger.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22905) Fix ChiSqSelectorModel save implementation

2017-12-27 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-22905:
--
Component/s: (was: ML)

> Fix ChiSqSelectorModel save implementation
> --
>
> Key: SPARK-22905
> URL: https://issues.apache.org/jira/browse/SPARK-22905
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.2.1
>Reporter: Weichen Xu
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Currently, in `ChiSqSelectorModel`, save:
> {code}
> spark.createDataFrame(dataArray).repartition(1).write...
> {code}
> The default partition number used by createDataFrame is "defaultParallelism", and 
> the current RoundRobinPartitioning won't guarantee that "repartition" produces 
> rows in the same order as the local array. We need to fix it.
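
One possible shape of a fix, sketched only as an illustration (`dataArray` and 
`dataPath` mirror the description and are placeholders), is to build the RDD with a 
single partition up front so that row order follows the local array:

{code}
// Sketch: create the RDD with numSlices = 1 instead of relying on repartition(1),
// so the single partition preserves the local array's order.
val singlePartitionRDD = spark.sparkContext.parallelize(dataArray.toSeq, numSlices = 1)
spark.createDataFrame(singlePartitionRDD).write.parquet(dataPath)
{code}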



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22905) Fix ChiSqSelectorModel save implementation

2017-12-27 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-22905:
--
Shepherd: Joseph K. Bradley

> Fix ChiSqSelectorModel save implementation
> --
>
> Key: SPARK-22905
> URL: https://issues.apache.org/jira/browse/SPARK-22905
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.2.1
>Reporter: Weichen Xu
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Currently, in `ChiSqSelectorModel`, save:
> {code}
> spark.createDataFrame(dataArray).repartition(1).write...
> {code}
> The default partition number used by createDataFrame is "defaultParallelism", and 
> the current RoundRobinPartitioning won't guarantee that "repartition" produces 
> rows in the same order as the local array. We need to fix it.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22888) OneVsRestModel does not work with Structured Streaming

2017-12-27 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-22888.
---
Resolution: Duplicate

> OneVsRestModel does not work with Structured Streaming
> --
>
> Key: SPARK-22888
> URL: https://issues.apache.org/jira/browse/SPARK-22888
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> OneVsRestModel uses Dataset.persist, which does not work with streaming.  
> This should be avoided when the input is a streaming Dataset.
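
A minimal sketch of the kind of guard this implies (illustrative only, not the actual 
patch):

{code}
import org.apache.spark.sql.Dataset
import org.apache.spark.storage.StorageLevel

// Persist only batch inputs; streaming Datasets do not support persist().
def persistIfBatch[T](ds: Dataset[T]): Dataset[T] =
  if (ds.isStreaming) ds else ds.persist(StorageLevel.MEMORY_AND_DISK)
{code}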



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22888) OneVsRestModel does not work with Structured Streaming

2017-12-27 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-22888:
--
Target Version/s:   (was: 2.3.0)

> OneVsRestModel does not work with Structured Streaming
> --
>
> Key: SPARK-22888
> URL: https://issues.apache.org/jira/browse/SPARK-22888
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> OneVsRestModel uses Dataset.persist, which does not work with streaming.  
> This should be avoided when the input is a streaming Dataset.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22899) OneVsRestModel transform on streaming data failed.

2017-12-27 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-22899:
-

Assignee: Weichen Xu

> OneVsRestModel transform on streaming data failed.
> --
>
> Key: SPARK-22899
> URL: https://issues.apache.org/jira/browse/SPARK-22899
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Structured Streaming
>Affects Versions: 2.2.1
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>
> OneVsRestModel transform on streaming data failed.
> This is because it persists the input dataset, which streaming does not support.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22899) OneVsRestModel transform on streaming data failed.

2017-12-27 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-22899.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

Resolved by https://github.com/apache/spark/pull/20077

> OneVsRestModel transform on streaming data failed.
> --
>
> Key: SPARK-22899
> URL: https://issues.apache.org/jira/browse/SPARK-22899
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Structured Streaming
>Affects Versions: 2.2.1
>Reporter: Weichen Xu
>Assignee: Weichen Xu
> Fix For: 2.3.0
>
>
> OneVsRestModel transform on streaming data failed.
> This is because it persists the input dataset, which streaming does not support.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22913) Hive Partition Pruning, Fractional and Timestamp types

2017-12-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22913:


Assignee: Apache Spark

> Hive Partition Pruning, Fractional and Timestamp types
> --
>
> Key: SPARK-22913
> URL: https://issues.apache.org/jira/browse/SPARK-22913
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ameen Tayyebi
>Assignee: Apache Spark
> Fix For: 2.3.0
>
>
> Spark currently pushes the predicates it has in the SQL query to Hive 
> Metastore. This only applies to predicates that are placed on top of 
> partitioning columns. As more and more hive metastore implementations come 
> around, this is an important optimization to allow data to be prefiltered to 
> only relevant partitions. Consider the following example:
> Table:
> create external table data (key string, quantity long)
> partitioned by (processing-date timestamp)
> Query:
> select * from data where processing-date = '2017-10-23 00:00:00'
> Currently, no filters will be pushed to the hive metastore for the above 
> query. The reason is that the code that tries to compute predicates to be 
> sent to hive metastore, only deals with integral and string column types. It 
> doesn't know how to handle fractional and timestamp columns.
> I have tables in my metastore (AWS Glue) with millions of partitions of type 
> timestamp and double. In my specific case, it takes Spark's master node about 
> 6.5 minutes to download all partitions for the table, and then filter the 
> partitions client-side. The actual processing time of my query is only 6 
> seconds. In other words, without partition pruning, I'm looking at 6.5 
> minutes of processing and with partition pruning, I'm looking at 6 seconds 
> only.
> I have a fix for this developed locally that I'll provide shortly as a pull 
> request.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22913) Hive Partition Pruning, Fractional and Timestamp types

2017-12-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22913:


Assignee: (was: Apache Spark)

> Hive Partition Pruning, Fractional and Timestamp types
> --
>
> Key: SPARK-22913
> URL: https://issues.apache.org/jira/browse/SPARK-22913
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ameen Tayyebi
> Fix For: 2.3.0
>
>
> Spark currently pushes the predicates it has in the SQL query to Hive 
> Metastore. This only applies to predicates that are placed on top of 
> partitioning columns. As more and more hive metastore implementations come 
> around, this is an important optimization to allow data to be prefiltered to 
> only relevant partitions. Consider the following example:
> Table:
> create external table data (key string, quantity long)
> partitioned by (processing-date timestamp)
> Query:
> select * from data where processing-date = '2017-10-23 00:00:00'
> Currently, no filters will be pushed to the hive metastore for the above 
> query. The reason is that the code that tries to compute predicates to be 
> sent to hive metastore, only deals with integral and string column types. It 
> doesn't know how to handle fractional and timestamp columns.
> I have tables in my metastore (AWS Glue) with millions of partitions of type 
> timestamp and double. In my specific case, it takes Spark's master node about 
> 6.5 minutes to download all partitions for the table, and then filter the 
> partitions client-side. The actual processing time of my query is only 6 
> seconds. In other words, without partition pruning, I'm looking at 6.5 
> minutes of processing and with partition pruning, I'm looking at 6 seconds 
> only.
> I have a fix for this developed locally that I'll provide shortly as a pull 
> request.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22913) Hive Partition Pruning, Fractional and Timestamp types

2017-12-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16305017#comment-16305017
 ] 

Apache Spark commented on SPARK-22913:
--

User 'ameent' has created a pull request for this issue:
https://github.com/apache/spark/pull/20100

> Hive Partition Pruning, Fractional and Timestamp types
> --
>
> Key: SPARK-22913
> URL: https://issues.apache.org/jira/browse/SPARK-22913
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ameen Tayyebi
> Fix For: 2.3.0
>
>
> Spark currently pushes the predicates it has in the SQL query to Hive 
> Metastore. This only applies to predicates that are placed on top of 
> partitioning columns. As more and more hive metastore implementations come 
> around, this is an important optimization to allow data to be prefiltered to 
> only relevant partitions. Consider the following example:
> Table:
> create external table data (key string, quantity long)
> partitioned by (processing-date timestamp)
> Query:
> select * from data where processing-date = '2017-10-23 00:00:00'
> Currently, no filters will be pushed to the hive metastore for the above 
> query. The reason is that the code that tries to compute predicates to be 
> sent to hive metastore, only deals with integral and string column types. It 
> doesn't know how to handle fractional and timestamp columns.
> I have tables in my metastore (AWS Glue) with millions of partitions of type 
> timestamp and double. In my specific case, it takes Spark's master node about 
> 6.5 minutes to download all partitions for the table, and then filter the 
> partitions client-side. The actual processing time of my query is only 6 
> seconds. In other words, without partition pruning, I'm looking at 6.5 
> minutes of processing and with partition pruning, I'm looking at 6 seconds 
> only.
> I have a fix for this developed locally that I'll provide shortly as a pull 
> request.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22913) Hive Partition Pruning, Fractional and Timestamp types

2017-12-27 Thread Ameen Tayyebi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16305018#comment-16305018
 ] 

Ameen Tayyebi commented on SPARK-22913:
---

Note: I'll be away since December 30th until January 18th so I'll be checking 
up on this issue and the pull request at that time.

> Hive Partition Pruning, Fractional and Timestamp types
> --
>
> Key: SPARK-22913
> URL: https://issues.apache.org/jira/browse/SPARK-22913
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ameen Tayyebi
> Fix For: 2.3.0
>
>
> Spark currently pushes the predicates it has in the SQL query to Hive 
> Metastore. This only applies to predicates that are placed on top of 
> partitioning columns. As more and more hive metastore implementations come 
> around, this is an important optimization to allow data to be prefiltered to 
> only relevant partitions. Consider the following example:
> Table:
> create external table data (key string, quantity long)
> partitioned by (processing-date timestamp)
> Query:
> select * from data where processing-date = '2017-10-23 00:00:00'
> Currently, no filters will be pushed to the hive metastore for the above 
> query. The reason is that the code that tries to compute predicates to be 
> sent to hive metastore, only deals with integral and string column types. It 
> doesn't know how to handle fractional and timestamp columns.
> I have tables in my metastore (AWS Glue) with millions of partitions of type 
> timestamp and double. In my specific case, it takes Spark's master node about 
> 6.5 minutes to download all partitions for the table, and then filter the 
> partitions client-side. The actual processing time of my query is only 6 
> seconds. In other words, without partition pruning, I'm looking at 6.5 
> minutes of processing and with partition pruning, I'm looking at 6 seconds 
> only.
> I have a fix for this developed locally that I'll provide shortly as a pull 
> request.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22913) Hive Partition Pruning, Fractional and Timestamp types

2017-12-27 Thread Ameen Tayyebi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16305018#comment-16305018
 ] 

Ameen Tayyebi edited comment on SPARK-22913 at 12/28/17 3:31 AM:
-

Note: I'll be away from December 30th until January 18th so I'll be checking up 
on this issue and the pull request at that time.


was (Author: ameen.tayy...@gmail.com):
Note: I'll be away since December 30th until January 18th so I'll be checking 
up on this issue and the pull request at that time.

> Hive Partition Pruning, Fractional and Timestamp types
> --
>
> Key: SPARK-22913
> URL: https://issues.apache.org/jira/browse/SPARK-22913
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ameen Tayyebi
> Fix For: 2.3.0
>
>
> Spark currently pushes the predicates it has in the SQL query to Hive 
> Metastore. This only applies to predicates that are placed on top of 
> partitioning columns. As more and more hive metastore implementations come 
> around, this is an important optimization to allow data to be prefiltered to 
> only relevant partitions. Consider the following example:
> Table:
> create external table data (key string, quantity long)
> partitioned by (processing-date timestamp)
> Query:
> select * from data where processing-date = '2017-10-23 00:00:00'
> Currently, no filters will be pushed to the hive metastore for the above 
> query. The reason is that the code that tries to compute predicates to be 
> sent to hive metastore, only deals with integral and string column types. It 
> doesn't know how to handle fractional and timestamp columns.
> I have tables in my metastore (AWS Glue) with millions of partitions of type 
> timestamp and double. In my specific case, it takes Spark's master node about 
> 6.5 minutes to download all partitions for the table, and then filter the 
> partitions client-side. The actual processing time of my query is only 6 
> seconds. In other words, without partition pruning, I'm looking at 6.5 
> minutes of processing and with partition pruning, I'm looking at 6 seconds 
> only.
> I have a fix for this developed locally that I'll provide shortly as a pull 
> request.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22883) ML test for StructuredStreaming: spark.ml.feature, A-M

2017-12-27 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16305032#comment-16305032
 ] 

Joseph K. Bradley commented on SPARK-22883:
---

I'll work on this.

> ML test for StructuredStreaming: spark.ml.feature, A-M
> --
>
> Key: SPARK-22883
> URL: https://issues.apache.org/jira/browse/SPARK-22883
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>
> *For featurizers with names from A - M*
> Task for adding Structured Streaming tests for all Models/Transformers in a 
> sub-module in spark.ml
> For an example, see LinearRegressionSuite.scala in 
> https://github.com/apache/spark/pull/19843



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22909) Move Structured Streaming v2 APIs to streaming package

2017-12-27 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-22909.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 20093
[https://github.com/apache/spark/pull/20093]

> Move Structured Streaming v2 APIs to streaming package
> --
>
> Key: SPARK-22909
> URL: https://issues.apache.org/jira/browse/SPARK-22909
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Blocker
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22757) Init-container in the driver/executor pods for downloading remote dependencies

2017-12-27 Thread Takuya Ueshin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin reassigned SPARK-22757:
-

Assignee: Yinan Li

> Init-container in the driver/executor pods for downloading remote dependencies
> --
>
> Key: SPARK-22757
> URL: https://issues.apache.org/jira/browse/SPARK-22757
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Yinan Li
>Assignee: Yinan Li
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22757) Init-container in the driver/executor pods for downloading remote dependencies

2017-12-27 Thread Takuya Ueshin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-22757.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19954
[https://github.com/apache/spark/pull/19954]

> Init-container in the driver/executor pods for downloading remote dependencies
> --
>
> Key: SPARK-22757
> URL: https://issues.apache.org/jira/browse/SPARK-22757
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Yinan Li
>Assignee: Yinan Li
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7721) Generate test coverage report from Python

2017-12-27 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16305103#comment-16305103
 ] 

Hyukjin Kwon commented on SPARK-7721:
-

Hey [~rxin], I think I made it work now with a few modifications to the script, 
forcing {{worker.py}} to produce the coverage results.
I ran it with Python 3 and Coverage 4.4, all tests passed, and I just updated the 
site - https://spark-test.github.io/pyspark-coverage-site

FYI, here is the diff I used in the main code to force it to produce them (roughly 
15 added lines):

{code}
diff --git a/python/pyspark/worker.py b/python/pyspark/worker.py
index e6737ae1c12..088debcf796 100644
--- a/python/pyspark/worker.py
+++ b/python/pyspark/worker.py
@@ -159,7 +159,7 @@ def read_udfs(pickleSer, infile, eval_type):
     return func, None, ser, ser


-def main(infile, outfile):
+def _main(infile, outfile):
     try:
         boot_time = time.time()
         split_index = read_int(infile)
@@ -259,6 +259,22 @@ def main(infile, outfile):
         exit(-1)


+if "COVERAGE_PROCESS_START" in os.environ:
+    def _cov_wrapped(*args, **kwargs):
+        import coverage
+        cov = coverage.coverage(
+            config_file=os.environ["COVERAGE_PROCESS_START"])
+        cov.start()
+        try:
+            _main(*args, **kwargs)
+        finally:
+            cov.stop()
+            cov.save()
+    main = _cov_wrapped
+else:
+    main = _main
+
+
 if __name__ == '__main__':
     # Read a local port to connect to from stdin
     java_port = int(sys.stdin.readline())
{code}



> Generate test coverage report from Python
> -
>
> Key: SPARK-7721
> URL: https://issues.apache.org/jira/browse/SPARK-7721
> Project: Spark
>  Issue Type: Test
>  Components: PySpark, Tests
>Reporter: Reynold Xin
>
> Would be great to have test coverage report for Python. Compared with Scala, 
> it is tricker to understand the coverage without coverage reports in Python 
> because we employ both docstring tests and unit tests in test files. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7721) Generate test coverage report from Python

2017-12-27 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16305108#comment-16305108
 ] 

Hyukjin Kwon commented on SPARK-7721:
-

I roughly checked the coverage results and they seem fine. There is one trivial nit 
though - 
https://github.com/apache/spark/blob/04e44b37cc04f62fbf9e08c7076349e0a4d12ea8/python/pyspark/daemon.py#L148-L169
 this scope is not in the coverage results, basically because I am producing the 
coverage results in {{worker.py}} separately and then merging them. I believe 
it's not a big deal.

So, if you are fine with all of this now, how about I proceed with this in two PRs?

1. Adding the script only (of course after cleaning up)

   Adding the script alone should also be useful when reviewers check PRs; they can 
at least run it manually.

2. Integrating with Jenkins

  I have two thoughts for this:

  - Simplest one: Only run it on a specific master in Jenkins, and always keep only 
a single up-to-date coverage site. It's simple; we can just push 
it. I think this is quite straightforward and pretty feasible. 

  - Another one: I make a simple site in git pages that lists the coverage of all 
other builds (including PR builds), and then leave a link in each 
PR's Jenkins build success message. I think this is also feasible, but I 
need to take a look further.

BTW, I will be able to start working on this from next week, or two weeks after.

> Generate test coverage report from Python
> -
>
> Key: SPARK-7721
> URL: https://issues.apache.org/jira/browse/SPARK-7721
> Project: Spark
>  Issue Type: Test
>  Components: PySpark, Tests
>Reporter: Reynold Xin
>
> Would be great to have test coverage report for Python. Compared with Scala, 
> it is tricker to understand the coverage without coverage reports in Python 
> because we employ both docstring tests and unit tests in test files. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-7721) Generate test coverage report from Python

2017-12-27 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16305108#comment-16305108
 ] 

Hyukjin Kwon edited comment on SPARK-7721 at 12/28/17 7:11 AM:
---

I roughly checked the coverage results and they seem fine. There is one trivial nit 
though - 
https://github.com/apache/spark/blob/04e44b37cc04f62fbf9e08c7076349e0a4d12ea8/python/pyspark/daemon.py#L148-L169
 this scope is not in the coverage results, basically because I am producing the 
coverage results in {{worker.py}} separately and then merging them. I believe 
it's not a big deal.

So, if you are fine with all of this now, how about I proceed with this in two PRs?

1. Adding the script only (of course after cleaning up)

   Adding the script alone should also be useful when reviewers check PRs; they can 
at least run it manually.

2. Integrating with Jenkins

  I have two thoughts for this:

  - Simplest one: Only run it on a specific master in Jenkins, and always keep only 
a single up-to-date coverage site. It's simple; we can just push 
it. I think this is quite straightforward and pretty feasible. 

  - Another one: I make a simple site in the git pages that lists the coverage of all 
other builds (including PR builds). We push the coverage html from Jenkins, 
and then leave a link in each PR's Jenkins build success message. I 
think this is also feasible, but I need to take a look further.

BTW, I will be able to start working on this from next week, or two weeks after.


was (Author: hyukjin.kwon):
I roughly checked the coverage results and they seem fine. There is one trivial nit 
though - 
https://github.com/apache/spark/blob/04e44b37cc04f62fbf9e08c7076349e0a4d12ea8/python/pyspark/daemon.py#L148-L169
 this scope is not in the coverage results, basically because I am producing the 
coverage results in {{worker.py}} separately and then merging them. I believe 
it's not a big deal.

So, if you are fine with all of this now, how about I proceed with this in two PRs?

1. Adding the script only (of course after cleaning up)

   Adding the script alone should also be useful when reviewers check PRs; they can 
at least run it manually.

2. Integrating with Jenkins

  I have two thoughts for this:

  - Simplest one: Only run it on a specific master in Jenkins, and always keep only 
a single up-to-date coverage site. It's simple; we can just push 
it. I think this is quite straightforward and pretty feasible. 

  - Another one: I make a simple site in git pages that lists the coverage of all 
other builds (including PR builds), and then leave a link in each 
PR's Jenkins build success message. I think this is also feasible, but I 
need to take a look further.

BTW, I will be able to start working on this from next week, or two weeks after.

> Generate test coverage report from Python
> -
>
> Key: SPARK-7721
> URL: https://issues.apache.org/jira/browse/SPARK-7721
> Project: Spark
>  Issue Type: Test
>  Components: PySpark, Tests
>Reporter: Reynold Xin
>
> Would be great to have test coverage report for Python. Compared with Scala, 
> it is tricker to understand the coverage without coverage reports in Python 
> because we employ both docstring tests and unit tests in test files. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org