[jira] [Commented] (SPARK-24858) Avoid unnecessary parquet footer reads
[ https://issues.apache.org/jira/browse/SPARK-24858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548880#comment-16548880 ] Apache Spark commented on SPARK-24858: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/21814 > Avoid unnecessary parquet footer reads > -- > > Key: SPARK-24858 > URL: https://issues.apache.org/jira/browse/SPARK-24858 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Priority: Major > > Currently the same Parquet footer is read twice in the function > `buildReaderWithPartitionValues` of ParquetFileFormat if filter push down is > enabled. > Fix it with simple changes. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
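For context, a minimal sketch of the idea behind the fix (illustrative only, not the code in the PR above): read each file's footer once via parquet-hadoop and reuse the resulting ParquetMetadata for both the filter-pushdown path and reader construction, instead of calling readFooter in each place.

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.format.converter.ParquetMetadataConverter.SKIP_ROW_GROUPS
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.metadata.ParquetMetadata

// Read the footer a single time per file and pass the result to both
// consumers, so enabling filter pushdown does not trigger a second read.
def readFooterOnce(conf: Configuration, filePath: Path): ParquetMetadata =
  ParquetFileReader.readFooter(conf, filePath, SKIP_ROW_GROUPS)
{code}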
[jira] [Assigned] (SPARK-24858) Avoid unnecessary parquet footer reads
[ https://issues.apache.org/jira/browse/SPARK-24858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24858: Assignee: Apache Spark > Avoid unnecessary parquet footer reads > -- > > Key: SPARK-24858 > URL: https://issues.apache.org/jira/browse/SPARK-24858 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > > Currently the same Parquet footer is read twice in the function > `buildReaderWithPartitionValues` of ParquetFileFormat if filter push down is > enabled. > Fix it with simple changes. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24858) Avoid unnecessary parquet footer reads
[ https://issues.apache.org/jira/browse/SPARK-24858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24858: Assignee: (was: Apache Spark) > Avoid unnecessary parquet footer reads > -- > > Key: SPARK-24858 > URL: https://issues.apache.org/jira/browse/SPARK-24858 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Priority: Major > > Currently the same Parquet footer is read twice in the function > `buildReaderWithPartitionValues` of ParquetFileFormat if filter push down is > enabled. > Fix it with simple changes. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24858) Avoid unnecessary parquet footer reads
Gengliang Wang created SPARK-24858: -- Summary: Avoid unnecessary parquet footer reads Key: SPARK-24858 URL: https://issues.apache.org/jira/browse/SPARK-24858 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Gengliang Wang Currently the same Parquet footer is read twice in the function `buildReaderWithPartitionValues` of ParquetFileFormat if filter push down is enabled. Fix it with simple changes. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24849) Convert StructType to DDL string
[ https://issues.apache.org/jira/browse/SPARK-24849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548878#comment-16548878 ] Maxim Gekk commented on SPARK-24849: [~maropu] This is part of my work on a customer's issue. There are multiple folders of AVRO files with fairly wide and nested schemas. I need to programmatically create tables on top of each folder. To do that, I read a file in a folder via the Scala API, take its schema, convert it to a DDL string (this is where I need the changes) and put the string into a SQL CREATE TABLE statement. > Convert StructType to DDL string > > > Key: SPARK-24849 > URL: https://issues.apache.org/jira/browse/SPARK-24849 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Priority: Minor > > Need to add new methods which convert a value of StructType to a > schema in DDL format. It should be possible to use the resulting string in new > table creation by simply copy-pasting the new method's output. The existing > methods simpleString(), catalogString() and sql() put ':' between a top-level > field name and its type, and wrap the whole result in the *struct* word > {code} > ds.schema.catalogString > struct<...> {code} > Output of the new method should be > {code} > metaData struct<...> {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
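A rough sketch of the kind of helper being requested (the method name and exact output format are illustrative, not the final API): render each top-level field as "name TYPE" so the string can be pasted directly into a CREATE TABLE statement.

{code:scala}
import org.apache.spark.sql.types.StructType

// Illustrative only: join top-level fields as "<name> <SQL type>",
// e.g. "metaData STRUCT<...>", instead of catalogString's "struct<name:type,...>".
def toDDLString(schema: StructType): String =
  schema.fields.map(f => s"${f.name} ${f.dataType.sql}").mkString(", ")
{code}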
[jira] [Created] (SPARK-24857) Need sample code to test a Spark Streaming job on Kubernetes that writes data to a remote HDFS file system
kumpatla murali krishna created SPARK-24857: --- Summary: Need sample code to test a Spark Streaming job on Kubernetes that writes data to a remote HDFS file system Key: SPARK-24857 URL: https://issues.apache.org/jira/browse/SPARK-24857 Project: Spark Issue Type: Test Components: Kubernetes, Spark Submit Affects Versions: 2.3.1 Reporter: kumpatla murali krishna The following submission ./bin/spark-submit --master k8s://https://api.kubernates.aws.phenom.local --deploy-mode cluster --name spark-pi --class com.phenom.analytics.executor.SummarizationJobExecutor --conf spark.executor.instances=5 --conf spark.kubernetes.container.image=phenommurali/spark_new --jars hdfs://test-dev.com:8020/user/spark/jobs/Test_jar_without_jars.jar fails with the following error: Normal SuccessfulMountVolume 2m kubelet, ip-x.ec2.internal MountVolume.SetUp succeeded for volume "download-files-volume" Warning FailedMount 2m kubelet, ip-.ec2.internal MountVolume.SetUp failed for volume "spark-init-properties" : configmaps "spark-pi-b5be4308783c3c479c6bf2f9da9b49dc-init-config" not found -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12126) JDBC datasource processes filters only commonly pushed down.
[ https://issues.apache.org/jira/browse/SPARK-12126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548809#comment-16548809 ] Hyukjin Kwon commented on SPARK-12126: -- See the comment in the PR I left. > JDBC datasource processes filters only commonly pushed down. > > > Key: SPARK-12126 > URL: https://issues.apache.org/jira/browse/SPARK-12126 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Hyukjin Kwon >Priority: Major > > As suggested > [here|https://issues.apache.org/jira/browse/SPARK-9182?focusedCommentId=14955646&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14955646], > Currently JDBC datasource only processes the filters pushed down from > {{DataSourceStrategy}}. > Unlike ORC or Parquet, this can process pretty a lot of filters (for example, > a + b > 3) since it is just about string parsing. > As > [here|https://issues.apache.org/jira/browse/SPARK-9182?focusedCommentId=15031526&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15031526], > using {{CatalystScan}} trait might be one of solutions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
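To make the limitation concrete, here is a hedged illustration (the JDBC URL and table name are hypothetical placeholders): a predicate over an arithmetic expression cannot be represented as an org.apache.spark.sql.sources.Filter, so today it is evaluated inside Spark rather than pushed to the database as a WHERE clause such as "a + b > 3".

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("jdbc-pushdown-example").getOrCreate()

// Hypothetical JDBC source; url and dbtable are placeholders.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db.example.com/sales")
  .option("dbtable", "t")
  .load()

// This filter is not expressible as a sources.Filter, so it stays in
// Spark's physical plan instead of being pushed down as "WHERE a + b > 3".
df.filter(col("a") + col("b") > 3).explain()
{code}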
[jira] [Commented] (SPARK-24375) Design sketch: support barrier scheduling in Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-24375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548784#comment-16548784 ] Jiang Xingbo commented on SPARK-24375: -- {quote}Is the 'barrier' logic pluggable ? Instead of only being a global sync point. {quote} The barrier() function is quite like [MPI_Barrier|https://www.mpich.org/static/docs/v3.2.1/www/www3/MPI_Barrier.html] function in MPI, the major purpose is to provide a way to do global sync between barrier tasks. I'm not sure whether we have plan to support pluggable logic for now, do you have a case in hand that require pluggable barrier() ? {quote}Dynamic resource allocation (dra) triggers allocation of additional resources based on pending tasks - hence the comment We may add a check of total available slots before scheduling tasks from a barrier stage taskset. does not necessarily work in that context. {quote} Support running barrier stage with dynamic resource allocation is a Non-Goal here, however, we can improve the behavior to integrate better with DRA in Spark 3.0 . {quote}Currently DRA in spark uniformly allocates resources - are we envisioning changes as part of this effort to allocate heterogenous executor resources based on pending tasks (atleast initially for barrier support for gpu's) ? {quote} There is another ongoing SPIP SPARK-24615 to add accelerator-aware task scheduling for Spark, I think we shall deal with the above issue within that topic. {quote}In face of exceptions, some tasks will wait on barrier 2 and others on barrier 1 : causing issues.{quote} It's not desired behavior to catch exception thrown by TaskContext.barrier() silently. However, in case this really happens, we can detect that because we have `epoch` both in driver side and executor side, more details will go to the design doc of BarrierTaskContext.barrier() SPARK-24581 {quote}Can you elaborate more on leveraging TaskContext.localProperties ? Is it expected to be sync'ed after 'barrier' returns ? What gaurantees are we expecting to provide ?{quote} We update the localProperties in driver and in executors you shall be able to fetch the updated values through TaskContext, it should not couple with `barrier()` function. > Design sketch: support barrier scheduling in Apache Spark > - > > Key: SPARK-24375 > URL: https://issues.apache.org/jira/browse/SPARK-24375 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Jiang Xingbo >Priority: Major > > This task is to outline a design sketch for the barrier scheduling SPIP > discussion. It doesn't need to be a complete design before the vote. But it > should at least cover both Scala/Java and PySpark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
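For readers following the discussion, a hedged sketch of how the proposed barrier API is expected to be used (class and method names such as BarrierTaskContext follow the SPIP direction and may change before the final release):

{code:scala}
import org.apache.spark.{BarrierTaskContext, SparkContext}

// Every task in the barrier stage blocks at barrier() until all tasks in the
// stage have reached it -- a global sync point, analogous to MPI_Barrier.
def runBarrierStage(sc: SparkContext): Unit = {
  sc.parallelize(1 to 100, numSlices = 4)
    .barrier()
    .mapPartitions { iter =>
      val context = BarrierTaskContext.get()
      context.barrier()   // wait for all tasks in the stage before proceeding
      iter
    }
    .count()
}
{code}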
[jira] [Created] (SPARK-24856) Spark needs to upgrade Guava to work with gRPC
alibaltschun created SPARK-24856: Summary: Spark needs to upgrade Guava to work with gRPC Key: SPARK-24856 URL: https://issues.apache.org/jira/browse/SPARK-24856 Project: Spark Issue Type: Dependency upgrade Components: Input/Output, Spark Core Affects Versions: 2.3.1 Reporter: alibaltschun Hello, I have a problem loading a Spark model while using gRPC dependencies. I posted the question on StackOverflow, and the answer was that Spark uses an old version of Guava while gRPC requires Guava 20+. That means Spark needs to upgrade its Guava version to fix this issue. Thanks. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
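Until Spark's Guava dependency is modernized, a common user-side workaround is to shade Guava inside the application jar. A hedged sketch for an sbt-assembly build (this assumes the application is packaged with sbt-assembly; it is a user-side workaround, not something Spark provides):

{code:scala}
// In build.sbt: relocate the application's newer Guava so it cannot clash
// with the older Guava already on Spark's classpath.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.google.common.**" -> "shadedguava.com.google.common.@1").inAll
)
{code}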
[jira] [Resolved] (SPARK-24840) do not use dummy filter to switch codegen on/off
[ https://issues.apache.org/jira/browse/SPARK-24840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-24840. - Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21795 [https://github.com/apache/spark/pull/21795] > do not use dummy filter to switch codegen on/off > > > Key: SPARK-24840 > URL: https://issues.apache.org/jira/browse/SPARK-24840 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23967) Description add native sql show in SQL page.
[ https://issues.apache.org/jira/browse/SPARK-23967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548754#comment-16548754 ] guoxiaolongzte commented on SPARK-23967: I don't quite understand what you mean. Could you explain it in more detail? > Description add native sql show in SQL page. > > > Key: SPARK-23967 > URL: https://issues.apache.org/jira/browse/SPARK-23967 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.4.0 >Reporter: JieFang.He >Priority: Minor > > Show the native SQL in the Description column on the SQL page for better observation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24701) SparkMaster WebUI allow all appids to be shown in detail on port 4040 rather than different ports per app
[ https://issues.apache.org/jira/browse/SPARK-24701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548753#comment-16548753 ] guoxiaolongzte commented on SPARK-24701: I don't quite understand what you mean. Could you explain it in more detail? A screenshot would help. > SparkMaster WebUI allow all appids to be shown in detail on port 4040 rather > than different ports per app > - > > Key: SPARK-24701 > URL: https://issues.apache.org/jira/browse/SPARK-24701 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.3.1 >Reporter: t oo >Priority: Major > Labels: master, security, ui, web, web-ui > > Right now the details for all application IDs are shown on a different port per app > ID, i.e. 4040, 4041, 4042, etc. This is problematic for environments with > tight firewall settings. Proposing to allow 4040?appid=1, 4040?appid=2, > 4040?appid=3, etc. for the Master web UI, just like the History web UI > does. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23357) 'SHOW TABLE EXTENDED LIKE pattern=STRING' add ‘Partitioned’ display similar to hive, and partition is empty, also need to show empty partition field []
[ https://issues.apache.org/jira/browse/SPARK-23357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] guoxiaolongzte resolved SPARK-23357. Resolution: Won't Fix > 'SHOW TABLE EXTENDED LIKE pattern=STRING' add ‘Partitioned’ display similar > to hive, and partition is empty, also need to show empty partition field [] > > > Key: SPARK-23357 > URL: https://issues.apache.org/jira/browse/SPARK-23357 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: guoxiaolongzte >Priority: Minor > Attachments: 1.png, 2.png, 3.png, 4.png, 5.png > > > 'SHOW TABLE EXTENDED LIKE pattern=STRING' add ‘Partitioned’ display similar > to hive, and partition is empty, also need to show empty partition field [] . > hive: > !3.png! > sparkSQL Non-partitioned table fix before: > !1.png! > sparkSQL partitioned table fix before: > !2.png! > sparkSQL Non-partitioned table fix after: > !4.png! > sparkSQL partitioned table fix after: > !5.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24851) Map a Stage ID to it's Associated Job ID in UI
[ https://issues.apache.org/jira/browse/SPARK-24851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-24851: -- Target Version/s: (was: 2.3.1) > Map a Stage ID to it's Associated Job ID in UI > -- > > Key: SPARK-24851 > URL: https://issues.apache.org/jira/browse/SPARK-24851 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0, 2.3.1 >Reporter: Parth Gandhi >Priority: Trivial > > It would be nice to have a field in Stage Page UI which would show mapping of > the current stage id to the job id's to which that stage belongs to. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22151) PYTHONPATH not picked up from the spark.yarn.appMasterEnv properly
[ https://issues.apache.org/jira/browse/SPARK-22151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-22151: -- Fix Version/s: (was: 2.4.0) > PYTHONPATH not picked up from the spark.yarn.appMasterEnv properly > -- > > Key: SPARK-22151 > URL: https://issues.apache.org/jira/browse/SPARK-22151 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.1.1 >Reporter: Thomas Graves >Assignee: Parth Gandhi >Priority: Major > > Running in yarn cluster mode and trying to set pythonpath via > spark.yarn.appMasterEnv.PYTHONPATH doesn't work. > the yarn Client code looks at the env variables: > val pythonPathStr = (sys.env.get("PYTHONPATH") ++ pythonPath) > But when you set spark.yarn.appMasterEnv it puts it into the local env. > So the python path set in spark.yarn.appMasterEnv isn't properly set. > You can work around if you are running in cluster mode by setting it on the > client like: > PYTHONPATH=./addon/python/ spark-submit -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
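A hedged sketch of the direction a fix could take (illustrative only, not the merged change): in the YARN Client, also consult the value configured via spark.yarn.appMasterEnv.PYTHONPATH rather than only the submitting process's environment.

{code:scala}
import java.io.File

import org.apache.spark.SparkConf

// Illustrative helper: merge the client-side PYTHONPATH, the value set via
// spark.yarn.appMasterEnv.PYTHONPATH, and Spark's own python path entries.
def buildPythonPath(sparkConf: SparkConf, pythonPath: Seq[String]): String =
  (sys.env.get("PYTHONPATH") ++
    sparkConf.getOption("spark.yarn.appMasterEnv.PYTHONPATH") ++
    pythonPath).mkString(File.pathSeparator)
{code}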
[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark
[ https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548662#comment-16548662 ] Saisai Shao commented on SPARK-24615: - Hi [~tgraves] I'm rewriting the design doc based on the comments mentioned above, so temporarily make it inaccessible, sorry about it, I will reopen it. I think it is hard to control the memory usage per stage/task, because task is running in the executor which shared within a JVM. For CPU, yes I think we can do it, but I'm not sure the usage scenario of it. For the requirement of using different types of machine, what I can think of is leveraging dynamic resource allocation. For example, if user wants run some MPI jobs with barrier enabled, then Spark could allocate some new executors with accelerator resource via cluster manager (for example using node label if it is running on YARN). But I will not target this as a goal in this design, since a) it is a non-goal for barrier scheduler currently; b) it makes the design too complex, would be better to separate to another work. > Accelerator-aware task scheduling for Spark > --- > > Key: SPARK-24615 > URL: https://issues.apache.org/jira/browse/SPARK-24615 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Major > Labels: Hydrogen, SPIP > > In the machine learning area, accelerator card (GPU, FPGA, TPU) is > predominant compared to CPUs. To make the current Spark architecture to work > with accelerator cards, Spark itself should understand the existence of > accelerators and know how to schedule task onto the executors where > accelerators are equipped. > Current Spark’s scheduler schedules tasks based on the locality of the data > plus the available of CPUs. This will introduce some problems when scheduling > tasks with accelerators required. > # CPU cores are usually more than accelerators on one node, using CPU cores > to schedule accelerator required tasks will introduce the mismatch. > # In one cluster, we always assume that CPU is equipped in each node, but > this is not true of accelerator cards. > # The existence of heterogeneous tasks (accelerator required or not) > requires scheduler to schedule tasks with a smart way. > So here propose to improve the current scheduler to support heterogeneous > tasks (accelerator requires or not). This can be part of the work of Project > hydrogen. > Details is attached in google doc. It doesn't cover all the implementation > details, just highlight the parts should be changed. > > CC [~yanboliang] [~merlintang] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24853) Support Column type for withColumn and withColumnRenamed apis
[ https://issues.apache.org/jira/browse/SPARK-24853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548659#comment-16548659 ] Hyukjin Kwon commented on SPARK-24853: -- I don't think we need an API just for consistency. > Support Column type for withColumn and withColumnRenamed apis > - > > Key: SPARK-24853 > URL: https://issues.apache.org/jira/browse/SPARK-24853 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.2 >Reporter: nirav patel >Priority: Major > > Can we add an overloaded version of withColumn or withColumnRenamed that accepts > a Column instead of a String? That way I can specify a fully qualified name when there > are duplicate column names, e.g. if I have 2 columns with the same name as a > result of a join and I want to rename one of the fields, I can do it with this > new API. > > This would be similar to the drop API, which supports both String and Column. > > def > withColumn(colName: Column, col: Column): DataFrame > Returns a new Dataset by adding a column or replacing the existing column > that has the same name. > > def > withColumnRenamed(existingName: Column, newName: Column): DataFrame > Returns a new Dataset with a column renamed. > > > > I think there should also be this one: > > def > withColumnRenamed(existingName: *Column*, newName: *Column*): DataFrame > Returns a new Dataset with a column renamed. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
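For context, the duplicate-column situation can already be untangled today with select and dataset-qualified Column references; the proposed overloads would only make this more direct. A small illustration with hypothetical DataFrames df1 and df2:

{code:scala}
import org.apache.spark.sql.DataFrame

// Both inputs have an "id" column; after the join, withColumnRenamed("id", ...)
// is ambiguous, but qualified Column references can disambiguate today.
def disambiguate(df1: DataFrame, df2: DataFrame): DataFrame = {
  val joined = df1.join(df2, df1("id") === df2("id"))
  joined.select(df1("id").as("left_id"), df2("id").as("right_id"))
}
{code}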
[jira] [Resolved] (SPARK-24854) Gather all options into AvroOptions
[ https://issues.apache.org/jira/browse/SPARK-24854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24854. -- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21810 [https://github.com/apache/spark/pull/21810] > Gather all options into AvroOptions > --- > > Key: SPARK-24854 > URL: https://issues.apache.org/jira/browse/SPARK-24854 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 2.4.0 > > > Need to gather all Avro options into a class like in another datasources - > JSONOptions and CSVOptions. The map inside of the class should be case > insensitive. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24854) Gather all options into AvroOptions
[ https://issues.apache.org/jira/browse/SPARK-24854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-24854: Assignee: Maxim Gekk > Gather all options into AvroOptions > --- > > Key: SPARK-24854 > URL: https://issues.apache.org/jira/browse/SPARK-24854 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 2.4.0 > > > Need to gather all Avro options into a class like in another datasources - > JSONOptions and CSVOptions. The map inside of the class should be case > insensitive. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24855) Built-in AVRO support should support specified schema on write
[ https://issues.apache.org/jira/browse/SPARK-24855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai reassigned SPARK-24855: --- Assignee: Brian Lindblom > Built-in AVRO support should support specified schema on write > -- > > Key: SPARK-24855 > URL: https://issues.apache.org/jira/browse/SPARK-24855 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Brian Lindblom >Assignee: Brian Lindblom >Priority: Minor > Original Estimate: 168h > Remaining Estimate: 168h > > spark-avro appears to have been brought in from an upstream project, > [https://github.com/databricks/spark-avro.] I opened a PR a while ago to > enable support for 'forceSchema', which allows us to specify an AVRO schema > with which to write our records to handle some use cases we have. I didn't > get this code merged but would like to add this feature to the AVRO > reader/writer code that was brought in. The PR is here and I will follow up > with a more formal PR/Patch rebased on spark master branch: > https://github.com/databricks/spark-avro/pull/222 > > This change allows us to specify a schema, which should be compatible with > the schema generated by spark-avro from the dataset definition. This allows > a user to do things like specify default values, change union ordering, or... > in the case where you're reading in an AVRO data set, doing some sort of > in-line field cleansing, then writing out with the original schema, preserve > that original schema in the output container files. I've had several use > cases where this behavior was desired and there were several other asks for > this in the spark-avro project. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24855) Built-in AVRO support should support specified schema on write
[ https://issues.apache.org/jira/browse/SPARK-24855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brian Lindblom updated SPARK-24855: --- Description: spark-avro appears to have been brought in from an upstream project, [https://github.com/databricks/spark-avro.] I opened a PR a while ago to enable support for 'forceSchema', which allows us to specify an AVRO schema with which to write our records to handle some use cases we have. I didn't get this code merged but would like to add this feature to the AVRO reader/writer code that was brought in. The PR is here and I will follow up with a more formal PR/Patch rebased on spark master branch: https://github.com/databricks/spark-avro/pull/222 This change allows us to specify a schema, which should be compatible with the schema generated by spark-avro from the dataset definition. This allows a user to do things like specify default values, change union ordering, or... in the case where you're reading in an AVRO data set, doing some sort of in-line field cleansing, then writing out with the original schema, preserve that original schema in the output container files. I've had several use cases where this behavior was desired and there were several other asks for this in the spark-avro project. was: spark-avro appears to have been brought in from an upstream project, [https://github.com/databricks/spark-avro.] I opened a PR a while ago to enable support for 'forceSchema', which allows us to specify an AVRO schema with which to write our records to handle some use cases we have. I didn't get this code merged but would like to add this feature to the AVRO reader/writer code that was brought in. The PR is here and I will follow up with a more formal PR/Patch rebased on spark master branch. This change allows us to specify a schema, which should be compatible with the schema generated by spark-avro from the dataset definition. This allows a user to do things like specify default values, change union ordering, or... in the case where you're reading in an AVRO data set, doing some sort of in-line field cleansing, then writing out with the original schema, preserve that original schema in the output container files. I've had several use cases where this behavior was desired and there were several other asks for this in the spark-avro project. > Built-in AVRO support should support specified schema on write > -- > > Key: SPARK-24855 > URL: https://issues.apache.org/jira/browse/SPARK-24855 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Brian Lindblom >Priority: Minor > Original Estimate: 168h > Remaining Estimate: 168h > > spark-avro appears to have been brought in from an upstream project, > [https://github.com/databricks/spark-avro.] I opened a PR a while ago to > enable support for 'forceSchema', which allows us to specify an AVRO schema > with which to write our records to handle some use cases we have. I didn't > get this code merged but would like to add this feature to the AVRO > reader/writer code that was brought in. The PR is here and I will follow up > with a more formal PR/Patch rebased on spark master branch: > https://github.com/databricks/spark-avro/pull/222 > > This change allows us to specify a schema, which should be compatible with > the schema generated by spark-avro from the dataset definition. This allows > a user to do things like specify default values, change union ordering, or... 
> in the case where you're reading in an AVRO data set, doing some sort of > in-line field cleansing, then writing out with the original schema, preserve > that original schema in the output container files. I've had several use > cases where this behavior was desired and there were several other asks for > this in the spark-avro project. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24855) Built-in AVRO support should support specified schema on write
Brian Lindblom created SPARK-24855: -- Summary: Built-in AVRO support should support specified schema on write Key: SPARK-24855 URL: https://issues.apache.org/jira/browse/SPARK-24855 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Brian Lindblom spark-avro appears to have been brought in from an upstream project, [https://github.com/databricks/spark-avro.] I opened a PR a while ago to enable support for 'forceSchema', which allows us to specify an AVRO schema with which to write our records to handle some use cases we have. I didn't get this code merged but would like to add this feature to the AVRO reader/writer code that was brought in. The PR is here and I will follow up with a more formal PR/Patch rebased on spark master branch. This change allows us to specify a schema, which should be compatible with the schema generated by spark-avro from the dataset definition. This allows a user to do things like specify default values, change union ordering, or... in the case where you're reading in an AVRO data set, doing some sort of in-line field cleansing, then writing out with the original schema, preserve that original schema in the output container files. I've had several use cases where this behavior was desired and there were several other asks for this in the spark-avro project. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
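A hedged usage sketch of the proposed behavior (the option name 'forceSchema' comes from the databricks/spark-avro pull request referenced above; the built-in Avro source may end up with a different option name):

{code:scala}
import org.apache.spark.sql.DataFrame

// Write records using a caller-supplied Avro schema (as a JSON string) instead
// of the schema derived from the Dataset, preserving defaults and union
// ordering from the original container files.
def writeWithSchema(df: DataFrame, avroSchemaJson: String, path: String): Unit =
  df.write
    .format("avro")
    .option("forceSchema", avroSchemaJson)  // hypothetical option name
    .save(path)
{code}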
[jira] [Commented] (SPARK-24801) Empty byte[] arrays in spark.network.sasl.SaslEncryption$EncryptedMessage can waste a lot of memory
[ https://issues.apache.org/jira/browse/SPARK-24801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548578#comment-16548578 ] Apache Spark commented on SPARK-24801: -- User 'countmdm' has created a pull request for this issue: https://github.com/apache/spark/pull/21811 > Empty byte[] arrays in spark.network.sasl.SaslEncryption$EncryptedMessage can > waste a lot of memory > --- > > Key: SPARK-24801 > URL: https://issues.apache.org/jira/browse/SPARK-24801 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.3.0 >Reporter: Misha Dmitriev >Priority: Major > > I recently analyzed another Yarn NM heap dump with jxray > ([www.jxray.com),|http://www.jxray.com),/] and found that 81% of memory is > wasted by empty (all zeroes) byte[] arrays. Most of these arrays are > referenced by > {{org.apache.spark.network.util.ByteArrayWritableChannel.data}}, and these in > turn come from > {{spark.network.sasl.SaslEncryption$EncryptedMessage.byteChannel}}. Here is > the full reference chain that leads to the problematic arrays: > {code:java} > 2,597,946K (64.1%): byte[]: 40583 / 100% of empty 2,597,946K (64.1%) > ↖org.apache.spark.network.util.ByteArrayWritableChannel.data > ↖org.apache.spark.network.sasl.SaslEncryption$EncryptedMessage.byteChannel > ↖io.netty.channel.ChannelOutboundBuffer$Entry.msg > ↖io.netty.channel.ChannelOutboundBuffer$Entry.{next} > ↖io.netty.channel.ChannelOutboundBuffer.flushedEntry > ↖io.netty.channel.socket.nio.NioSocketChannel$NioSocketChannelUnsafe.outboundBuffer > ↖io.netty.channel.socket.nio.NioSocketChannel.unsafe > ↖org.apache.spark.network.server.OneForOneStreamManager$StreamState.associatedChannel > ↖{java.util.concurrent.ConcurrentHashMap}.values > ↖org.apache.spark.network.server.OneForOneStreamManager.streams > ↖org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.streamManager > ↖org.apache.spark.network.yarn.YarnShuffleService.blockHandler > ↖Java Static org.apache.spark.network.yarn.YarnShuffleService.instance{code} > > Checking the code of {{SaslEncryption$EncryptedMessage}}, I see that > byteChannel is always initialized eagerly in the constructor: > {code:java} > this.byteChannel = new ByteArrayWritableChannel(maxOutboundBlockSize);{code} > So I think to address the problem of empty byte[] arrays flooding the memory, > we should initialize {{byteChannel}} lazily, upon the first use. As far as I > can see, it's used only in one method, {{private void nextChunk()}}. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
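A Scala sketch of the lazy-initialization idea (the real EncryptedMessage lives in Spark's Java network-common module, so this only illustrates the pattern, not the actual patch):

{code:scala}
import org.apache.spark.network.util.ByteArrayWritableChannel

// Allocate the buffer only when the first chunk is actually produced, so
// queued-but-idle messages no longer pin a large all-zero byte[].
class EncryptedMessageSketch(maxOutboundBlockSize: Int) {
  private lazy val byteChannel = new ByteArrayWritableChannel(maxOutboundBlockSize)

  def nextChunk(): Unit = {
    byteChannel.reset()   // first call triggers the allocation
    // ... encrypt and copy the next block into byteChannel ...
  }
}
{code}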
[jira] [Assigned] (SPARK-24801) Empty byte[] arrays in spark.network.sasl.SaslEncryption$EncryptedMessage can waste a lot of memory
[ https://issues.apache.org/jira/browse/SPARK-24801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24801: Assignee: (was: Apache Spark) > Empty byte[] arrays in spark.network.sasl.SaslEncryption$EncryptedMessage can > waste a lot of memory > --- > > Key: SPARK-24801 > URL: https://issues.apache.org/jira/browse/SPARK-24801 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.3.0 >Reporter: Misha Dmitriev >Priority: Major > > I recently analyzed another Yarn NM heap dump with jxray > ([www.jxray.com),|http://www.jxray.com),/] and found that 81% of memory is > wasted by empty (all zeroes) byte[] arrays. Most of these arrays are > referenced by > {{org.apache.spark.network.util.ByteArrayWritableChannel.data}}, and these in > turn come from > {{spark.network.sasl.SaslEncryption$EncryptedMessage.byteChannel}}. Here is > the full reference chain that leads to the problematic arrays: > {code:java} > 2,597,946K (64.1%): byte[]: 40583 / 100% of empty 2,597,946K (64.1%) > ↖org.apache.spark.network.util.ByteArrayWritableChannel.data > ↖org.apache.spark.network.sasl.SaslEncryption$EncryptedMessage.byteChannel > ↖io.netty.channel.ChannelOutboundBuffer$Entry.msg > ↖io.netty.channel.ChannelOutboundBuffer$Entry.{next} > ↖io.netty.channel.ChannelOutboundBuffer.flushedEntry > ↖io.netty.channel.socket.nio.NioSocketChannel$NioSocketChannelUnsafe.outboundBuffer > ↖io.netty.channel.socket.nio.NioSocketChannel.unsafe > ↖org.apache.spark.network.server.OneForOneStreamManager$StreamState.associatedChannel > ↖{java.util.concurrent.ConcurrentHashMap}.values > ↖org.apache.spark.network.server.OneForOneStreamManager.streams > ↖org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.streamManager > ↖org.apache.spark.network.yarn.YarnShuffleService.blockHandler > ↖Java Static org.apache.spark.network.yarn.YarnShuffleService.instance{code} > > Checking the code of {{SaslEncryption$EncryptedMessage}}, I see that > byteChannel is always initialized eagerly in the constructor: > {code:java} > this.byteChannel = new ByteArrayWritableChannel(maxOutboundBlockSize);{code} > So I think to address the problem of empty byte[] arrays flooding the memory, > we should initialize {{byteChannel}} lazily, upon the first use. As far as I > can see, it's used only in one method, {{private void nextChunk()}}. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24801) Empty byte[] arrays in spark.network.sasl.SaslEncryption$EncryptedMessage can waste a lot of memory
[ https://issues.apache.org/jira/browse/SPARK-24801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24801: Assignee: Apache Spark > Empty byte[] arrays in spark.network.sasl.SaslEncryption$EncryptedMessage can > waste a lot of memory > --- > > Key: SPARK-24801 > URL: https://issues.apache.org/jira/browse/SPARK-24801 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.3.0 >Reporter: Misha Dmitriev >Assignee: Apache Spark >Priority: Major > > I recently analyzed another Yarn NM heap dump with jxray > ([www.jxray.com),|http://www.jxray.com),/] and found that 81% of memory is > wasted by empty (all zeroes) byte[] arrays. Most of these arrays are > referenced by > {{org.apache.spark.network.util.ByteArrayWritableChannel.data}}, and these in > turn come from > {{spark.network.sasl.SaslEncryption$EncryptedMessage.byteChannel}}. Here is > the full reference chain that leads to the problematic arrays: > {code:java} > 2,597,946K (64.1%): byte[]: 40583 / 100% of empty 2,597,946K (64.1%) > ↖org.apache.spark.network.util.ByteArrayWritableChannel.data > ↖org.apache.spark.network.sasl.SaslEncryption$EncryptedMessage.byteChannel > ↖io.netty.channel.ChannelOutboundBuffer$Entry.msg > ↖io.netty.channel.ChannelOutboundBuffer$Entry.{next} > ↖io.netty.channel.ChannelOutboundBuffer.flushedEntry > ↖io.netty.channel.socket.nio.NioSocketChannel$NioSocketChannelUnsafe.outboundBuffer > ↖io.netty.channel.socket.nio.NioSocketChannel.unsafe > ↖org.apache.spark.network.server.OneForOneStreamManager$StreamState.associatedChannel > ↖{java.util.concurrent.ConcurrentHashMap}.values > ↖org.apache.spark.network.server.OneForOneStreamManager.streams > ↖org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.streamManager > ↖org.apache.spark.network.yarn.YarnShuffleService.blockHandler > ↖Java Static org.apache.spark.network.yarn.YarnShuffleService.instance{code} > > Checking the code of {{SaslEncryption$EncryptedMessage}}, I see that > byteChannel is always initialized eagerly in the constructor: > {code:java} > this.byteChannel = new ByteArrayWritableChannel(maxOutboundBlockSize);{code} > So I think to address the problem of empty byte[] arrays flooding the memory, > we should initialize {{byteChannel}} lazily, upon the first use. As far as I > can see, it's used only in one method, {{private void nextChunk()}}. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21261) SparkSQL regexpExpressions example
[ https://issues.apache.org/jira/browse/SPARK-21261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-21261: - Assignee: Sean Owen > SparkSQL regexpExpressions example > --- > > Key: SPARK-21261 > URL: https://issues.apache.org/jira/browse/SPARK-21261 > Project: Spark > Issue Type: Documentation > Components: Examples >Affects Versions: 2.1.1 >Reporter: zhangxin >Assignee: Sean Owen >Priority: Major > Fix For: 2.4.0 > > > The follow execute result. > scala> spark.sql(""" select regexp_replace('100-200', '(\d+)', 'num') > """).show > +--+ > |regexp_replace(100-200, (d+), num)| > +--+ > | 100-200| > +--+ > scala> spark.sql(""" select regexp_replace('100-200', '(\\d+)', 'num') > """).show > +---+ > |regexp_replace(100-200, (\d+), num)| > +---+ > |num-num| > +---+ > Add Comment -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21261) SparkSQL regexpExpressions example
[ https://issues.apache.org/jira/browse/SPARK-21261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21261. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21808 [https://github.com/apache/spark/pull/21808] > SparkSQL regexpExpressions example > --- > > Key: SPARK-21261 > URL: https://issues.apache.org/jira/browse/SPARK-21261 > Project: Spark > Issue Type: Documentation > Components: Examples >Affects Versions: 2.1.1 >Reporter: zhangxin >Assignee: Sean Owen >Priority: Major > Fix For: 2.4.0 > > > The follow execute result. > scala> spark.sql(""" select regexp_replace('100-200', '(\d+)', 'num') > """).show > +--+ > |regexp_replace(100-200, (d+), num)| > +--+ > | 100-200| > +--+ > scala> spark.sql(""" select regexp_replace('100-200', '(\\d+)', 'num') > """).show > +---+ > |regexp_replace(100-200, (\d+), num)| > +---+ > |num-num| > +---+ > Add Comment -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21261) SparkSQL regexpExpressions example
[ https://issues.apache.org/jira/browse/SPARK-21261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-21261: -- Priority: Minor (was: Major) > SparkSQL regexpExpressions example > --- > > Key: SPARK-21261 > URL: https://issues.apache.org/jira/browse/SPARK-21261 > Project: Spark > Issue Type: Documentation > Components: Examples >Affects Versions: 2.1.1 >Reporter: zhangxin >Assignee: Sean Owen >Priority: Minor > Fix For: 2.4.0 > > > The follow execute result. > scala> spark.sql(""" select regexp_replace('100-200', '(\d+)', 'num') > """).show > +--+ > |regexp_replace(100-200, (d+), num)| > +--+ > | 100-200| > +--+ > scala> spark.sql(""" select regexp_replace('100-200', '(\\d+)', 'num') > """).show > +---+ > |regexp_replace(100-200, (\d+), num)| > +---+ > |num-num| > +---+ > Add Comment -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24814) Relationship between catalog and datasources
[ https://issues.apache.org/jira/browse/SPARK-24814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-24814: -- Description: This is somewhat related, though not identical to, [~rdblue]'s SPIP on datasources and catalogs. Here are the requirements (IMO) for fully implementing V2 datasources and their relationships to catalogs: # The global catalog should be configurable (the default can be HMS, but it should be overridable). # The default catalog (or an explicitly specified catalog in a query, once multiple catalogs are supported) can determine the V2 datasource to use for reading and writing the data. # Conversely, a V2 datasource can determine which catalog to use for resolution (e.g., if the user issues {{spark.read.format("acmex").table("mytable")}}, the acmex datasource would decide which catalog to use for resolving “mytable”). was: This is somewhat related, though not identical to, Ryan Blue's SPIP on datasources and catalogs. Here are the requirements (IMO) for fully implementing V2 datasources and their relationships to catalogs: # The global catalog should be configurable (the default can be HMS, but it should be overridable). # The default catalog (or an explicitly specified catalog in a query, once multiple catalogs are supported) can determine the V2 datasource to use for reading and writing the data. # Conversely, a V2 datasource can determine which catalog to use for resolution (e.g., if the user issues {{spark.read.format("acmex").table("mytable")}}, the acmex datasource would decide which catalog to use for resolving “mytable”). > Relationship between catalog and datasources > > > Key: SPARK-24814 > URL: https://issues.apache.org/jira/browse/SPARK-24814 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.4.0 >Reporter: Bruce Robbins >Priority: Major > > This is somewhat related, though not identical to, [~rdblue]'s SPIP on > datasources and catalogs. > Here are the requirements (IMO) for fully implementing V2 datasources and > their relationships to catalogs: > # The global catalog should be configurable (the default can be HMS, but it > should be overridable). > # The default catalog (or an explicitly specified catalog in a query, once > multiple catalogs are supported) can determine the V2 datasource to use for > reading and writing the data. > # Conversely, a V2 datasource can determine which catalog to use for > resolution (e.g., if the user issues > {{spark.read.format("acmex").table("mytable")}}, the acmex datasource would > decide which catalog to use for resolving “mytable”). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18186) Migrate HiveUDAFFunction to TypedImperativeAggregate for partial aggregation support
[ https://issues.apache.org/jira/browse/SPARK-18186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548452#comment-16548452 ] Parth Gandhi edited comment on SPARK-18186 at 7/18/18 9:54 PM: --- Hi [~lian cheng], [~yhuai], there has been an issue lately with the library sketches-hive([https://github.com/DataSketches/sketches-hive)] that builds and runs a hive udaf on top of Spark SQL. In their method getNewAggregationBuffer() [https://github.com/DataSketches/sketches-hive/blob/master/src/main/java/com/yahoo/sketches/hive/hll/DataToSketchUDAF.java#L106,] they are initializing different state objects for modes Partial1 and Partial2. Their code used to work well with Spark 2.1 when Spark had support for mode "Complete". However, after it started supporting partial aggregation in Spark 2.2 onwards, their code gives an issue when partial merge is invoked here [https://github.com/DataSketches/sketches-hive/blob/master/src/main/java/com/yahoo/sketches/hive/hll/SketchEvaluator.java#L56], as the wrong state object is being passed in the merge function. I was just trying to understand the PR and wondering why did Spark stop supporting Complete mode in Hive UDAF or is there a way to still run in Complete mode which I am not aware of. Thank you. was (Author: pgandhi): Hi, there has been an issue lately with the library sketches-hive([https://github.com/DataSketches/sketches-hive)] that builds and runs a hive udaf on top of Spark SQL. In their method getNewAggregationBuffer() [https://github.com/DataSketches/sketches-hive/blob/master/src/main/java/com/yahoo/sketches/hive/hll/DataToSketchUDAF.java#L106,] they are initializing different state objects for modes Partial1 and Partial2. Their code used to work well with Spark 2.1 when Spark had support for mode "Complete". However, after it started supporting partial aggregation in Spark 2.2 onwards, their code gives an issue when partial merge is invoked here [https://github.com/DataSketches/sketches-hive/blob/master/src/main/java/com/yahoo/sketches/hive/hll/SketchEvaluator.java#L56], as the wrong state object is being passed in the merge function. I was just trying to understand the PR and wondering why did Spark stop supporting Complete mode in Hive UDAF or is there a way to still run in Complete mode which I am not aware of. Thank you. > Migrate HiveUDAFFunction to TypedImperativeAggregate for partial aggregation > support > > > Key: SPARK-18186 > URL: https://issues.apache.org/jira/browse/SPARK-18186 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.2, 2.0.1 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Major > Fix For: 2.2.0 > > > Currently, Hive UDAFs in Spark SQL don't support partial aggregation. Any > query involving any Hive UDAFs has to fall back to {{SortAggregateExec}} > without partial aggregation. > This issue can be fixed by migrating {{HiveUDAFFunction}} to > {{TypedImperativeAggregate}}, which already provides partial aggregation > support for aggregate functions that may use arbitrary Java objects as > aggregation states. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18186) Migrate HiveUDAFFunction to TypedImperativeAggregate for partial aggregation support
[ https://issues.apache.org/jira/browse/SPARK-18186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548452#comment-16548452 ] Parth Gandhi commented on SPARK-18186: -- Hi, there has been an issue lately with the library sketches-hive([https://github.com/DataSketches/sketches-hive)] that builds and runs a hive udaf on top of Spark SQL. In their method getNewAggregationBuffer() [https://github.com/DataSketches/sketches-hive/blob/master/src/main/java/com/yahoo/sketches/hive/hll/DataToSketchUDAF.java#L106,] they are initializing different state objects for modes Partial1 and Partial2. Their code used to work well with Spark 2.1 when Spark had support for mode "Complete". However, after it started supporting partial aggregation in Spark 2.2 onwards, their code gives an issue when partial merge is invoked here [https://github.com/DataSketches/sketches-hive/blob/master/src/main/java/com/yahoo/sketches/hive/hll/SketchEvaluator.java#L56], as the wrong state object is being passed in the merge function. I was just trying to understand the PR and wondering why did Spark stop supporting Complete mode in Hive UDAF or is there a way to still run in Complete mode which I am not aware of. Thank you. > Migrate HiveUDAFFunction to TypedImperativeAggregate for partial aggregation > support > > > Key: SPARK-18186 > URL: https://issues.apache.org/jira/browse/SPARK-18186 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.2, 2.0.1 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Major > Fix For: 2.2.0 > > > Currently, Hive UDAFs in Spark SQL don't support partial aggregation. Any > query involving any Hive UDAFs has to fall back to {{SortAggregateExec}} > without partial aggregation. > This issue can be fixed by migrating {{HiveUDAFFunction}} to > {{TypedImperativeAggregate}}, which already provides partial aggregation > support for aggregate functions that may use arbitrary Java objects as > aggregation states. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24854) Gather all options into AvroOptions
[ https://issues.apache.org/jira/browse/SPARK-24854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548441#comment-16548441 ] Apache Spark commented on SPARK-24854: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/21810 > Gather all options into AvroOptions > --- > > Key: SPARK-24854 > URL: https://issues.apache.org/jira/browse/SPARK-24854 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Priority: Minor > > Need to gather all Avro options into a class like in another datasources - > JSONOptions and CSVOptions. The map inside of the class should be case > insensitive. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24854) Gather all options into AvroOptions
[ https://issues.apache.org/jira/browse/SPARK-24854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24854: Assignee: Apache Spark > Gather all options into AvroOptions > --- > > Key: SPARK-24854 > URL: https://issues.apache.org/jira/browse/SPARK-24854 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Minor > > Need to gather all Avro options into a class like in another datasources - > JSONOptions and CSVOptions. The map inside of the class should be case > insensitive. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24854) Gather all options into AvroOptions
[ https://issues.apache.org/jira/browse/SPARK-24854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24854: Assignee: (was: Apache Spark) > Gather all options into AvroOptions > --- > > Key: SPARK-24854 > URL: https://issues.apache.org/jira/browse/SPARK-24854 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Priority: Minor > > Need to gather all Avro options into a class like in another datasources - > JSONOptions and CSVOptions. The map inside of the class should be case > insensitive. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24854) Gather all options into AvroOptions
Maxim Gekk created SPARK-24854: -- Summary: Gather all options into AvroOptions Key: SPARK-24854 URL: https://issues.apache.org/jira/browse/SPARK-24854 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.3.1 Reporter: Maxim Gekk Need to gather all Avro options into a class like in another datasources - JSONOptions and CSVOptions. The map inside of the class should be case insensitive. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
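A minimal sketch of what such a class could look like, modeled on JSONOptions/CSVOptions (the field names here are illustrative, not the merged API):

{code:scala}
import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap

// Gather the datasource options behind a case-insensitive map so that, e.g.,
// "avroSchema" and "avroschema" resolve to the same setting.
class AvroOptions(@transient private val parameters: CaseInsensitiveMap[String])
  extends Serializable {

  def this(parameters: Map[String, String]) = this(CaseInsensitiveMap(parameters))

  // Optional user-provided Avro schema (JSON string) -- illustrative field.
  val schema: Option[String] = parameters.get("avroSchema")

  // Whether to also read files without the .avro extension -- illustrative field.
  val ignoreExtension: Boolean =
    parameters.get("ignoreExtension").map(_.toBoolean).getOrElse(false)
}
{code}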
[jira] [Comment Edited] (SPARK-23908) High-order function: transform(array, function) → array
[ https://issues.apache.org/jira/browse/SPARK-23908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548427#comment-16548427 ] Herman van Hovell edited comment on SPARK-23908 at 7/18/18 9:30 PM: Yeah I am, sorry for the hold up. I'll try to have something out ASAP. BTW: I don't see a target version set, the affected version is (which is a bit weird for a feature). was (Author: hvanhovell): Yeah I am, sorry for the hold up. I'll try to have something out ASAP. > High-order function: transform(array, function) → array > --- > > Key: SPARK-23908 > URL: https://issues.apache.org/jira/browse/SPARK-23908 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Herman van Hovell >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Returns an array that is the result of applying function to each element of > array: > {noformat} > SELECT transform(ARRAY [], x -> x + 1); -- [] > SELECT transform(ARRAY [5, 6], x -> x + 1); -- [6, 7] > SELECT transform(ARRAY [5, NULL, 6], x -> COALESCE(x, 0) + 1); -- [6, 1, 7] > SELECT transform(ARRAY ['x', 'abc', 'z'], x -> x || '0'); -- ['x0', 'abc0', > 'z0'] > SELECT transform(ARRAY [ARRAY [1, NULL, 2], ARRAY[3, NULL]], a -> filter(a, x > -> x IS NOT NULL)); -- [[1, 2], [3]] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23908) High-order function: transform(array, function) → array
[ https://issues.apache.org/jira/browse/SPARK-23908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548427#comment-16548427 ] Herman van Hovell commented on SPARK-23908: --- Yeah I am, sorry for the hold up. I'll try to have something out ASAP. > High-order function: transform(array, function) → array > --- > > Key: SPARK-23908 > URL: https://issues.apache.org/jira/browse/SPARK-23908 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Herman van Hovell >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Returns an array that is the result of applying function to each element of > array: > {noformat} > SELECT transform(ARRAY [], x -> x + 1); -- [] > SELECT transform(ARRAY [5, 6], x -> x + 1); -- [6, 7] > SELECT transform(ARRAY [5, NULL, 6], x -> COALESCE(x, 0) + 1); -- [6, 1, 7] > SELECT transform(ARRAY ['x', 'abc', 'z'], x -> x || '0'); -- ['x0', 'abc0', > 'z0'] > SELECT transform(ARRAY [ARRAY [1, NULL, 2], ARRAY[3, NULL]], a -> filter(a, x > -> x IS NOT NULL)); -- [[1, 2], [3]] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23908) High-order function: transform(array, function) → array
[ https://issues.apache.org/jira/browse/SPARK-23908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548421#comment-16548421 ] Frederick Reiss commented on SPARK-23908: - This Jira is marked as "in progress" with the target set to a previous release of Spark. Are you working on this, [~hvanhovell]? > High-order function: transform(array, function) → array > --- > > Key: SPARK-23908 > URL: https://issues.apache.org/jira/browse/SPARK-23908 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Herman van Hovell >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Returns an array that is the result of applying function to each element of > array: > {noformat} > SELECT transform(ARRAY [], x -> x + 1); -- [] > SELECT transform(ARRAY [5, 6], x -> x + 1); -- [6, 7] > SELECT transform(ARRAY [5, NULL, 6], x -> COALESCE(x, 0) + 1); -- [6, 1, 7] > SELECT transform(ARRAY ['x', 'abc', 'z'], x -> x || '0'); -- ['x0', 'abc0', > 'z0'] > SELECT transform(ARRAY [ARRAY [1, NULL, 2], ARRAY[3, NULL]], a -> filter(a, x > -> x IS NOT NULL)); -- [[1, 2], [3]] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
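The examples quoted above are Presto syntax. Assuming the Presto-style lambda syntax carries over to Spark SQL once this sub-task lands, the equivalent call from a Spark session would look roughly like the sketch below; the result comment mirrors the Presto example and is the expected behaviour, not output from a particular build. It assumes a running SparkSession named {{spark}} (e.g. the spark-shell).

{code:scala}
// Hedged sketch of the intended behaviour, not the committed implementation.
spark.sql("SELECT transform(array(5, NULL, 6), x -> coalesce(x, 0) + 1) AS out").show(false)
// Expected, mirroring the Presto example above:
// +---------+
// |out      |
// +---------+
// |[6, 1, 7]|
// +---------+
{code}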
[jira] [Resolved] (SPARK-24129) Add option to pass --build-arg's to docker-image-tool.sh
[ https://issues.apache.org/jira/browse/SPARK-24129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-24129. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21202 [https://github.com/apache/spark/pull/21202] > Add option to pass --build-arg's to docker-image-tool.sh > > > Key: SPARK-24129 > URL: https://issues.apache.org/jira/browse/SPARK-24129 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Devaraj K >Assignee: Devaraj K >Priority: Minor > Fix For: 2.4.0 > > > When we are working behind the firewall, we may need to pass the proxy > details as part of the docker --build-arg parameters to build the image. But > docker-image-tool.sh doesn't provide option to pass the proxy details or the > --build-arg to the docker command. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24129) Add option to pass --build-arg's to docker-image-tool.sh
[ https://issues.apache.org/jira/browse/SPARK-24129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-24129: - Assignee: Devaraj K > Add option to pass --build-arg's to docker-image-tool.sh > > > Key: SPARK-24129 > URL: https://issues.apache.org/jira/browse/SPARK-24129 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Devaraj K >Assignee: Devaraj K >Priority: Minor > Fix For: 2.4.0 > > > When we are working behind the firewall, we may need to pass the proxy > details as part of the docker --build-arg parameters to build the image. But > docker-image-tool.sh doesn't provide option to pass the proxy details or the > --build-arg to the docker command. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure
[ https://issues.apache.org/jira/browse/SPARK-24825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shane knapp resolved SPARK-24825. - Resolution: Fixed PR pushed, builds green, and now we have slightly more spammy build logs! :) thanks [~mcheah] > [K8S][TEST] Kubernetes integration tests don't trace the maven project > dependency structure > --- > > Key: SPARK-24825 > URL: https://issues.apache.org/jira/browse/SPARK-24825 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Tests >Affects Versions: 2.4.0 >Reporter: Matt Cheah >Assignee: Matt Cheah >Priority: Critical > > The Kubernetes integration tests will currently fail if maven installation is > not performed first, because the integration test build believes it should be > pulling the Spark parent artifact from maven central. However, this is > incorrect because the integration test should be building the Spark parent > pom as a dependency in the multi-module build, and the integration test > should just use the dynamically built artifact. Or to put it another way, the > integration test builds should never be pulling Spark dependencies from maven > central. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24852) Have spark.ml training use updated `Instrumentation` APIs.
[ https://issues.apache.org/jira/browse/SPARK-24852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-24852: -- Shepherd: Joseph K. Bradley > Have spark.ml training use updated `Instrumentation` APIs. > -- > > Key: SPARK-24852 > URL: https://issues.apache.org/jira/browse/SPARK-24852 > Project: Spark > Issue Type: Story > Components: ML >Affects Versions: 2.4.0 >Reporter: Bago Amirbekian >Assignee: Bago Amirbekian >Priority: Major > > Port spark.ml code to use the new methods on the `Instrumentation` class and > remove the old methods & constructor. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24852) Have spark.ml training use updated `Instrumentation` APIs.
[ https://issues.apache.org/jira/browse/SPARK-24852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-24852: - Assignee: Bago Amirbekian > Have spark.ml training use updated `Instrumentation` APIs. > -- > > Key: SPARK-24852 > URL: https://issues.apache.org/jira/browse/SPARK-24852 > Project: Spark > Issue Type: Story > Components: ML >Affects Versions: 2.4.0 >Reporter: Bago Amirbekian >Assignee: Bago Amirbekian >Priority: Major > > Port spark.ml code to use the new methods on the `Instrumentation` class and > remove the old methods & constructor. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22151) PYTHONPATH not picked up from the spark.yarn.appMasterEnv properly
[ https://issues.apache.org/jira/browse/SPARK-22151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves reassigned SPARK-22151: - Assignee: Parth Gandhi > PYTHONPATH not picked up from the spark.yarn.appMasterEnv properly > -- > > Key: SPARK-22151 > URL: https://issues.apache.org/jira/browse/SPARK-22151 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.1.1 >Reporter: Thomas Graves >Assignee: Parth Gandhi >Priority: Major > Fix For: 2.4.0 > > > Running in yarn cluster mode and trying to set pythonpath via > spark.yarn.appMasterEnv.PYTHONPATH doesn't work. > the yarn Client code looks at the env variables: > val pythonPathStr = (sys.env.get("PYTHONPATH") ++ pythonPath) > But when you set spark.yarn.appMasterEnv it puts it into the local env. > So the python path set in spark.yarn.appMasterEnv isn't properly set. > You can work around if you are running in cluster mode by setting it on the > client like: > PYTHONPATH=./addon/python/ spark-submit -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22151) PYTHONPATH not picked up from the spark.yarn.appMasterEnv properly
[ https://issues.apache.org/jira/browse/SPARK-22151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-22151: -- Fix Version/s: 2.4.0 > PYTHONPATH not picked up from the spark.yarn.appMasterEnv properly > -- > > Key: SPARK-22151 > URL: https://issues.apache.org/jira/browse/SPARK-22151 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.1.1 >Reporter: Thomas Graves >Assignee: Parth Gandhi >Priority: Major > Fix For: 2.4.0 > > > Running in yarn cluster mode and trying to set pythonpath via > spark.yarn.appMasterEnv.PYTHONPATH doesn't work. > the yarn Client code looks at the env variables: > val pythonPathStr = (sys.env.get("PYTHONPATH") ++ pythonPath) > But when you set spark.yarn.appMasterEnv it puts it into the local env. > So the python path set in spark.yarn.appMasterEnv isn't properly set. > You can work around if you are running in cluster mode by setting it on the > client like: > PYTHONPATH=./addon/python/ spark-submit -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
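For reference, a sketch of the configuration users expect to work in yarn-cluster mode, which this fix is about making effective. Setting it programmatically is shown purely as an illustration and assumes nothing beyond the public SparkConf API; the path is a placeholder.

{code:scala}
import org.apache.spark.SparkConf

// Intent: have the YARN application master pick up this PYTHONPATH.
// Per the report above, the YARN Client builds the python path from the
// submitter's sys.env, so a value set only this way was not picked up in
// cluster mode before the fix.
val conf = new SparkConf()
  .setAppName("pythonpath-example")
  .set("spark.yarn.appMasterEnv.PYTHONPATH", "./addon/python/")
{code}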
[jira] [Updated] (SPARK-24677) TaskSetManager not updating successfulTaskDurations for old stage attempts
[ https://issues.apache.org/jira/browse/SPARK-24677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-24677: -- Fix Version/s: 2.2.3 > TaskSetManager not updating successfulTaskDurations for old stage attempts > -- > > Key: SPARK-24677 > URL: https://issues.apache.org/jira/browse/SPARK-24677 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Critical > Fix For: 2.2.3, 2.4.0, 2.3.3 > > > When introducing SPARK-23433 , maybe cause stop sparkcontext. > {code:java} > ERROR Utils: uncaught error in thread task-scheduler-speculation, stopping > SparkContext > java.util.NoSuchElementException: MedianHeap is empty. > at org.apache.spark.util.collection.MedianHeap.median(MedianHeap.scala:83) > at > org.apache.spark.scheduler.TaskSetManager.checkSpeculatableTasks(TaskSetManager.scala:968) > at > org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:94) > at > org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:93) > at scala.collection.Iterator$class.foreach(Iterator.scala:742) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1194) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at org.apache.spark.scheduler.Pool.checkSpeculatableTasks(Pool.scala:93) > at > org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:94) > at > org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:93) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24853) Support Column type for withColumn and withColumnRenamed apis
nirav patel created SPARK-24853: --- Summary: Support Column type for withColumn and withColumnRenamed apis Key: SPARK-24853 URL: https://issues.apache.org/jira/browse/SPARK-24853 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.2 Reporter: nirav patel Can we add overloaded versions of withColumn and withColumnRenamed that accept a Column instead of a String? That way a fully qualified column name can be given when there are duplicate column names, e.g. if a join produces two columns with the same name and I want to rename one of them, I could do it with the new API. This would be similar to the drop API, which supports both String and Column. The proposed overloads: def withColumn(colName: Column, col: Column): DataFrame Returns a new Dataset by adding a column or replacing the existing column that has the same name. def withColumnRenamed(existingName: Column, newName: Column): DataFrame Returns a new Dataset with a column renamed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
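The overloads above are the reporter's proposal, not an existing API. For comparison, a sketch of how the ambiguity is handled today with the existing string-based API and dataset aliases; the data, column names and aliases are made up for illustration.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("rename-example").getOrCreate()
import spark.implicits._

// Two frames that both carry a column named "value".
val left  = Seq((1, "a")).toDF("id", "value").as("l")
val right = Seq((1, "b")).toDF("id", "value").as("r")

// Today's workaround: disambiguate through aliases and re-select, because
// withColumnRenamed only accepts the (ambiguous) string name.
val joined = left.join(right, Seq("id"))
  .select(col("id"), col("l.value").as("left_value"), col("r.value").as("right_value"))

joined.show()
{code}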
[jira] [Comment Edited] (SPARK-12449) Pushing down arbitrary logical plans to data sources
[ https://issues.apache.org/jira/browse/SPARK-12449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548140#comment-16548140 ] Kyle Prifogle edited comment on SPARK-12449 at 7/18/18 6:44 PM: What happened to this initiative? I came here trying to figure out why ".limit(10)" seemed to scan the entire table. was (Author: kprifogle1): What happened to this initiative? I came here trying to figure out why ".limit(10)" seemed to scan the entire table. Is slow down in some of this (seemingly critical) work an indication that the breaks have been put on open source spark and that databricks run time is the only future? > Pushing down arbitrary logical plans to data sources > > > Key: SPARK-12449 > URL: https://issues.apache.org/jira/browse/SPARK-12449 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Stephan Kessler >Priority: Major > Attachments: pushingDownLogicalPlans.pdf > > > With the help of the DataSource API we can pull data from external sources > for processing. Implementing interfaces such as {{PrunedFilteredScan}} allows > to push down filters and projects pruning unnecessary fields and rows > directly in the data source. > However, data sources such as SQL Engines are capable of doing even more > preprocessing, e.g., evaluating aggregates. This is beneficial because it > would reduce the amount of data transferred from the source to Spark. The > existing interfaces do not allow such kind of processing in the source. > We would propose to add a new interface {{CatalystSource}} that allows to > defer the processing of arbitrary logical plans to the data source. We have > already shown the details at the Spark Summit 2015 Europe > [https://spark-summit.org/eu-2015/events/the-pushdown-of-everything/] > I will add a design document explaining details. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24851) Map a Stage ID to it's Associated Job ID in UI
[ https://issues.apache.org/jira/browse/SPARK-24851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24851: Assignee: Apache Spark > Map a Stage ID to it's Associated Job ID in UI > -- > > Key: SPARK-24851 > URL: https://issues.apache.org/jira/browse/SPARK-24851 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0, 2.3.1 >Reporter: Parth Gandhi >Assignee: Apache Spark >Priority: Trivial > > It would be nice to have a field in Stage Page UI which would show mapping of > the current stage id to the job id's to which that stage belongs to. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24851) Map a Stage ID to it's Associated Job ID in UI
[ https://issues.apache.org/jira/browse/SPARK-24851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548236#comment-16548236 ] Apache Spark commented on SPARK-24851: -- User 'pgandhi999' has created a pull request for this issue: https://github.com/apache/spark/pull/21809 > Map a Stage ID to it's Associated Job ID in UI > -- > > Key: SPARK-24851 > URL: https://issues.apache.org/jira/browse/SPARK-24851 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0, 2.3.1 >Reporter: Parth Gandhi >Priority: Trivial > > It would be nice to have a field in Stage Page UI which would show mapping of > the current stage id to the job id's to which that stage belongs to. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24851) Map a Stage ID to it's Associated Job ID in UI
[ https://issues.apache.org/jira/browse/SPARK-24851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24851: Assignee: (was: Apache Spark) > Map a Stage ID to it's Associated Job ID in UI > -- > > Key: SPARK-24851 > URL: https://issues.apache.org/jira/browse/SPARK-24851 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0, 2.3.1 >Reporter: Parth Gandhi >Priority: Trivial > > It would be nice to have a field in Stage Page UI which would show mapping of > the current stage id to the job id's to which that stage belongs to. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24677) TaskSetManager not updating successfulTaskDurations for old stage attempts
[ https://issues.apache.org/jira/browse/SPARK-24677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-24677. --- Resolution: Fixed Fix Version/s: 2.4.0 2.3.3 > TaskSetManager not updating successfulTaskDurations for old stage attempts > -- > > Key: SPARK-24677 > URL: https://issues.apache.org/jira/browse/SPARK-24677 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Critical > Fix For: 2.3.3, 2.4.0 > > > When introducing SPARK-23433 , maybe cause stop sparkcontext. > {code:java} > ERROR Utils: uncaught error in thread task-scheduler-speculation, stopping > SparkContext > java.util.NoSuchElementException: MedianHeap is empty. > at org.apache.spark.util.collection.MedianHeap.median(MedianHeap.scala:83) > at > org.apache.spark.scheduler.TaskSetManager.checkSpeculatableTasks(TaskSetManager.scala:968) > at > org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:94) > at > org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:93) > at scala.collection.Iterator$class.foreach(Iterator.scala:742) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1194) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at org.apache.spark.scheduler.Pool.checkSpeculatableTasks(Pool.scala:93) > at > org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:94) > at > org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:93) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24677) TaskSetManager not updating successfulTaskDurations for old stage attempts
[ https://issues.apache.org/jira/browse/SPARK-24677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves reassigned SPARK-24677: - Assignee: dzcxzl > TaskSetManager not updating successfulTaskDurations for old stage attempts > -- > > Key: SPARK-24677 > URL: https://issues.apache.org/jira/browse/SPARK-24677 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Critical > > When introducing SPARK-23433 , maybe cause stop sparkcontext. > {code:java} > ERROR Utils: uncaught error in thread task-scheduler-speculation, stopping > SparkContext > java.util.NoSuchElementException: MedianHeap is empty. > at org.apache.spark.util.collection.MedianHeap.median(MedianHeap.scala:83) > at > org.apache.spark.scheduler.TaskSetManager.checkSpeculatableTasks(TaskSetManager.scala:968) > at > org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:94) > at > org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:93) > at scala.collection.Iterator$class.foreach(Iterator.scala:742) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1194) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at org.apache.spark.scheduler.Pool.checkSpeculatableTasks(Pool.scala:93) > at > org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:94) > at > org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:93) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24677) TaskSetManager not updating successfulTaskDurations for old stage attempts
[ https://issues.apache.org/jira/browse/SPARK-24677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-24677: -- Summary: TaskSetManager not updating successfulTaskDurations for old stage attempts (was: Avoid NoSuchElementException from MedianHeap) > TaskSetManager not updating successfulTaskDurations for old stage attempts > -- > > Key: SPARK-24677 > URL: https://issues.apache.org/jira/browse/SPARK-24677 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: dzcxzl >Priority: Critical > > When introducing SPARK-23433 , maybe cause stop sparkcontext. > {code:java} > ERROR Utils: uncaught error in thread task-scheduler-speculation, stopping > SparkContext > java.util.NoSuchElementException: MedianHeap is empty. > at org.apache.spark.util.collection.MedianHeap.median(MedianHeap.scala:83) > at > org.apache.spark.scheduler.TaskSetManager.checkSpeculatableTasks(TaskSetManager.scala:968) > at > org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:94) > at > org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:93) > at scala.collection.Iterator$class.foreach(Iterator.scala:742) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1194) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at org.apache.spark.scheduler.Pool.checkSpeculatableTasks(Pool.scala:93) > at > org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:94) > at > org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:93) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24677) Avoid NoSuchElementException from MedianHeap
[ https://issues.apache.org/jira/browse/SPARK-24677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548210#comment-16548210 ] Thomas Graves edited comment on SPARK-24677 at 7/18/18 6:22 PM: This is really that it isn't updating successfulTaskDurations. In this case one of the older stage attempts (that is a zombie) marked the task as successful but then the newest stage attempt checked to see if it needed to speculate was (Author: tgraves): In this case one of the older stage attempts (that is a zombie) marked the task as successful but then the newest stage attempt checked to see if it needed to speculate > Avoid NoSuchElementException from MedianHeap > > > Key: SPARK-24677 > URL: https://issues.apache.org/jira/browse/SPARK-24677 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: dzcxzl >Priority: Critical > > When introducing SPARK-23433 , maybe cause stop sparkcontext. > {code:java} > ERROR Utils: uncaught error in thread task-scheduler-speculation, stopping > SparkContext > java.util.NoSuchElementException: MedianHeap is empty. > at org.apache.spark.util.collection.MedianHeap.median(MedianHeap.scala:83) > at > org.apache.spark.scheduler.TaskSetManager.checkSpeculatableTasks(TaskSetManager.scala:968) > at > org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:94) > at > org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:93) > at scala.collection.Iterator$class.foreach(Iterator.scala:742) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1194) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at org.apache.spark.scheduler.Pool.checkSpeculatableTasks(Pool.scala:93) > at > org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:94) > at > org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:93) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24677) Avoid NoSuchElementException from MedianHeap
[ https://issues.apache.org/jira/browse/SPARK-24677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548210#comment-16548210 ] Thomas Graves commented on SPARK-24677: --- In this case one of the older stage attempts (that is a zombie) marked the task as successful but then the newest stage attempt checked to see if it needed to speculate > Avoid NoSuchElementException from MedianHeap > > > Key: SPARK-24677 > URL: https://issues.apache.org/jira/browse/SPARK-24677 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: dzcxzl >Priority: Critical > > When introducing SPARK-23433 , maybe cause stop sparkcontext. > {code:java} > ERROR Utils: uncaught error in thread task-scheduler-speculation, stopping > SparkContext > java.util.NoSuchElementException: MedianHeap is empty. > at org.apache.spark.util.collection.MedianHeap.median(MedianHeap.scala:83) > at > org.apache.spark.scheduler.TaskSetManager.checkSpeculatableTasks(TaskSetManager.scala:968) > at > org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:94) > at > org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:93) > at scala.collection.Iterator$class.foreach(Iterator.scala:742) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1194) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at org.apache.spark.scheduler.Pool.checkSpeculatableTasks(Pool.scala:93) > at > org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:94) > at > org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:93) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24852) Have spark.ml training use updated `Instrumentation` APIs.
[ https://issues.apache.org/jira/browse/SPARK-24852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548137#comment-16548137 ] Apache Spark commented on SPARK-24852: -- User 'MrBago' has created a pull request for this issue: https://github.com/apache/spark/pull/21799 > Have spark.ml training use updated `Instrumentation` APIs. > -- > > Key: SPARK-24852 > URL: https://issues.apache.org/jira/browse/SPARK-24852 > Project: Spark > Issue Type: Story > Components: ML >Affects Versions: 2.4.0 >Reporter: Bago Amirbekian >Priority: Major > > Port spark.ml code to use the new methods on the `Instrumentation` class and > remove the old methods & constructor. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12449) Pushing down arbitrary logical plans to data sources
[ https://issues.apache.org/jira/browse/SPARK-12449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548140#comment-16548140 ] Kyle Prifogle commented on SPARK-12449: --- What happened to this initiative? I came here trying to figure out why ".limit(10)" seemed to scan the entire table. Is slow down in some of this (seemingly critical) work an indication that the breaks have been put on open source spark and that databricks run time is the only future? > Pushing down arbitrary logical plans to data sources > > > Key: SPARK-12449 > URL: https://issues.apache.org/jira/browse/SPARK-12449 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Stephan Kessler >Priority: Major > Attachments: pushingDownLogicalPlans.pdf > > > With the help of the DataSource API we can pull data from external sources > for processing. Implementing interfaces such as {{PrunedFilteredScan}} allows > to push down filters and projects pruning unnecessary fields and rows > directly in the data source. > However, data sources such as SQL Engines are capable of doing even more > preprocessing, e.g., evaluating aggregates. This is beneficial because it > would reduce the amount of data transferred from the source to Spark. The > existing interfaces do not allow such kind of processing in the source. > We would propose to add a new interface {{CatalystSource}} that allows to > defer the processing of arbitrary logical plans to the data source. We have > already shown the details at the Spark Summit 2015 Europe > [https://spark-summit.org/eu-2015/events/the-pushdown-of-everything/] > I will add a design document explaining details. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24852) Have spark.ml training use updated `Instrumentation` APIs.
[ https://issues.apache.org/jira/browse/SPARK-24852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24852: Assignee: (was: Apache Spark) > Have spark.ml training use updated `Instrumentation` APIs. > -- > > Key: SPARK-24852 > URL: https://issues.apache.org/jira/browse/SPARK-24852 > Project: Spark > Issue Type: Story > Components: ML >Affects Versions: 2.4.0 >Reporter: Bago Amirbekian >Priority: Major > > Port spark.ml code to use the new methods on the `Instrumentation` class and > remove the old methods & constructor. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24852) Have spark.ml training use updated `Instrumentation` APIs.
[ https://issues.apache.org/jira/browse/SPARK-24852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24852: Assignee: Apache Spark > Have spark.ml training use updated `Instrumentation` APIs. > -- > > Key: SPARK-24852 > URL: https://issues.apache.org/jira/browse/SPARK-24852 > Project: Spark > Issue Type: Story > Components: ML >Affects Versions: 2.4.0 >Reporter: Bago Amirbekian >Assignee: Apache Spark >Priority: Major > > Port spark.ml code to use the new methods on the `Instrumentation` class and > remove the old methods & constructor. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24852) Have spark.ml training use updated `Instrumentation` APIs.
Bago Amirbekian created SPARK-24852: --- Summary: Have spark.ml training use updated `Instrumentation` APIs. Key: SPARK-24852 URL: https://issues.apache.org/jira/browse/SPARK-24852 Project: Spark Issue Type: Story Components: ML Affects Versions: 2.4.0 Reporter: Bago Amirbekian Port spark.ml code to use the new methods on the `Instrumentation` class and remove the old methods & constructor. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12126) JDBC datasource processes filters only commonly pushed down.
[ https://issues.apache.org/jira/browse/SPARK-12126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548135#comment-16548135 ] Kyle Prifogle commented on SPARK-12126: --- Whats the hold up on this? I've noticed that the PR has been closed. In the case of pushing down `limit` it seems fairly straightforward to modify the query to append a limit before executing it. > JDBC datasource processes filters only commonly pushed down. > > > Key: SPARK-12126 > URL: https://issues.apache.org/jira/browse/SPARK-12126 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Hyukjin Kwon >Priority: Major > > As suggested > [here|https://issues.apache.org/jira/browse/SPARK-9182?focusedCommentId=14955646&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14955646], > Currently JDBC datasource only processes the filters pushed down from > {{DataSourceStrategy}}. > Unlike ORC or Parquet, this can process pretty a lot of filters (for example, > a + b > 3) since it is just about string parsing. > As > [here|https://issues.apache.org/jira/browse/SPARK-9182?focusedCommentId=15031526&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15031526], > using {{CatalystScan}} trait might be one of solutions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
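Until a limit is pushed down automatically, the usual workaround is to hand it to the database yourself through the dbtable subquery, which the JDBC source already supports. The sketch below assumes a SparkSession named {{spark}} and the matching JDBC driver on the classpath; the URL, table and credentials are placeholders.

{code:scala}
// Workaround sketch: let the database apply the LIMIT by reading from a
// subquery instead of relying on .limit() being pushed down.
val limited = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/mydb")
  .option("dbtable", "(SELECT * FROM events LIMIT 10) AS t")
  .option("user", "username")
  .option("password", "password")
  .load()

limited.show()
{code}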
[jira] [Commented] (SPARK-21261) SparkSQL regexpExpressions example
[ https://issues.apache.org/jira/browse/SPARK-21261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548119#comment-16548119 ] Apache Spark commented on SPARK-21261: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/21808 > SparkSQL regexpExpressions example > --- > > Key: SPARK-21261 > URL: https://issues.apache.org/jira/browse/SPARK-21261 > Project: Spark > Issue Type: Documentation > Components: Examples >Affects Versions: 2.1.1 >Reporter: zhangxin >Priority: Major > > The follow execute result. > scala> spark.sql(""" select regexp_replace('100-200', '(\d+)', 'num') > """).show > +--+ > |regexp_replace(100-200, (d+), num)| > +--+ > | 100-200| > +--+ > scala> spark.sql(""" select regexp_replace('100-200', '(\\d+)', 'num') > """).show > +---+ > |regexp_replace(100-200, (\d+), num)| > +---+ > |num-num| > +---+ > Add Comment -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
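Restating the example above in an easier-to-read form: inside the SQL string literal a single backslash is consumed as an escape, so the pattern that reaches the regexp engine is {{(d+)}}; doubling the backslash keeps {{\d}} intact. Assumes a spark-shell session with a SparkSession named {{spark}} and default parser settings.

{code:scala}
// With a single backslash the SQL parser drops it, the pattern becomes '(d+)'
// and nothing matches, so the input comes back unchanged.
spark.sql("""SELECT regexp_replace('100-200', '(\d+)', 'num') AS r""").show(false)
// expected: 100-200

// Doubling the backslash leaves '\d' for the regexp engine, which matches digits.
spark.sql("""SELECT regexp_replace('100-200', '(\\d+)', 'num') AS r""").show(false)
// expected: num-num
{code}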
[jira] [Created] (SPARK-24851) Map a Stage ID to it's Associated Job ID in UI
Parth Gandhi created SPARK-24851: Summary: Map a Stage ID to it's Associated Job ID in UI Key: SPARK-24851 URL: https://issues.apache.org/jira/browse/SPARK-24851 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.3.1, 2.3.0 Reporter: Parth Gandhi It would be nice to have a field in the Stage page UI that shows the mapping from the current stage ID to the job IDs that the stage belongs to. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24536) Query with nonsensical LIMIT hits AssertionError
[ https://issues.apache.org/jira/browse/SPARK-24536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548087#comment-16548087 ] Apache Spark commented on SPARK-24536: -- User 'mauropalsgraaf' has created a pull request for this issue: https://github.com/apache/spark/pull/21807 > Query with nonsensical LIMIT hits AssertionError > > > Key: SPARK-24536 > URL: https://issues.apache.org/jira/browse/SPARK-24536 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Alexander Behm >Priority: Trivial > Labels: beginner, spree > > SELECT COUNT(1) FROM t LIMIT CAST(NULL AS INT) > fails in the QueryPlanner with: > {code} > java.lang.AssertionError: assertion failed: No plan for GlobalLimit null > {code} > I think this issue should be caught earlier during semantic analysis. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24536) Query with nonsensical LIMIT hits AssertionError
[ https://issues.apache.org/jira/browse/SPARK-24536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24536: Assignee: (was: Apache Spark) > Query with nonsensical LIMIT hits AssertionError > > > Key: SPARK-24536 > URL: https://issues.apache.org/jira/browse/SPARK-24536 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Alexander Behm >Priority: Trivial > Labels: beginner, spree > > SELECT COUNT(1) FROM t LIMIT CAST(NULL AS INT) > fails in the QueryPlanner with: > {code} > java.lang.AssertionError: assertion failed: No plan for GlobalLimit null > {code} > I think this issue should be caught earlier during semantic analysis. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24536) Query with nonsensical LIMIT hits AssertionError
[ https://issues.apache.org/jira/browse/SPARK-24536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24536: Assignee: Apache Spark > Query with nonsensical LIMIT hits AssertionError > > > Key: SPARK-24536 > URL: https://issues.apache.org/jira/browse/SPARK-24536 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Alexander Behm >Assignee: Apache Spark >Priority: Trivial > Labels: beginner, spree > > SELECT COUNT(1) FROM t LIMIT CAST(NULL AS INT) > fails in the QueryPlanner with: > {code} > java.lang.AssertionError: assertion failed: No plan for GlobalLimit null > {code} > I think this issue should be caught earlier during semantic analysis. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
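A sketch of the kind of up-front check the reporter suggests, purely illustrative and not Spark's actual analysis code; it assumes the limit expression has already been folded to a plain value.

{code:scala}
// Illustrative only: reject a LIMIT whose (already folded) value is null or
// negative during analysis, instead of failing later in the planner.
def validateLimitValue(folded: Any): Unit = folded match {
  case n: Int if n >= 0 => // ok
  case null =>
    throw new IllegalArgumentException("The limit expression must not be null")
  case n: Int =>
    throw new IllegalArgumentException(s"The limit expression must be non-negative, got $n")
  case other =>
    throw new IllegalArgumentException(s"The limit expression must be an integer, got $other")
}
{code}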
[jira] [Commented] (SPARK-24846) Stabilize expression cannonicalization
[ https://issues.apache.org/jira/browse/SPARK-24846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548041#comment-16548041 ] Apache Spark commented on SPARK-24846: -- User 'gvr' has created a pull request for this issue: https://github.com/apache/spark/pull/21806 > Stabilize expression cannonicalization > -- > > Key: SPARK-24846 > URL: https://issues.apache.org/jira/browse/SPARK-24846 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Herman van Hovell >Priority: Major > Labels: spree > > Spark plan canonicalization is can be non-deterministic between different > versions of spark due to the fact that {{ExprId}} uses a UUID. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24846) Stabilize expression cannonicalization
[ https://issues.apache.org/jira/browse/SPARK-24846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24846: Assignee: Apache Spark > Stabilize expression cannonicalization > -- > > Key: SPARK-24846 > URL: https://issues.apache.org/jira/browse/SPARK-24846 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Herman van Hovell >Assignee: Apache Spark >Priority: Major > Labels: spree > > Spark plan canonicalization is can be non-deterministic between different > versions of spark due to the fact that {{ExprId}} uses a UUID. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24846) Stabilize expression cannonicalization
[ https://issues.apache.org/jira/browse/SPARK-24846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24846: Assignee: (was: Apache Spark) > Stabilize expression cannonicalization > -- > > Key: SPARK-24846 > URL: https://issues.apache.org/jira/browse/SPARK-24846 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Herman van Hovell >Priority: Major > Labels: spree > > Spark plan canonicalization is can be non-deterministic between different > versions of spark due to the fact that {{ExprId}} uses a UUID. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24850) Query plan string representation grows exponentially on queries with recursive cached datasets
[ https://issues.apache.org/jira/browse/SPARK-24850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24850: Assignee: (was: Apache Spark) > Query plan string representation grows exponentially on queries with > recursive cached datasets > -- > > Key: SPARK-24850 > URL: https://issues.apache.org/jira/browse/SPARK-24850 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Onur Satici >Priority: Major > > As of [https://github.com/apache/spark/pull/21018], InMemoryRelation includes > its cacheBuilder when logging query plans. This CachedRDDBuilder includes the > cachedPlan, so calling treeString on InMemoryRelation will log the cachedPlan > in the cacheBuilder. > Given the sample dataset: > {code:java} > $ cat test.csv > A,B > 0,0{code} > If the query plan has multiple cached datasets that depend on each other: > {code:java} > var df_cached = spark.read.format("csv").option("header", > "true").load("test.csv").cache() > 0 to 1 foreach { _ => > df_cached = df_cached.join(spark.read.format("csv").option("header", > "true").load("test.csv"), "A").cache() > } > df_cached.explain > {code} > results in: > {code:java} > == Physical Plan == > InMemoryTableScan [A#10, B#11, B#35, B#87] > +- InMemoryRelation [A#10, B#11, B#35, B#87], > CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 > replicas),*(2) Project [A#10, B#11, B#35, B#87] > +- *(2) BroadcastHashJoin [A#10], [A#86], Inner, BuildRight > :- *(2) Filter isnotnull(A#10) > : +- InMemoryTableScan [A#10, B#11, B#35], [isnotnull(A#10)] > : +- InMemoryRelation [A#10, B#11, B#35], > CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 > replicas),*(2) Project [A#10, B#11, B#35] > +- *(2) BroadcastHashJoin [A#10], [A#34], Inner, BuildRight > :- *(2) Filter isnotnull(A#10) > : +- InMemoryTableScan [A#10, B#11], [isnotnull(A#10)] > : +- InMemoryRelation [A#10, B#11], > CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 > replicas),*(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, > Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], > PushedFilters: [], ReadSchema: struct > ,None) > : +- *(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: > InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], > ReadSchema: struct > +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, > false])) > +- *(1) Filter isnotnull(A#34) > +- InMemoryTableScan [A#34, B#35], [isnotnull(A#34)] > +- InMemoryRelation [A#34, B#35], > CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 > replicas),*(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, > Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], > PushedFilters: [], ReadSchema: struct > ,None) > +- *(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: > InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], > ReadSchema: struct > ,None) > : +- *(2) Project [A#10, B#11, B#35] > : +- *(2) BroadcastHashJoin [A#10], [A#34], Inner, BuildRight > : :- *(2) Filter isnotnull(A#10) > : : +- InMemoryTableScan [A#10, B#11], [isnotnull(A#10)] > : : +- InMemoryRelation [A#10, B#11], > CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 > replicas),*(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, > Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], > PushedFilters: [], ReadSchema: struct > ,None) > : : +- *(1) FileScan csv 
[A#10,B#11] Batched: false, Format: CSV, Location: > InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], > ReadSchema: struct > : +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, > false])) > : +- *(1) Filter isnotnull(A#34) > : +- InMemoryTableScan [A#34, B#35], [isnotnull(A#34)] > : +- InMemoryRelation [A#34, B#35], > CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 > replicas),*(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, > Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], > PushedFilters: [], ReadSchema: struct > ,None) > : +- *(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: > InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], > ReadSchema: struct > +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, > false])) > +- *(1) Filter isnotnull(A#86) > +- InMemoryTableScan [A#86, B#87], [isnotnull(A#86)] > +- InMemoryRelation [A#86, B#87], > CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 > replicas),*(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV,
[jira] [Commented] (SPARK-24850) Query plan string representation grows exponentially on queries with recursive cached datasets
[ https://issues.apache.org/jira/browse/SPARK-24850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16547996#comment-16547996 ] Apache Spark commented on SPARK-24850: -- User 'onursatici' has created a pull request for this issue: https://github.com/apache/spark/pull/21805 > Query plan string representation grows exponentially on queries with > recursive cached datasets > -- > > Key: SPARK-24850 > URL: https://issues.apache.org/jira/browse/SPARK-24850 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Onur Satici >Priority: Major > > As of [https://github.com/apache/spark/pull/21018], InMemoryRelation includes > its cacheBuilder when logging query plans. This CachedRDDBuilder includes the > cachedPlan, so calling treeString on InMemoryRelation will log the cachedPlan > in the cacheBuilder. > Given the sample dataset: > {code:java} > $ cat test.csv > A,B > 0,0{code} > If the query plan has multiple cached datasets that depend on each other: > {code:java} > var df_cached = spark.read.format("csv").option("header", > "true").load("test.csv").cache() > 0 to 1 foreach { _ => > df_cached = df_cached.join(spark.read.format("csv").option("header", > "true").load("test.csv"), "A").cache() > } > df_cached.explain > {code} > results in: > {code:java} > == Physical Plan == > InMemoryTableScan [A#10, B#11, B#35, B#87] > +- InMemoryRelation [A#10, B#11, B#35, B#87], > CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 > replicas),*(2) Project [A#10, B#11, B#35, B#87] > +- *(2) BroadcastHashJoin [A#10], [A#86], Inner, BuildRight > :- *(2) Filter isnotnull(A#10) > : +- InMemoryTableScan [A#10, B#11, B#35], [isnotnull(A#10)] > : +- InMemoryRelation [A#10, B#11, B#35], > CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 > replicas),*(2) Project [A#10, B#11, B#35] > +- *(2) BroadcastHashJoin [A#10], [A#34], Inner, BuildRight > :- *(2) Filter isnotnull(A#10) > : +- InMemoryTableScan [A#10, B#11], [isnotnull(A#10)] > : +- InMemoryRelation [A#10, B#11], > CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 > replicas),*(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, > Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], > PushedFilters: [], ReadSchema: struct > ,None) > : +- *(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: > InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], > ReadSchema: struct > +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, > false])) > +- *(1) Filter isnotnull(A#34) > +- InMemoryTableScan [A#34, B#35], [isnotnull(A#34)] > +- InMemoryRelation [A#34, B#35], > CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 > replicas),*(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, > Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], > PushedFilters: [], ReadSchema: struct > ,None) > +- *(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: > InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], > ReadSchema: struct > ,None) > : +- *(2) Project [A#10, B#11, B#35] > : +- *(2) BroadcastHashJoin [A#10], [A#34], Inner, BuildRight > : :- *(2) Filter isnotnull(A#10) > : : +- InMemoryTableScan [A#10, B#11], [isnotnull(A#10)] > : : +- InMemoryRelation [A#10, B#11], > CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 > replicas),*(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, > Location: 
InMemoryFileIndex[file:test.csv], PartitionFilters: [], > PushedFilters: [], ReadSchema: struct > ,None) > : : +- *(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: > InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], > ReadSchema: struct > : +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, > false])) > : +- *(1) Filter isnotnull(A#34) > : +- InMemoryTableScan [A#34, B#35], [isnotnull(A#34)] > : +- InMemoryRelation [A#34, B#35], > CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 > replicas),*(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, > Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], > PushedFilters: [], ReadSchema: struct > ,None) > : +- *(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: > InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], > ReadSchema: struct > +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, > false])) > +- *(1) Filter isnotnull(A#86) > +- InMemoryTableScan [A#86, B#87], [isnotnull(A#86)] > +- InMemoryRelation [A#86, B#87], > CachedRDDBuilder(true,1
[jira] [Assigned] (SPARK-24850) Query plan string representation grows exponentially on queries with recursive cached datasets
[ https://issues.apache.org/jira/browse/SPARK-24850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24850: Assignee: Apache Spark > Query plan string representation grows exponentially on queries with > recursive cached datasets > -- > > Key: SPARK-24850 > URL: https://issues.apache.org/jira/browse/SPARK-24850 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Onur Satici >Assignee: Apache Spark >Priority: Major > > As of [https://github.com/apache/spark/pull/21018], InMemoryRelation includes > its cacheBuilder when logging query plans. This CachedRDDBuilder includes the > cachedPlan, so calling treeString on InMemoryRelation will log the cachedPlan > in the cacheBuilder. > Given the sample dataset: > {code:java} > $ cat test.csv > A,B > 0,0{code} > If the query plan has multiple cached datasets that depend on each other: > {code:java} > var df_cached = spark.read.format("csv").option("header", > "true").load("test.csv").cache() > 0 to 1 foreach { _ => > df_cached = df_cached.join(spark.read.format("csv").option("header", > "true").load("test.csv"), "A").cache() > } > df_cached.explain > {code} > results in: > {code:java} > == Physical Plan == > InMemoryTableScan [A#10, B#11, B#35, B#87] > +- InMemoryRelation [A#10, B#11, B#35, B#87], > CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 > replicas),*(2) Project [A#10, B#11, B#35, B#87] > +- *(2) BroadcastHashJoin [A#10], [A#86], Inner, BuildRight > :- *(2) Filter isnotnull(A#10) > : +- InMemoryTableScan [A#10, B#11, B#35], [isnotnull(A#10)] > : +- InMemoryRelation [A#10, B#11, B#35], > CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 > replicas),*(2) Project [A#10, B#11, B#35] > +- *(2) BroadcastHashJoin [A#10], [A#34], Inner, BuildRight > :- *(2) Filter isnotnull(A#10) > : +- InMemoryTableScan [A#10, B#11], [isnotnull(A#10)] > : +- InMemoryRelation [A#10, B#11], > CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 > replicas),*(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, > Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], > PushedFilters: [], ReadSchema: struct > ,None) > : +- *(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: > InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], > ReadSchema: struct > +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, > false])) > +- *(1) Filter isnotnull(A#34) > +- InMemoryTableScan [A#34, B#35], [isnotnull(A#34)] > +- InMemoryRelation [A#34, B#35], > CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 > replicas),*(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, > Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], > PushedFilters: [], ReadSchema: struct > ,None) > +- *(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: > InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], > ReadSchema: struct > ,None) > : +- *(2) Project [A#10, B#11, B#35] > : +- *(2) BroadcastHashJoin [A#10], [A#34], Inner, BuildRight > : :- *(2) Filter isnotnull(A#10) > : : +- InMemoryTableScan [A#10, B#11], [isnotnull(A#10)] > : : +- InMemoryRelation [A#10, B#11], > CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 > replicas),*(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, > Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], > PushedFilters: [], ReadSchema: struct > ,None) > : : +- *(1) 
FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: > InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], > ReadSchema: struct > : +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, > false])) > : +- *(1) Filter isnotnull(A#34) > : +- InMemoryTableScan [A#34, B#35], [isnotnull(A#34)] > : +- InMemoryRelation [A#34, B#35], > CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 > replicas),*(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, > Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], > PushedFilters: [], ReadSchema: struct > ,None) > : +- *(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: > InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], > ReadSchema: struct > +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, > false])) > +- *(1) Filter isnotnull(A#86) > +- InMemoryTableScan [A#86, B#87], [isnotnull(A#86)] > +- InMemoryRelation [A#86, B#87], > CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 > replicas),*(1) FileScan csv [A#10,B#11] Batch
[jira] [Created] (SPARK-24850) Query plan string representation grows exponentially on queries with recursive cached datasets
Onur Satici created SPARK-24850: --- Summary: Query plan string representation grows exponentially on queries with recursive cached datasets Key: SPARK-24850 URL: https://issues.apache.org/jira/browse/SPARK-24850 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Onur Satici As of [https://github.com/apache/spark/pull/21018], InMemoryRelation includes its cacheBuilder when logging query plans. This CachedRDDBuilder includes the cachedPlan, so calling treeString on InMemoryRelation will log the cachedPlan in the cacheBuilder. Given the sample dataset: {code:java} $ cat test.csv A,B 0,0{code} If the query plan has multiple cached datasets that depend on each other: {code:java} var df_cached = spark.read.format("csv").option("header", "true").load("test.csv").cache() 0 to 1 foreach { _ => df_cached = df_cached.join(spark.read.format("csv").option("header", "true").load("test.csv"), "A").cache() } df_cached.explain {code} results in: {code:java} == Physical Plan == InMemoryTableScan [A#10, B#11, B#35, B#87] +- InMemoryRelation [A#10, B#11, B#35, B#87], CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 replicas),*(2) Project [A#10, B#11, B#35, B#87] +- *(2) BroadcastHashJoin [A#10], [A#86], Inner, BuildRight :- *(2) Filter isnotnull(A#10) : +- InMemoryTableScan [A#10, B#11, B#35], [isnotnull(A#10)] : +- InMemoryRelation [A#10, B#11, B#35], CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 replicas),*(2) Project [A#10, B#11, B#35] +- *(2) BroadcastHashJoin [A#10], [A#34], Inner, BuildRight :- *(2) Filter isnotnull(A#10) : +- InMemoryTableScan [A#10, B#11], [isnotnull(A#10)] : +- InMemoryRelation [A#10, B#11], CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 replicas),*(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct ,None) : +- *(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, false])) +- *(1) Filter isnotnull(A#34) +- InMemoryTableScan [A#34, B#35], [isnotnull(A#34)] +- InMemoryRelation [A#34, B#35], CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 replicas),*(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct ,None) +- *(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct ,None) : +- *(2) Project [A#10, B#11, B#35] : +- *(2) BroadcastHashJoin [A#10], [A#34], Inner, BuildRight : :- *(2) Filter isnotnull(A#10) : : +- InMemoryTableScan [A#10, B#11], [isnotnull(A#10)] : : +- InMemoryRelation [A#10, B#11], CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 replicas),*(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct ,None) : : +- *(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct : +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, false])) : +- *(1) Filter isnotnull(A#34) : +- InMemoryTableScan [A#34, B#35], [isnotnull(A#34)] : +- 
InMemoryRelation [A#34, B#35], CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 replicas),*(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct ,None) : +- *(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, false])) +- *(1) Filter isnotnull(A#86) +- InMemoryTableScan [A#86, B#87], [isnotnull(A#86)] +- InMemoryRelation [A#86, B#87], CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 replicas),*(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct ,None) +- *(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct ,None) +- *(2) Project [A#10, B#11, B#35, B#87] +- *(2) BroadcastHashJoin [A#10], [A#86], Inner, BuildRight :- *(2) Filter isnotnull(A#10) : +- InMemoryTableScan [A#10, B#11, B#35], [isnotnull(
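For readers who want to observe the growth reported in SPARK-24850 without wading through the plan dump above, the following sketch prints how quickly the physical plan's string representation grows as self-referencing cache layers are added. It assumes a local SparkSession and the same two-column test.csv as the report; the loop depth and the use of queryExecution.executedPlan.treeString are this sketch's own choices, not taken from the ticket.

{code:java}
import org.apache.spark.sql.SparkSession

// Sketch: measure how the plan string grows as cached datasets are layered
// on top of each other (illustrating the behavior described in SPARK-24850).
object PlanStringGrowth {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("SPARK-24850-sketch").getOrCreate()
    var df = spark.read.format("csv").option("header", "true").load("test.csv").cache()
    (1 to 3).foreach { i =>
      df = df.join(spark.read.format("csv").option("header", "true").load("test.csv"), "A").cache()
      // treeString is what explain() ultimately prints; its length should jump
      // sharply per iteration if the cachedPlan is logged recursively.
      println(s"iteration $i: plan string length = ${df.queryExecution.executedPlan.treeString.length}")
    }
    spark.stop()
  }
}
{code}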
[jira] [Commented] (SPARK-24268) DataType in error messages are not coherent
[ https://issues.apache.org/jira/browse/SPARK-24268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16547939#comment-16547939 ] Apache Spark commented on SPARK-24268: -- User 'mgaido91' has created a pull request for this issue: https://github.com/apache/spark/pull/21804 > DataType in error messages are not coherent > --- > > Key: SPARK-24268 > URL: https://issues.apache.org/jira/browse/SPARK-24268 > Project: Spark > Issue Type: Improvement > Components: ML, SQL >Affects Versions: 2.3.0 >Reporter: Marco Gaido >Assignee: Marco Gaido >Priority: Minor > > In SPARK-22893 there was a tentative to unify the way dataTypes are reported > in error messages. There, we decided to use always {{dataType.simpleString}}. > Unfortunately, we missed many places where this still needed to be fixed. > Moreover, it turns out that the right method to use is not {{simpleString}}, > but we should use {{catalogString}} instead (for further details please check > the discussion in the PR https://github.com/apache/spark/pull/21321). > So we should update all the missing places in order to provide error messages > coherently throughout the project. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24849) Convert StructType to DDL string
[ https://issues.apache.org/jira/browse/SPARK-24849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16547919#comment-16547919 ] Apache Spark commented on SPARK-24849: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/21803 > Convert StructType to DDL string > > > Key: SPARK-24849 > URL: https://issues.apache.org/jira/browse/SPARK-24849 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Priority: Minor > > Need to add new methods which should convert a value of StructType to a > schema in DDL format . It should be possible to use the former string in new > table creation by just copy-pasting of new method results. The existing > methods simpleString(), catalogString() and sql() put ':' between top level > field name and its type, and wrap by the *struct* word > {code} > ds.schema.catalogString > struct {code} > Output of new method should be > {code} > metaData struct {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24849) Convert StructType to DDL string
[ https://issues.apache.org/jira/browse/SPARK-24849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24849: Assignee: Apache Spark > Convert StructType to DDL string > > > Key: SPARK-24849 > URL: https://issues.apache.org/jira/browse/SPARK-24849 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Minor > > Need to add new methods which should convert a value of StructType to a > schema in DDL format . It should be possible to use the former string in new > table creation by just copy-pasting of new method results. The existing > methods simpleString(), catalogString() and sql() put ':' between top level > field name and its type, and wrap by the *struct* word > {code} > ds.schema.catalogString > struct {code} > Output of new method should be > {code} > metaData struct {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24849) Convert StructType to DDL string
[ https://issues.apache.org/jira/browse/SPARK-24849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24849: Assignee: (was: Apache Spark) > Convert StructType to DDL string > > > Key: SPARK-24849 > URL: https://issues.apache.org/jira/browse/SPARK-24849 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Priority: Minor > > Need to add new methods which should convert a value of StructType to a > schema in DDL format . It should be possible to use the former string in new > table creation by just copy-pasting of new method results. The existing > methods simpleString(), catalogString() and sql() put ':' between top level > field name and its type, and wrap by the *struct* word > {code} > ds.schema.catalogString > struct {code} > Output of new method should be > {code} > metaData struct {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24628) Typos of the example code in docs/mllib-data-types.md
[ https://issues.apache.org/jira/browse/SPARK-24628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-24628: - Assignee: Weizhe Huang > Typos of the example code in docs/mllib-data-types.md > - > > Key: SPARK-24628 > URL: https://issues.apache.org/jira/browse/SPARK-24628 > Project: Spark > Issue Type: Documentation > Components: MLlib >Affects Versions: 2.3.1 >Reporter: Weizhe Huang >Assignee: Weizhe Huang >Priority: Minor > Fix For: 2.4.0 > > Original Estimate: 10m > Remaining Estimate: 10m > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24628) Typos of the example code in docs/mllib-data-types.md
[ https://issues.apache.org/jira/browse/SPARK-24628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-24628. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21612 [https://github.com/apache/spark/pull/21612] > Typos of the example code in docs/mllib-data-types.md > - > > Key: SPARK-24628 > URL: https://issues.apache.org/jira/browse/SPARK-24628 > Project: Spark > Issue Type: Documentation > Components: MLlib >Affects Versions: 2.3.1 >Reporter: Weizhe Huang >Assignee: Weizhe Huang >Priority: Minor > Fix For: 2.4.0 > > Original Estimate: 10m > Remaining Estimate: 10m > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24093) Make some fields of KafkaStreamWriter/InternalRowMicroBatchWriter visible to outside of the classes
[ https://issues.apache.org/jira/browse/SPARK-24093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-24093. --- Resolution: Won't Fix > Make some fields of KafkaStreamWriter/InternalRowMicroBatchWriter visible to > outside of the classes > --- > > Key: SPARK-24093 > URL: https://issues.apache.org/jira/browse/SPARK-24093 > Project: Spark > Issue Type: Wish > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Weiqing Yang >Priority: Minor > > To allow third parties to obtain information about the streaming writer, for > example the "writer" and the "topic" that streaming data is written into, this > jira proposes making the relevant fields of KafkaStreamWriter and > InternalRowMicroBatchWriter visible outside of those classes. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24804) There are duplicate words in the title in the DatasetSuite
[ https://issues.apache.org/jira/browse/SPARK-24804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-24804. --- Resolution: Fixed Assignee: hantiantian Fix Version/s: 2.4.0 This is too trivial for a Jira [~hantiantian], but OK for a first contribution. Resolved by https://github.com/apache/spark/pull/21767 > There are duplicate words in the title in the DatasetSuite > -- > > Key: SPARK-24804 > URL: https://issues.apache.org/jira/browse/SPARK-24804 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2 >Reporter: hantiantian >Assignee: hantiantian >Priority: Trivial > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24849) Convert StructType to DDL string
[ https://issues.apache.org/jira/browse/SPARK-24849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16547906#comment-16547906 ] Takeshi Yamamuro commented on SPARK-24849: -- What is this new func used for? Is this the sub-ticket of another work? > Convert StructType to DDL string > > > Key: SPARK-24849 > URL: https://issues.apache.org/jira/browse/SPARK-24849 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Priority: Minor > > Need to add new methods which should convert a value of StructType to a > schema in DDL format . It should be possible to use the former string in new > table creation by just copy-pasting of new method results. The existing > methods simpleString(), catalogString() and sql() put ':' between top level > field name and its type, and wrap by the *struct* word > {code} > ds.schema.catalogString > struct {code} > Output of new method should be > {code} > metaData struct {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark
[ https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16547903#comment-16547903 ] Thomas Graves commented on SPARK-24615: --- Did the design doc permissions change? I can't seem to access it now. A few overall concerns. We are now making accelerator configurations available per stage, but what about CPU and memory? If we are going to start making things configurable at the stage/RDD level, it would be nice to be consistent; people have asked for this ability in the past. What about the case where, to run some ML algorithm, you would want machines of different types? For instance, TensorFlow with a parameter server might want GPU nodes for the workers, but the parameter server would just need a CPU. This would also apply to the barrier scheduler, so I might cross-post there. > Accelerator-aware task scheduling for Spark > --- > > Key: SPARK-24615 > URL: https://issues.apache.org/jira/browse/SPARK-24615 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Major > Labels: Hydrogen, SPIP > > In the machine learning area, accelerator cards (GPU, FPGA, TPU) are > predominant compared to CPUs. To make the current Spark architecture work > with accelerator cards, Spark itself should understand the existence of > accelerators and know how to schedule tasks onto the executors that are > equipped with accelerators. > Spark's current scheduler schedules tasks based on the locality of the data > plus the availability of CPUs. This introduces some problems when scheduling > tasks that require accelerators. > # CPU cores usually outnumber accelerators on a node, so using CPU cores > to schedule accelerator-required tasks introduces a mismatch. > # In a cluster, we can always assume that every node has CPUs, but this is > not true of accelerator cards. > # The existence of heterogeneous tasks (accelerator-required or not) > requires the scheduler to schedule tasks in a smarter way. > So here we propose to improve the current scheduler to support heterogeneous > tasks (accelerator-required or not). This can be part of the work on Project > Hydrogen. > Details are attached in the Google doc. It doesn't cover all the implementation > details, just highlights the parts that should be changed. > > CC [~yanboliang] [~merlintang] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24796) Support GROUPED_AGG_PANDAS_UDF in Pivot
[ https://issues.apache.org/jira/browse/SPARK-24796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16547901#comment-16547901 ] Xiao Li commented on SPARK-24796: - [~icexelloss] Thank you! > Support GROUPED_AGG_PANDAS_UDF in Pivot > --- > > Key: SPARK-24796 > URL: https://issues.apache.org/jira/browse/SPARK-24796 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 2.4.0 >Reporter: Xiao Li >Priority: Major > > Currently, Grouped AGG PandasUDF is not supported in Pivot. It is nice to > support it. > {code} > # create input dataframe > from pyspark.sql import Row > data = [ > Row(id=123, total=200.0, qty=3, name='item1'), > Row(id=124, total=1500.0, qty=1, name='item2'), > Row(id=125, total=203.5, qty=2, name='item3'), > Row(id=126, total=200.0, qty=500, name='item1'), > ] > df = spark.createDataFrame(data) > from pyspark.sql.functions import pandas_udf, PandasUDFType > @pandas_udf('double', PandasUDFType.GROUPED_AGG) > def pandas_avg(v): >return v.mean() > from pyspark.sql.functions import col, sum > > applied_df = > df.groupby('id').pivot('name').agg(pandas_avg('total').alias('mean')) > applied_df.show() > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24849) Convert StructType to DDL string
[ https://issues.apache.org/jira/browse/SPARK-24849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16547876#comment-16547876 ] Maxim Gekk commented on SPARK-24849: I am working on the ticket. > Convert StructType to DDL string > > > Key: SPARK-24849 > URL: https://issues.apache.org/jira/browse/SPARK-24849 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Priority: Minor > > Need to add new methods which should convert a value of StructType to a > schema in DDL format . It should be possible to use the former string in new > table creation by just copy-pasting of new method results. The existing > methods simpleString(), catalogString() and sql() put ':' between top level > field name and its type, and wrap by the *struct* word > {code} > ds.schema.catalogString > struct {code} > Output of new method should be > {code} > metaData struct {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24849) Convert StructType to DDL string
Maxim Gekk created SPARK-24849: -- Summary: Convert StructType to DDL string Key: SPARK-24849 URL: https://issues.apache.org/jira/browse/SPARK-24849 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.1 Reporter: Maxim Gekk We need to add new methods which convert a value of StructType to a schema in DDL format. It should be possible to use the resulting string in new table creation by simply copy-pasting the output of the new methods. The existing methods simpleString(), catalogString() and sql() put ':' between a top-level field name and its type, and wrap the result in the *struct* word {code} ds.schema.catalogString struct {code} Output of the new method should be {code} metaData struct {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
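To make the requested output concrete, here is a minimal sketch of the kind of StructType-to-DDL conversion SPARK-24849 asks for. The helper name toDDL and the sample schema are this sketch's own inventions; it simply renders each top-level field as "name type" using catalogString for the (possibly nested) field type, and a real implementation may differ, for example in how field names are quoted.

{code:java}
import org.apache.spark.sql.types._

// Illustrative StructType -> DDL-style string conversion (not Spark's implementation).
object DdlSketch {
  def toDDL(schema: StructType): String =
    schema.fields
      .map(f => s"${f.name} ${f.dataType.catalogString}") // "name type" per top-level field
      .mkString(", ")

  def main(args: Array[String]): Unit = {
    val schema = StructType(Seq(
      StructField("metaData", StructType(Seq(
        StructField("eventId", StringType),
        StructField("ts", TimestampType))))))
    println(schema.catalogString) // struct<metaData:struct<eventId:string,ts:timestamp>>
    println(toDDL(schema))        // metaData struct<eventId:string,ts:timestamp>
  }
}
{code}

The second println shows the shape the ticket wants: no wrapping struct word and a space instead of ':' at the top level, so the string can be pasted directly into a CREATE TABLE column list.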
[jira] [Commented] (SPARK-24295) Purge Structured streaming FileStreamSinkLog metadata compact file data.
[ https://issues.apache.org/jira/browse/SPARK-24295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16547873#comment-16547873 ] Li Yuanjian commented on SPARK-24295: - Thanks for your detailed explanation. You can check SPARK-17604; it seems to describe the same requirement of purging aged compact files. The small difference is that we need the purge logic in FileStreamSinkLog while that jira covers the source-side metadata log, but I think the strategy can be reused. Also cc the original author [~jerryshao2015] of SPARK-17604. > Purge Structured streaming FileStreamSinkLog metadata compact file data. > > > Key: SPARK-24295 > URL: https://issues.apache.org/jira/browse/SPARK-24295 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Iqbal Singh >Priority: Major > > FileStreamSinkLog metadata logs are concatenated into a single compact file > after a defined compact interval. > For long-running jobs, the compact file can grow to tens of GBs, causing > slowness when reading data from the FileStreamSinkLog dir, as Spark defaults > to the "_spark_metadata" dir for the read. > We need functionality to purge the compact file. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24844) spark REST API need to add ipFilter
[ https://issues.apache.org/jira/browse/SPARK-24844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-24844: - Priority: Minor (was: Blocker) > spark REST API need to add ipFilter > --- > > Key: SPARK-24844 > URL: https://issues.apache.org/jira/browse/SPARK-24844 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.1 > Environment: all server >Reporter: daijiacheng >Priority: Minor > > Spark has a hidden REST API which handles application submission, status > checking and cancellation. However, it does not allow restricting access by > IP, so when I enable this feature my server may be attacked. An ipFilter is > needed to filter out unwanted IPs -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24844) spark REST API need to add ipFilter
[ https://issues.apache.org/jira/browse/SPARK-24844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16547831#comment-16547831 ] Takeshi Yamamuro commented on SPARK-24844: -- The 'Blocker' priority tag is reserved for committers. > spark REST API need to add ipFilter > --- > > Key: SPARK-24844 > URL: https://issues.apache.org/jira/browse/SPARK-24844 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.1 > Environment: all server >Reporter: daijiacheng >Priority: Minor > > Spark has a hidden REST API which handles application submission, status > checking and cancellation. However, it does not allow restricting access by > IP, so when I enable this feature my server may be attacked. An ipFilter is > needed to filter out unwanted IPs -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23928) High-order function: shuffle(x) → array
[ https://issues.apache.org/jira/browse/SPARK-23928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16547781#comment-16547781 ] Apache Spark commented on SPARK-23928: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/21802 > High-order function: shuffle(x) → array > --- > > Key: SPARK-23928 > URL: https://issues.apache.org/jira/browse/SPARK-23928 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Generate a random permutation of the given array x. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24843) Spark2 job (in cluster mode) is unable to execute steps in HBase (error# java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/CompatibilityFactory)
[ https://issues.apache.org/jira/browse/SPARK-24843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16547774#comment-16547774 ] Manish commented on SPARK-24843: Thanks Wang. I am setting it using the export command before running spark2-submit. It works perfectly fine in client mode but not in cluster mode. Any leads would be very helpful. {color:#205081}export HADOOP_CONF_DIR=/etc/hadoop/conf:/etc/hive/conf:/etc/hbase/conf:/opt/cloudera/parcels/CDH/lib/hbase/lib/hbase-common-1.2.0-cdh5.11.1.jar:/home/svc-cop-realtime-d/scala1/jar/lib/hbase-rdd_2.11-0.8.0.jar:/opt/cloudera/parcels/CDH/lib/hbase/lib/hbase-hadoop2-compat-1.2.0-cdh5.11.1.jar{color} > Spark2 job (in cluster mode) is unable to execute steps in HBase (error# > java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/CompatibilityFactory) > -- > > Key: SPARK-24843 > URL: https://issues.apache.org/jira/browse/SPARK-24843 > Project: Spark > Issue Type: Bug > Components: Build, Java API >Affects Versions: 2.1.0 >Reporter: Manish >Priority: Major > > I am running a Spark2 streaming job that does processing in HBase. It works > perfectly fine with client deploy mode but does not work with cluster deploy > mode. Below is the error message: > |{color:#ff}_User class threw exception: java.lang.NoClassDefFoundError: > org/apache/hadoop/hbase/CompatibilityFactory_{color}| -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24848) When a stage fails onStageCompleted is called before onTaskEnd
[ https://issues.apache.org/jira/browse/SPARK-24848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yavgeni Hotimsky updated SPARK-24848: - Description: It seems that when a stage fails because one of its tasks failed too many times, the onStageCompleted callback of the SparkListener is called before the onTaskEnd listener for the failing task. We're using structured streaming in this case. We noticed this because we built a listener to track the precise number of active tasks to be exported as a metric and were using the stage callback to maintain a map from stage ids to some metadata extracted from the jobGroupId. The onStageCompleted listener was removing from the map to prevent unbounded memory usage, and in this case I could see the onTaskEnd callback was being called after the onStageCompleted callback, so it couldn't find the stageId in the map. We worked around it by replacing the map with a timed cache. was: It seems that when a stage fails because one of its tasks failed too many times, the onStageCompleted callback of the SparkListener is called before the onTaskEnd listener for the failing task. We're using structured streaming in this case. We noticed this because we built a listener to track the precise number of active tasks per one of my processes to be exported as a metric and were using the stage callback to maintain a map from stage ids to some metadata extracted from the jobGroupId. The onStageCompleted listener was removing from the map to prevent unbounded memory, and in this case I could see the onTaskEnd callback was being called after the onStageCompleted callback, so it couldn't find the stageId in the map. We worked around it by replacing the map with a timed cache. > When a stage fails onStageCompleted is called before onTaskEnd > -- > > Key: SPARK-24848 > URL: https://issues.apache.org/jira/browse/SPARK-24848 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Yavgeni Hotimsky >Priority: Minor > > It seems that when a stage fails because one of its tasks failed too many > times, the onStageCompleted callback of the SparkListener is called before the > onTaskEnd listener for the failing task. We're using structured streaming in > this case. > We noticed this because we built a listener to track the precise number of > active tasks to be exported as a metric and were using the stage callback to > maintain a map from stage ids to some metadata extracted from the jobGroupId. > The onStageCompleted listener was removing from the map to prevent unbounded > memory usage, and in this case I could see the onTaskEnd callback was being > called after the onStageCompleted callback, so it couldn't find the stageId in > the map. We worked around it by replacing the map with a timed cache. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24848) When a stage fails onStageCompleted is called before onTaskEnd
Yavgeni Hotimsky created SPARK-24848: Summary: When a stage fails onStageCompleted is called before onTaskEnd Key: SPARK-24848 URL: https://issues.apache.org/jira/browse/SPARK-24848 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.0 Reporter: Yavgeni Hotimsky It seems that when a stage fails because one of its tasks failed too many times, the onStageCompleted callback of the SparkListener is called before the onTaskEnd listener for the failing task. We're using structured streaming in this case. We noticed this because we built a listener to track the precise number of active tasks per one of my processes to be exported as a metric and were using the stage callback to maintain a map from stage ids to some metadata extracted from the jobGroupId. The onStageCompleted listener was removing from the map to prevent unbounded memory, and in this case I could see the onTaskEnd callback was being called after the onStageCompleted callback, so it couldn't find the stageId in the map. We worked around it by replacing the map with a timed cache. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
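To illustrate the setup described in SPARK-24848, here is a minimal sketch of a listener that keeps a stage-id-to-metadata map and an active-task counter. Class and field names are invented for the sketch and metric export is reduced to a println; it relies on the assumption the report says is violated on stage failure, namely that every onTaskEnd for a stage arrives before that stage's onStageCompleted, which is why the lookup in onTaskEnd can miss and why the reporter moved to a timed cache.

{code:java}
import java.util.concurrent.atomic.AtomicInteger
import scala.collection.concurrent.TrieMap
import org.apache.spark.scheduler._

// Illustrative listener for the scenario in SPARK-24848 (not the reporter's code).
class ActiveTaskListener extends SparkListener {
  private val stageMeta = TrieMap.empty[Int, String]   // stageId -> metadata from the job group
  private val activeTasks = new AtomicInteger(0)

  override def onStageSubmitted(event: SparkListenerStageSubmitted): Unit = {
    val jobGroup = Option(event.properties)
      .map(_.getProperty("spark.jobGroup.id", "unknown")).getOrElse("unknown")
    stageMeta.put(event.stageInfo.stageId, jobGroup)
  }

  override def onTaskStart(event: SparkListenerTaskStart): Unit =
    activeTasks.incrementAndGet()

  override def onTaskEnd(event: SparkListenerTaskEnd): Unit = {
    activeTasks.decrementAndGet()
    stageMeta.get(event.stageId) match {
      case Some(meta) => println(s"active=${activeTasks.get()} group=$meta") // stand-in for metric export
      case None       => () // stage entry already removed by onStageCompleted (the reported ordering)
    }
  }

  // On a failed stage this can run before the last onTaskEnd, so the entry
  // disappears too early; a time-based cache avoids losing the metadata.
  override def onStageCompleted(event: SparkListenerStageCompleted): Unit =
    stageMeta.remove(event.stageInfo.stageId)
}
{code}

Registration would be the usual spark.sparkContext.addSparkListener(new ActiveTaskListener).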
[jira] [Created] (SPARK-24847) ScalaReflection#schemaFor occasionally fails to detect schema for Seq of type alias
Ahmed Mahran created SPARK-24847: Summary: ScalaReflection#schemaFor occasionally fails to detect schema for Seq of type alias Key: SPARK-24847 URL: https://issues.apache.org/jira/browse/SPARK-24847 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: Ahmed Mahran org.apache.spark.sql.catalyst.ScalaReflection#schemaFor occasionally fails to detect schema for Seq of type alias (and it occasionally succeeds). {code:java} object Types { type Alias1 = Long type Alias2 = Int type Alias3 = Int } case class B(b1: Alias1, b2: Seq[Alias2], b3: Option[Alias3]) case class A(a1: B, a2: Int) {code} {code} import sparkSession.implicits._ val seq = Seq( A(B(2L, Seq(3), Some(1)), 1), A(B(3L, Seq(2), Some(2)), 2) ) val ds = sparkSession.createDataset(seq) {code} {code:java} java.lang.UnsupportedOperationException: Schema for type Seq[Types.Alias2] is not supported at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:780) at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:715) at org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:824) at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:714) at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1$$anonfun$7.apply(ScalaReflection.scala:381) at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1$$anonfun$7.apply(ScalaReflection.scala:380) at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1.apply(ScalaReflection.scala:380) at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1.apply(ScalaReflection.scala:150) at org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:824) at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39) at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor(ScalaReflection.scala:150) at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1$$anonfun$7.apply(ScalaReflection.scala:391) at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1$$anonfun$7.apply(ScalaReflection.scala:380) at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1.apply(ScalaReflection.scala:380) at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1.apply(ScalaReflection.scala:150) at org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:824) at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39) at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor(ScalaReflection.scala:150) at org.apache.spark.sql.catalyst.ScalaReflection$.deserializerFor(ScalaReflection.scala:138) at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:72) at org.apache.spark.sql.Encoders$.product(Encoders.scala:275) at org.apache.spark.sql.LowPrioritySQLImplicits$class.newProductEncoder(SQLImplicits.scala:248) at org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:34) {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
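For convenience, here is a self-contained version of the reproduction above. It only adds the pieces the snippet assumes (a local SparkSession, the import of Types, and a main method) and changes nothing about the types involved.

{code:java}
import org.apache.spark.sql.SparkSession

object Types {
  type Alias1 = Long
  type Alias2 = Int
  type Alias3 = Int
}
import Types._

case class B(b1: Alias1, b2: Seq[Alias2], b3: Option[Alias3])
case class A(a1: B, a2: Int)

object Spark24847Repro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("SPARK-24847-repro").getOrCreate()
    import spark.implicits._
    val seq = Seq(
      A(B(2L, Seq(3), Some(1)), 1),
      A(B(3L, Seq(2), Some(2)), 2))
    // Per the report, this intermittently fails with:
    // UnsupportedOperationException: Schema for type Seq[Types.Alias2] is not supported
    val ds = spark.createDataset(seq)
    ds.show()
    spark.stop()
  }
}
{code}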
[jira] [Updated] (SPARK-18600) BZ2 CRC read error needs better reporting
[ https://issues.apache.org/jira/browse/SPARK-18600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell updated SPARK-18600: -- Labels: spree (was: ) > BZ2 CRC read error needs better reporting > - > > Key: SPARK-18600 > URL: https://issues.apache.org/jira/browse/SPARK-18600 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Charles R Allen >Priority: Minor > Labels: spree > > {code} > 16/11/25 20:05:03 ERROR InsertIntoHadoopFsRelationCommand: Aborting job. > org.apache.spark.SparkException: Job aborted due to stage failure: Task 148 > in stage 5.0 failed 1 times, most recent failure: Lost task 148.0 in stage > 5.0 (TID 5945, localhost): org.apache.spark.SparkException: Task failed while > writing rows > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: com.univocity.parsers.common.TextParsingException: > java.lang.IllegalStateException - Error reading from input > Parser Configuration: CsvParserSettings: > Auto configuration enabled=true > Autodetect column delimiter=false > Autodetect quotes=false > Column reordering enabled=true > Empty value=null > Escape unquoted values=false > Header extraction enabled=null > Headers=[INTERVALSTARTTIME_GMT, INTERVALENDTIME_GMT, OPR_DT, OPR_HR, > NODE_ID_XML, NODE_ID, NODE, MARKET_RUN_ID, LMP_TYPE, XML_DATA_ITEM, > PNODE_RESMRID, GRP_TYPE, POS, VALUE, OPR_INTERVAL, GROUP] > Ignore leading whitespaces=false > Ignore trailing whitespaces=false > Input buffer size=128 > Input reading on separate thread=false > Keep escape sequences=false > Line separator detection enabled=false > Maximum number of characters per column=100 > Maximum number of columns=20480 > Normalize escaped line separators=true > Null value= > Number of records to read=all > Row processor=none > RowProcessor error handler=null > Selected fields=none > Skip empty lines=true > Unescaped quote handling=STOP_AT_DELIMITERFormat configuration: > CsvFormat: > Comment character=\0 > Field delimiter=, > Line separator (normalized)=\n > Line separator sequence=\n > Quote character=" > Quote escape character=\ > Quote escape escape character=null > Internal state when error was thrown: line=27089, column=13, record=27089, > charIndex=4451456, headers=[INTERVALSTARTTIME_GMT, INTERVALENDTIME_GMT, > OPR_DT, OPR_HR, NODE_ID_XML, NODE_ID, NODE, MARKET_RUN_ID, LMP_TYPE, > XML_DATA_ITEM, PNODE_RESMRID, GRP_TYPE, POS, VALUE, OPR_INTERVAL, GROUP] > at > com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:302) > at > com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:431) > at > org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:148) > at > 
org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:131) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execu
[jira] [Updated] (SPARK-23612) Specify formats for individual DateType and TimestampType columns in schemas
[ https://issues.apache.org/jira/browse/SPARK-23612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell updated SPARK-23612: -- Labels: DataType date spree sql (was: DataType date sql) > Specify formats for individual DateType and TimestampType columns in schemas > > > Key: SPARK-23612 > URL: https://issues.apache.org/jira/browse/SPARK-23612 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Patrick Young >Priority: Minor > Labels: DataType, date, spree, sql > > [https://github.com/apache/spark/blob/407f67249639709c40c46917700ed6dd736daa7d/python/pyspark/sql/types.py#L162-L200] > It would be very helpful if it were possible to specify the format for > individual columns in a schema when reading csv files, rather than one format: > {code:java|title=Bar.python|borderStyle=solid} > # Currently can only do something like: > spark.read.option("dateFormat", "yyyyMMdd").csv(...) > # Would like to be able to do something like: > schema = StructType([ > StructField("date1", DateType(format="MM/dd/yyyy"), True), > StructField("date2", DateType(format="yyyyMMdd"), True) > ]) > spark.read.schema(schema).csv(...) > {code} > Thanks for any help, input! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
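Until something like the proposed per-column format option exists, a common workaround is to read the date columns as strings and convert each one with its own format. The sketch below is in Scala rather than PySpark (the PySpark equivalent uses the same functions); the file path, column names, and format strings are made up for illustration, while to_date(column, format) is an existing Spark SQL function.

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_date}

// Workaround sketch: per-column date parsing after reading the dates as strings.
object PerColumnDateFormats {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("per-column-date-formats").getOrCreate()
    val raw = spark.read.option("header", "true").csv("/tmp/dates.csv") // date1, date2 arrive as strings
    val parsed = raw
      .withColumn("date1", to_date(col("date1"), "MM/dd/yyyy"))
      .withColumn("date2", to_date(col("date2"), "yyyyMMdd"))
    parsed.printSchema() // date1 and date2 are now DateType
    spark.stop()
  }
}
{code}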
[jira] [Updated] (SPARK-24838) Support uncorrelated IN/EXISTS subqueries for more operators
[ https://issues.apache.org/jira/browse/SPARK-24838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell updated SPARK-24838: -- Labels: spree (was: ) > Support uncorrelated IN/EXISTS subqueries for more operators > - > > Key: SPARK-24838 > URL: https://issues.apache.org/jira/browse/SPARK-24838 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Qifan Pu >Priority: Major > Labels: spree > > Currently, CheckAnalysis allows IN/EXISTS subquery only for filter operators. > Running a query: > {{select name in (select * from valid_names)}} > {{from all_names}} > returns error: > {code:java} > Error in SQL statement: AnalysisException: IN/EXISTS predicate sub-queries > can only be used in a Filter > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24846) Stabilize expression canonicalization
Herman van Hovell created SPARK-24846: - Summary: Stabilize expression canonicalization Key: SPARK-24846 URL: https://issues.apache.org/jira/browse/SPARK-24846 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.1 Reporter: Herman van Hovell Spark plan canonicalization can be non-deterministic between different versions of Spark due to the fact that {{ExprId}} uses a UUID. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24536) Query with nonsensical LIMIT hits AssertionError
[ https://issues.apache.org/jira/browse/SPARK-24536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell updated SPARK-24536: -- Labels: beginner spree (was: beginner) > Query with nonsensical LIMIT hits AssertionError > > > Key: SPARK-24536 > URL: https://issues.apache.org/jira/browse/SPARK-24536 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Alexander Behm >Priority: Trivial > Labels: beginner, spree > > SELECT COUNT(1) FROM t LIMIT CAST(NULL AS INT) > fails in the QueryPlanner with: > {code} > java.lang.AssertionError: assertion failed: No plan for GlobalLimit null > {code} > I think this issue should be caught earlier during semantic analysis. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
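To make the "caught earlier during semantic analysis" suggestion in SPARK-24536 concrete, here is a standalone sketch of such a check. It is not Spark's actual CheckAnalysis code: the object and method names are invented, and it throws IllegalArgumentException instead of Spark's analysis error type as a simplification. It rejects a LIMIT expression that is non-foldable, non-integer, null, or negative before planning ever sees it, which would cover the CAST(NULL AS INT) case above via the null branch.

{code:java}
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.types.IntegerType

// Standalone sketch of an analysis-time LIMIT sanity check (illustrative only).
object LimitCheckSketch {
  def validateLimit(limitExpr: Expression): Unit = {
    if (!limitExpr.foldable) {
      throw new IllegalArgumentException(
        s"The limit expression must evaluate to a constant value, but got: ${limitExpr.sql}")
    }
    if (limitExpr.dataType != IntegerType) {
      throw new IllegalArgumentException(
        s"The limit expression must be integer type, but got: ${limitExpr.dataType.catalogString}")
    }
    limitExpr.eval() match {
      case null =>
        throw new IllegalArgumentException("The evaluated limit expression must not be null")
      case v: Int if v < 0 =>
        throw new IllegalArgumentException(s"The limit expression must be >= 0, but got: $v")
      case _ => // ok
    }
  }
}
{code}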