[jira] [Created] (SPARK-29379) SHOW FUNCTIONS don't show '!=', '<>' , 'between', 'case'

2019-10-07 Thread angerszhu (Jira)
angerszhu created SPARK-29379:
-

 Summary: SHOW FUNCTIONS don't show '!=', '<>' , 'between', 'case'
 Key: SPARK-29379
 URL: https://issues.apache.org/jira/browse/SPARK-29379
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0, 3.0.0
Reporter: angerszhu


SHOW FUNCTIONS does not list '!=', '<>', 'between', or 'case'.
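A minimal spark-shell style check of the report (assuming SHOW FUNCTIONS returns the function name as its first string column):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("show-functions").getOrCreate()

// Collect everything SHOW FUNCTIONS lists and look for the operators in question.
val listed = spark.sql("SHOW FUNCTIONS").collect().map(_.getString(0)).toSet
Seq("!=", "<>", "between", "case").foreach { name =>
  println(s"$name listed: ${listed.contains(name)}")
}
{code}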






[jira] [Commented] (SPARK-24640) size(null) returns null

2019-10-07 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16946503#comment-16946503
 ] 

Maxim Gekk commented on SPARK-24640:


As far as I remember, we planned to remove spark.sql.legacy.sizeOfNull in 3.0. 
[~hyukjin.kwon] [~smilegator] This ticket is a reminder of that.
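A small sketch of the behaviour the flag controls (assuming Spark 2.4+, where spark.sql.legacy.sizeOfNull exists and can be changed at runtime):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("size-of-null").getOrCreate()

// Legacy behaviour (the 2.x default): size(NULL) returns -1.
spark.conf.set("spark.sql.legacy.sizeOfNull", "true")
spark.sql("SELECT size(cast(null AS array<int>))").show()

// Proposed default for 3.0: size(NULL) returns NULL.
spark.conf.set("spark.sql.legacy.sizeOfNull", "false")
spark.sql("SELECT size(cast(null AS array<int>))").show()
{code}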

> size(null) returns null 
> 
>
> Key: SPARK-24640
> URL: https://issues.apache.org/jira/browse/SPARK-24640
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Xiao Li
>Priority: Major
>  Labels: api, bulk-closed
>
> size(null) should return null instead of -1 in the 3.0 release. This is a 
> behavior change. 






[jira] [Resolved] (SPARK-25008) Add memory mode info to showMemoryUsage in TaskMemoryManager

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25008.
--
Resolution: Incomplete

> Add memory mode info to showMemoryUsage in TaskMemoryManager
> 
>
> Key: SPARK-25008
> URL: https://issues.apache.org/jira/browse/SPARK-25008
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Ankur Gupta
>Priority: Major
>  Labels: bulk-closed
>
> TaskMemoryManager prints the current memory usage information before throwing 
> an OOM exception, which is helpful when debugging issues. This log does not 
> include the memory mode (on-heap vs. off-heap), which would also be useful for 
> quickly determining which kind of memory users need to increase.
> This JIRA is to add that information to the showMemoryUsage method of 
> TaskMemoryManager.
> Current logs:
> {code}
> 18/07/03 17:57:16 INFO memory.TaskMemoryManager: Memory used in task 318
> 18/07/03 17:57:16 INFO memory.TaskMemoryManager: Acquired by 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@7f084d1b: 
> 1024.0 KB
> 18/07/03 17:57:16 INFO memory.TaskMemoryManager: Acquired by 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter@713d50f2: 32.0 KB
> 18/07/03 17:57:16 INFO memory.TaskMemoryManager: 0 bytes of memory were used 
> by task 318 but are not associated with specific consumers
> 18/07/03 17:57:16 INFO memory.TaskMemoryManager: 1081344 bytes of memory are 
> used for execution and 306201016 bytes of memory are used for storage
> 18/07/03 17:57:16 ERROR executor.Executor: Exception in task 86.0 in stage 
> 49.0 (TID 318)
> java.lang.OutOfMemoryError: Unable to acquire 326284160 bytes of memory, got 
> 3112960
>  at 
> org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:127)
>  at 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter.acquireNewPageIfNecessary(ShuffleExternalSorter.java:359)
>  at 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:382)
>  at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:246)
>  at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:167)
>  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>  at org.apache.spark.scheduler.Task.run(Task.scala:108)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
> {code}






[jira] [Resolved] (SPARK-25230) Upper behavior incorrect for string contains "ß"

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25230.
--
Resolution: Incomplete

> Upper behavior incorrect for string contains "ß"
> 
>
> Key: SPARK-25230
> URL: https://issues.apache.org/jira/browse/SPARK-25230
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Yuming Wang
>Priority: Major
>  Labels: bulk-closed
> Attachments: MySQL.png, Oracle.png, Teradata.jpeg
>
>
> How to reproduce:
> {code:sql}
> spark-sql> SELECT upper('Haßler');
> HASSLER
> {code}
> Mainstream databases return {{HAßLER}}.
>  !MySQL.png!
>  
> This behavior may lead to data inconsistency:
> {code:sql}
> create temporary view SPARK_25230 as select * from values
>   ("Hassler"),
>   ("Haßler")
> as EMPLOYEE(name);
> select UPPER(name) from SPARK_25230 group by 1;
> -- result
> HASSLER
> {code}






[jira] [Resolved] (SPARK-24074) Maven package resolver downloads javadoc instead of jar

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24074.
--
Resolution: Incomplete

> Maven package resolver downloads javadoc instead of jar
> ---
>
> Key: SPARK-24074
> URL: https://issues.apache.org/jira/browse/SPARK-24074
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.3.0
>Reporter: Nadav Samet
>Priority: Major
>  Labels: bulk-closed
>
> For some reason Spark downloads the javadoc artifact of a package instead of 
> the jar.
> Steps to reproduce:
> 1. Delete (or move) your local ~/.ivy2 cache to force Spark to resolve and 
> fetch artifacts from central:
> {code:java}
> rm -rf ~/.ivy2
> {code}
> 2. Run:
> {code:java}
> ~/dev/spark-2.3.0-bin-hadoop2.7/bin/spark-shell --packages 
> org.scalanlp:breeze_2.11:0.13.2{code}
> 3. Spark downloads the javadoc instead of the jar:
> {code:java}
> downloading 
> https://repo1.maven.org/maven2/net/sourceforge/f2j/arpack_combined_all/0.1/arpack_combined_all-0.1-javadoc.jar
>  ...
> [SUCCESSFUL ] 
> net.sourceforge.f2j#arpack_combined_all;0.1!arpack_combined_all.jar 
> (610ms){code}
> 4. Later, Spark complains that it couldn't find the jar:
> {code:java}
> Warning: Local jar 
> /Users/thesamet/.ivy2/jars/net.sourceforge.f2j_arpack_combined_all-0.1.jar 
> does not exist, skipping.
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).{code}
> 5. The declared dependency of breeze on f2j_arpack_combined seems fine: 
> [http://central.maven.org/maven2/org/scalanlp/breeze_2.11/0.13.2/breeze_2.11-0.13.2.pom]
>  






[jira] [Resolved] (SPARK-25165) Cannot parse Hive Struct

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25165.
--
Resolution: Incomplete

> Cannot parse Hive Struct
> 
>
> Key: SPARK-25165
> URL: https://issues.apache.org/jira/browse/SPARK-25165
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.1
>Reporter: Frank Yin
>Priority: Major
>  Labels: bulk-closed
>
> org.apache.spark.SparkException: Cannot recognize hive type string: 
> struct,view.b:array>
>  
> My guess is that the dot (.) in the field names is causing the parsing issue. 






[jira] [Resolved] (SPARK-22132) Document the Dispatcher REST API

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-22132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-22132.
--
Resolution: Incomplete

> Document the Dispatcher REST API
> 
>
> Key: SPARK-22132
> URL: https://issues.apache.org/jira/browse/SPARK-22132
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.3.0
>Reporter: Arthur Rand
>Priority: Minor
>  Labels: bulk-closed
>
> The Dispatcher has a REST API for managing jobs in a Mesos cluster, but it's 
> currently undocumented, meaning that users have to reference the source code 
> for programmatic access. 






[jira] [Resolved] (SPARK-24838) Support uncorrelated IN/EXISTS subqueries for more operators

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24838.
--
Resolution: Incomplete

> Support uncorrelated IN/EXISTS subqueries for more operators 
> -
>
> Key: SPARK-24838
> URL: https://issues.apache.org/jira/browse/SPARK-24838
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Qifan Pu
>Priority: Major
>  Labels: bulk-closed
>
> Currently, CheckAnalysis allows IN/EXISTS subqueries only in Filter operators. 
> Running a query:
> {{select name in (select * from valid_names) from all_names}}
> returns the error:
> {code:java}
> Error in SQL statement: AnalysisException: IN/EXISTS predicate sub-queries 
> can only be used in a Filter
> {code}






[jira] [Resolved] (SPARK-24524) Improve aggregateMetrics: less memory usage and loops

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24524.
--
Resolution: Incomplete

> Improve aggregateMetrics: less memory usage and loops
> -
>
> Key: SPARK-24524
> URL: https://issues.apache.org/jira/browse/SPARK-24524
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Gengliang Wang
>Priority: Major
>  Labels: bulk-closed
>
> The function `aggregateMetrics` processes metrics from both executors and the 
> driver. The data can be large. 
> This PR is to improve the implementation with one loop (before converting to 
> string) and one dynamic data structure.






[jira] [Resolved] (SPARK-10413) ML models should support prediction on single instances

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-10413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-10413.
--
Resolution: Incomplete

> ML models should support prediction on single instances
> ---
>
> Key: SPARK-10413
> URL: https://issues.apache.org/jira/browse/SPARK-10413
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Reporter: Xiangrui Meng
>Priority: Critical
>  Labels: bulk-closed
>
> Currently, models in the pipeline API only implement transform(DataFrame). It 
> would be quite useful to support prediction on single instances.
> UPDATE: This issue is for making predictions with single models.  We can make 
> methods like {{def predict(features: Vector): Double}} public.
> * This issue is *not* for single-instance prediction for full Pipelines, 
> which would require making predictions on {{Row}}s.
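A rough sketch of the kind of call this would enable (the model type is only an example; in released 2.x versions predict(features: Vector) is not public):

{code:scala}
import org.apache.spark.ml.classification.LogisticRegressionModel
import org.apache.spark.ml.linalg.Vectors

// Hypothetical single-instance scoring once predict(features: Vector) is public,
// instead of wrapping a single row in a DataFrame just to call transform().
def scoreOne(model: LogisticRegressionModel): Double =
  model.predict(Vectors.dense(0.5, 1.2, -0.3))
{code}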






[jira] [Resolved] (SPARK-16203) regexp_extract to return an ArrayType(StringType())

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-16203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-16203.
--
Resolution: Incomplete

> regexp_extract to return an ArrayType(StringType())
> ---
>
> Key: SPARK-16203
> URL: https://issues.apache.org/jira/browse/SPARK-16203
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Max Moroz
>Priority: Minor
>  Labels: bulk-closed
>
> regexp_extract only returns a single matched group. If (as is often the case, 
> e.g. in web log parsing) we need to parse the entire line and get all the 
> groups, we have to call it as many times as there are groups.
> Syntactically this is only a minor annoyance.
> But unless I misunderstand something, it would also be very inefficient. (How 
> would Spark know not to do multiple pattern-matching operations when only 
> one is needed? Or does the optimizer actually check whether the patterns are 
> identical and, if they are, avoid the repeated regex matching?)
> Would it be possible to have it return an array when the index is not 
> specified (defaulting to None)?
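To illustrate the per-group cost described above, a small sketch (the column name and pattern are made up):

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.regexp_extract

val spark = SparkSession.builder().master("local[*]").appName("regexp-groups").getOrCreate()
import spark.implicits._

val logs = Seq("GET /index.html 200").toDF("line")
val pattern = """(\S+) (\S+) (\d+)"""

// Today: one regexp_extract call per group, each repeating the same pattern.
logs.select(
  regexp_extract($"line", pattern, 1).as("method"),
  regexp_extract($"line", pattern, 2).as("path"),
  regexp_extract($"line", pattern, 3).as("status")
).show()

// Proposed (hypothetical): calling it without a group index would return all
// groups at once as an array<string>, e.g. ["GET", "/index.html", "200"].
{code}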






[jira] [Resolved] (SPARK-23790) proxy-user failed connecting to a kerberos configured metastore

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23790.
--
Resolution: Incomplete

> proxy-user failed connecting to a kerberos configured metastore
> ---
>
> Key: SPARK-23790
> URL: https://issues.apache.org/jira/browse/SPARK-23790
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.3.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>  Labels: bulk-closed
>
> This appeared at a customer trying to integrate with a kerberized hdfs 
> cluster.
> This can be easily fixed with the proposed fix 
> [here|https://github.com/apache/spark/pull/17333] and the problem was 
> reported first [here|https://issues.apache.org/jira/browse/SPARK-19995] for 
> yarn.
> The other option is to add the delegation tokens to the current user's UGI as 
> in [here|https://github.com/apache/spark/pull/17335]. The latter fixes the 
> problem but leads to a failure when someone uses a HadoopRDD, because the 
> latter uses FileInputFormat to get the splits, which consults the local ticket 
> cache by calling TokenCache.obtainTokensForNamenodes. Eventually this will fail 
> with:
> {quote}Exception in thread "main" 
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token 
> can be issued only with kerberos or web authentication at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:5896)
> {quote}
> This implies that the security mode is SIMPLE and the Hadoop libs there are not 
> aware of Kerberos.
> Related to this issue, the workaround decided on was to 
> [trick|https://github.com/apache/spark/blob/a33655348c4066d9c1d8ad2055aadfbc892ba7fd/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L795-L804]
>  Hadoop.
>  






[jira] [Resolved] (SPARK-19609) Broadcast joins should pushdown join constraints as Filter to the larger relation

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-19609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-19609.
--
Resolution: Incomplete

> Broadcast joins should pushdown join constraints as Filter to the larger 
> relation
> -
>
> Key: SPARK-19609
> URL: https://issues.apache.org/jira/browse/SPARK-19609
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Nick Dimiduk
>Priority: Major
>  Labels: bulk-closed
>
> For broadcast inner-joins, where the smaller relation is known to be small 
> enough to materialize on a worker, the set of values for all join columns is 
> known and fits in memory. Spark should translate these values into a 
> {{Filter}} pushed down to the datasource. The common join condition of 
> equality, i.e. {{lhs.a == rhs.a}}, can be written as an {{a in ...}} clause. 
> An example of pushing such filters is already present in the form of 
> {{IsNotNull}} filters via [~sameerag]'s work on SPARK-12957 subtasks.
> This optimization could even work when the smaller relation does not fit 
> entirely in memory. This could be done by partitioning the smaller relation 
> into N pieces, applying this predicate pushdown for each piece, and unioning 
> the results.
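A hand-written sketch of the proposed pushdown as it could be expressed today (the function and column names are illustrative):

{code:scala}
import org.apache.spark.sql.DataFrame

// Collect the small side's join keys (assumed to fit in memory) and push them
// onto the large relation as an IN filter before the broadcast join.
def withJoinKeyFilter(large: DataFrame, small: DataFrame, key: String): DataFrame = {
  val keys = small.select(key).distinct().collect().map(_.get(0))
  large.filter(large(key).isin(keys: _*))
}
{code}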






[jira] [Resolved] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-5556.
-
Resolution: Incomplete

> Latent Dirichlet Allocation (LDA) using Gibbs sampler 
> --
>
> Key: SPARK-5556
> URL: https://issues.apache.org/jira/browse/SPARK-5556
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Guoqiang Li
>Assignee: Pedro Rodriguez
>Priority: Major
>  Labels: bulk-closed
> Attachments: LDA_test.xlsx, spark-summit.pptx
>
>







[jira] [Resolved] (SPARK-23498) Accuracy problem in comparison with string and integer

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23498.
--
Resolution: Incomplete

> Accuracy problem in comparison with string and integer
> --
>
> Key: SPARK-23498
> URL: https://issues.apache.org/jira/browse/SPARK-23498
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.2.1, 2.3.0
>Reporter: Kevin Zhang
>Priority: Major
>  Labels: bulk-closed
>
> While comparing a string column with an integer value, Spark SQL will 
> automatically cast the string operand to int, so the following SQL returns 
> true in Hive but false in Spark:
>  
> {code:java}
> select '1000.1'>1000
> {code}
>  
>  From the physical plan we can see that the string operand was cast to int, 
> which caused the accuracy loss:
> {code:java}
> *Project [false AS (CAST(1000.1 AS INT) > 1000)#4]
> +- Scan OneRowRelation[]
> {code}
> To solve this, casting both operands of the binary operator to a wider common 
> type such as double may be safer.
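A small sketch of the reported behaviour and the explicit-cast workaround (a local SparkSession is created just for illustration):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("string-int-compare").getOrCreate()

// Reported behaviour: the string operand is cast to int, dropping the fraction.
spark.sql("SELECT '1000.1' > 1000").show()   // false in Spark, true in Hive

// Workaround today, and essentially what the ticket proposes to do implicitly:
// compare using a wider common type such as double.
spark.sql("SELECT cast('1000.1' AS double) > cast(1000 AS double)").show()   // true
{code}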






[jira] [Resolved] (SPARK-24118) Support lineSep format independent from encoding

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24118.
--
Resolution: Incomplete

> Support lineSep format independent from encoding
> 
>
> Key: SPARK-24118
> URL: https://issues.apache.org/jira/browse/SPARK-24118
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Major
>  Labels: bulk-closed
>
> Currently, the lineSep option of the JSON datasource depends on the encoding. It 
> is impossible to define a correct lineSep for JSON files with a BOM in UTF-16 or 
> UTF-32 encoding, for example. We need to propose a format for lineSep that 
> represents a sequence of octets (bytes) and is independent of the encoding.
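For context, a sketch of how the option is used today (assuming Spark 2.4+, where the JSON reader accepts the encoding and lineSep options; the path is hypothetical):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("linesep").getOrCreate()

// Today lineSep is given as a string and is interpreted in the chosen encoding,
// which is why an encoding-independent byte-sequence format is being proposed.
val df = spark.read
  .option("encoding", "UTF-16LE")
  .option("lineSep", "\n")
  .json("/path/to/data.json")
{code}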






[jira] [Resolved] (SPARK-21016) Improve code fault tolerance for converting string to number

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-21016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-21016.
--
Resolution: Incomplete

> Improve code fault tolerance for converting string to number
> 
>
> Key: SPARK-21016
> URL: https://issues.apache.org/jira/browse/SPARK-21016
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Priority: Minor
>  Labels: bulk-closed
>
> When converting a string to a number (int, long, or double), a space before or 
> after the digits is easy to miss, especially in user configuration values.
> For example:
> conf.setConfString(key, "  20")  // has a space in the string "  20"
> conf.getConf(confEntry, 5)  // This statement fails
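A minimal sketch of the kind of tolerance being asked for, i.e. trimming surrounding whitespace before converting (the helper name is made up):

{code:scala}
// Trim surrounding whitespace before converting instead of failing outright.
def toIntTolerant(s: String): Option[Int] =
  try Some(s.trim.toInt)
  catch { case _: NumberFormatException => None }

toIntTolerant("  20")   // Some(20) rather than a parse failure
toIntTolerant("abc")    // None
{code}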






[jira] [Resolved] (SPARK-24221) Retry spark app submission to k8 in KubernetesClientApplication

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24221.
--
Resolution: Incomplete

> Retry spark app submission to k8 in KubernetesClientApplication
> ---
>
> Key: SPARK-24221
> URL: https://issues.apache.org/jira/browse/SPARK-24221
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Yifei Huang
>Priority: Major
>  Labels: bulk-closed
>
> Following from https://issues.apache.org/jira/browse/SPARK-24135, drivers, in 
> addition to executors, could suffer from init-container failures in 
> Kubernetes. Currently, we fail the entire application if that's the case, so 
> it's up to the client to detect those errors and retry. However, since both 
> driver and executor initialization have the same failure case, it seems like 
> we're repeating logic in two places. Would it be better to consolidate this 
> retry logic in `KubernetesClientApplication`?
> We could still count executor pod initialization failures in 
> `KubernetesClusterSchedulerBackend` and decide what to do with the 
> application if there are too many failures there, but we'd be guaranteed a 
> set number of retries for each executor before giving up. Or would this be 
> too confusing and obfuscate the true number of retries? We could also 
> configure the number of driver and executor retries separately. It just seems 
> like if we're tackling init-container failure retries for executors, we 
> should also provide support for drivers as well since they suffer from the 
> same problem. 






[jira] [Resolved] (SPARK-12014) Spark SQL query containing semicolon is broken in Beeline (related to HIVE-11100)

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-12014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-12014.
--
Resolution: Incomplete

> Spark SQL query containing semicolon is broken in Beeline (related to 
> HIVE-11100)
> -
>
> Key: SPARK-12014
> URL: https://issues.apache.org/jira/browse/SPARK-12014
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Teng Qiu
>Priority: Minor
>  Labels: bulk-closed
>
> Actually it is a known Hive issue: 
> https://issues.apache.org/jira/browse/HIVE-11100
> patch available: https://reviews.apache.org/r/35907/diff/1
> but Spark uses its own Maven dependencies for Hive (org.spark-project.hive), so 
> we cannot use this patch to fix the problem; it would be better if you can 
> fix this in Spark's Hive package.
> In Spark's Beeline, the error messages will be:
> {code}
> 0: jdbc:hive2://host:1/> CREATE TABLE beeline_tb (c1 int, c2 string) ROW 
> FORMAT DELIMITED FIELDS TERMINATED BY ';' LINES TERMINATED BY '\n';
> Error: org.apache.spark.sql.AnalysisException: mismatched input '' 
> expecting StringLiteral near 'BY' in table row format's field separator; line 
> 1 pos 87 (state=,code=0)
> 0: jdbc:hive2://host:1/> CREATE TABLE beeline_tb (c1 int, c2 string) ROW 
> FORMAT DELIMITED FIELDS TERMINATED BY '\;' LINES TERMINATED BY '\n';
> Error: org.apache.spark.sql.AnalysisException: mismatched input '' 
> expecting StringLiteral near 'BY' in table row format's field separator; line 
> 1 pos 88 (state=,code=0)
> 0: jdbc:hive2://host:1/> SELECT 
> str_to_map(other_data,';','=')['key_name'] FROM some_logs WHERE log_date = 
> '20151125' limit 5;
> Error: org.apache.spark.sql.AnalysisException: cannot recognize input near 
> '' '' '' in select expression; line 1 pos 30 (state=,code=0)
> 0: jdbc:hive2://host:1/> SELECT 
> str_to_map(other_data,'\;','=')['key_name'] FROM some_logs WHERE log_date = 
> '20151125' limit 5;
> Error: org.apache.spark.sql.AnalysisException: cannot recognize input near 
> '' '' '' in select expression; line 1 pos 31 (state=,code=0)
> {code}






[jira] [Resolved] (SPARK-24051) Incorrect results for certain queries using Java and Python APIs on Spark 2.3.0

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24051.
--
Resolution: Incomplete

> Incorrect results for certain queries using Java and Python APIs on Spark 
> 2.3.0
> ---
>
> Key: SPARK-24051
> URL: https://issues.apache.org/jira/browse/SPARK-24051
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Emlyn Corrin
>Priority: Major
>  Labels: bulk-closed
>
> I'm seeing Spark 2.3.0 return incorrect results for a certain (very specific) 
> query, demonstrated by the Java program below. It was simplified from a much 
> more complex query, but I'm having trouble simplifying it further without 
> removing the erroneous behaviour.
> {code:java}
> package sparktest;
> import org.apache.spark.SparkConf;
> import org.apache.spark.sql.*;
> import org.apache.spark.sql.expressions.Window;
> import org.apache.spark.sql.types.DataTypes;
> import org.apache.spark.sql.types.Metadata;
> import org.apache.spark.sql.types.StructField;
> import org.apache.spark.sql.types.StructType;
> import java.util.Arrays;
> public class Main {
> public static void main(String[] args) {
> SparkConf conf = new SparkConf()
> .setAppName("SparkTest")
> .setMaster("local[*]");
> SparkSession session = 
> SparkSession.builder().config(conf).getOrCreate();
> Row[] arr1 = new Row[]{
> RowFactory.create(1, 42),
> RowFactory.create(2, 99)};
> StructType sch1 = new StructType(new StructField[]{
> new StructField("a", DataTypes.IntegerType, true, 
> Metadata.empty()),
> new StructField("b", DataTypes.IntegerType, true, 
> Metadata.empty())});
> Dataset<Row> ds1 = session.createDataFrame(Arrays.asList(arr1), sch1);
> ds1.show();
> Row[] arr2 = new Row[]{
> RowFactory.create(3)};
> StructType sch2 = new StructType(new StructField[]{
> new StructField("a", DataTypes.IntegerType, true, 
> Metadata.empty())});
> Dataset<Row> ds2 = session.createDataFrame(Arrays.asList(arr2), sch2)
> .withColumn("b", functions.lit(0));
> ds2.show();
> Column[] cols = new Column[]{
> new Column("a"),
> new Column("b").as("b"),
> functions.count(functions.lit(1))
> .over(Window.partitionBy())
> .as("n")};
> Dataset<Row> ds = ds1
> .select(cols)
> .union(ds2.select(cols))
> .where(new Column("n").geq(1))
> .drop("n");
> ds.show();
> //ds.explain(true);
> }
> }
> {code}
> It just calculates the union of 2 datasets,
> {code:java}
> +---+---+
> |  a|  b|
> +---+---+
> |  1| 42|
> |  2| 99|
> +---+---+
> {code}
> with
> {code:java}
> +---+---+
> |  a|  b|
> +---+---+
> |  3|  0|
> +---+---+
> {code}
> The expected result is:
> {code:java}
> +---+---+
> |  a|  b|
> +---+---+
> |  1| 42|
> |  2| 99|
> |  3|  0|
> +---+---+
> {code}
> but instead it prints:
> {code:java}
> +---+---+
> |  a|  b|
> +---+---+
> |  1|  0|
> |  2|  0|
> |  3|  0|
> +---+---+
> {code}
> Notice how the value in column b is always zero, overriding the original 
> values in rows 1 and 2.
>  Making seemingly trivial changes, such as replacing {{new 
> Column("b").as("b"),}} with just {{new Column("b"),}} or removing the 
> {{where}} clause after the union, make it behave correctly again.






[jira] [Resolved] (SPARK-21122) Address starvation issues when dynamic allocation is enabled

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-21122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-21122.
--
Resolution: Incomplete

> Address starvation issues when dynamic allocation is enabled
> 
>
> Key: SPARK-21122
> URL: https://issues.apache.org/jira/browse/SPARK-21122
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Craig Ingram
>Priority: Major
>  Labels: bulk-closed
> Attachments: Preventing Starvation with Dynamic Allocation Enabled.pdf
>
>
> When dynamic resource allocation is enabled on a cluster, it’s currently 
> possible for one application to consume all the cluster’s resources, 
> effectively starving any other application trying to start. This is 
> particularly painful in a notebook environment where notebooks may be idle 
> for tens of minutes while the user is figuring out what to do next (or eating 
> their lunch). Ideally the application should give resources back to the 
> cluster when monitoring indicates other applications are pending.
> Before delving into the specifics of the solution, there are some workarounds 
> to this problem that are worth mentioning:
> * Set spark.dynamicAllocation.maxExecutors to a small value, so that users 
> are unlikely to use the entire cluster even when many of them are doing work. 
> This approach will hurt cluster utilization.
> * If using YARN, enable preemption and have each application (or 
> organization) run in a separate queue. The downside of this is that when YARN 
> preempts, it doesn't know anything about which executor it's killing. It 
> would just as likely kill a long running executor with cached data as one 
> that just spun up. Moreover, given a feature like 
> https://issues.apache.org/jira/browse/SPARK-21097 (preserving cached data on 
> executor decommission), YARN may not wait long enough between trying to 
> gracefully and forcefully shut down the executor. This would mean the blocks 
> that belonged to that executor would be lost and have to be recomputed.
> * Configure YARN to use the capacity scheduler with multiple scheduler 
> queues. Put high-priority notebook users into a high-priority queue. Prevents 
> high-priority users from being starved out by low-priority notebook users. 
> Does not prevent users in the same priority class from starving each other.
> Obviously any solution to this problem that depends on YARN would leave other 
> resource managers out in the cold. The solution proposed in this ticket will 
> afford spark clusters the flexibly to hook in different resource allocation 
> policies to fulfill their user's needs regardless of resource manager choice. 
> Initially the focus will be on users in a notebook environment. When 
> operating in a notebook environment with many users, the goal is fair 
> resource allocation. Given that all users will be using the same memory 
> configuration, this solution will focus primarily on fair sharing of cores.
> The fair resource allocation policy should pick executors to remove based on 
> three factors initially: idleness, presence of cached data, and uptime. The 
> policy will favor removing executors that are idle, short-lived, and have no 
> cached data. The policy will only preemptively remove executors if there are 
> pending applications or cores (otherwise the default dynamic allocation 
> timeout/removal process is followed). The policy will also allow an 
> application's resource consumption to expand based on cluster utilization. 
> For example if there are 3 applications running but 2 of them are idle, the 
> policy will allow a busy application with pending tasks to consume more than 
> 1/3rd of the cluster's resources.
> More complexity could be added to take advantage of task/stage metrics, 
> histograms, and heuristics (i.e. favor removing executors running tasks that 
> are quick). The important thing here is to benchmark effectively before 
> adding complexity so we can measure the impact of the changes.






[jira] [Resolved] (SPARK-20624) Add better handling for node shutdown

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-20624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-20624.
--
Resolution: Incomplete

> Add better handling for node shutdown
> -
>
> Key: SPARK-20624
> URL: https://issues.apache.org/jira/browse/SPARK-20624
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Holden Karau
>Priority: Minor
>  Labels: bulk-closed
>
> While we've done some good work with better handling when Spark is choosing 
> to decommission nodes (SPARK-7955), it might make sense in environments where 
> we get preempted without our own choice (e.g. YARN over-commit, EC2 spot 
> instances, GCE Preemptiable instances, etc.) to do something for the data on 
> the node (or at least not schedule any new tasks).






[jira] [Resolved] (SPARK-22963) Make failure recovery global and automatic for continuous processing.

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-22963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-22963.
--
Resolution: Incomplete

> Make failure recovery global and automatic for continuous processing.
> -
>
> Key: SPARK-22963
> URL: https://issues.apache.org/jira/browse/SPARK-22963
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Jose Torres
>Priority: Major
>  Labels: bulk-closed
>
> Spark native task restarts don't work well for continuous processing. They 
> will process all data from the task's original start offset - even data which 
> has already been committed. This is not semantically incorrect under at least 
> once semantics, but it's awkward and bad.
> Fortunately, they're also not necessary; the central coordinator can restart 
> every task from the checkpointed offsets without losing much. So we should 
> make that happen automatically on task failures instead.






[jira] [Resolved] (SPARK-24733) Dataframe saved to parquet can have different metadata then the resulting parquet file

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24733.
--
Resolution: Incomplete

> Dataframe saved to parquet can have different metadata then the resulting 
> parquet file
> --
>
> Key: SPARK-24733
> URL: https://issues.apache.org/jira/browse/SPARK-24733
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: David Herskovics
>Priority: Minor
>  Labels: bulk-closed
>
> See the repro using spark-shell below:
> Let's say that we have a dataframe called *df_with_metadata* which has column 
> *name* with metadata.
>  
> {code:scala}
> scala> df_with_metadata.schema.json // Check that we have the metadata here.
> scala> df_with_metadata.createOrReplaceTempView("input")
> scala> val df2 = spark.sql("select case when true then name else null end as 
> name from input")
> scala> df2.schema.json // We don't have the metadata anymore.
> scala> df2.write.parquet("no_metadata_expected")
> scala> val df3 = spark.read.parquet("no_metadata_expected")
> scala> df3.schema.json // And the metadata is there again so the 
> no_metadata_expected does have metadata.
> {code}






[jira] [Resolved] (SPARK-21812) PySpark ML Models should not depend transfering params from Java

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-21812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-21812.
--
Resolution: Incomplete

> PySpark ML Models should not depend transfering params from Java
> 
>
> Key: SPARK-21812
> URL: https://issues.apache.org/jira/browse/SPARK-21812
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Holden Karau
>Priority: Major
>  Labels: bulk-closed
>
> After SPARK-10931 we should fix this so that the Python parameters are 
> explicitly defined instead of relying on copying them from Java. This can be 
> done in batches of models as sub-issues, since the number of params to be 
> updated could be quite large.






[jira] [Resolved] (SPARK-24845) spark distribution generate exception while locally worked correctly

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24845.
--
Resolution: Incomplete

> spark distribution generate exception while locally worked correctly
> 
>
> Key: SPARK-24845
> URL: https://issues.apache.org/jira/browse/SPARK-24845
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 2.1.3
> Environment: _I set spark.driver.extraClassPath_ and 
> _spark.executor.extraClassPath_ environment per machine in 
> spark-defaults.conf file as: 
> {{/opt/spark/jars/*:/opt/hbase/lib/commons-collections-3.2.2.jar:/opt/hbase/lib/commons-httpclient-3.1.jar:/opt/hbase/lib/findbugs-annotations-1.3.9-1.jar:/opt/hbase/lib/hbase-annotations-1.2.6.jar:/opt/hbase/lib/hbase-annotations-1.2.6-tests.jar:/opt/hbase/lib/hbase-client-1.2.6.jar:/opt/hbase/lib/hbase-common-1.2.6.jar:/opt/hbase/lib/hbase-common-1.2.6-tests.jar:/opt/hbase/lib/hbase-examples-1.2.6.jar:/opt/hbase/lib/hbase-external-blockcache-1.2.6.jar:/opt/hbase/lib/hbase-hadoop2-compat-1.2.6.jar:/opt/hbase/lib/hbase-hadoop-compat-1.2.6.jar:/opt/hbase/lib/hbase-it-1.2.6.jar:/opt/hbase/lib/hbase-it-1.2.6-tests.jar:/opt/hbase/lib/hbase-prefix-tree-1.2.6.jar:/opt/hbase/lib/hbase-procedure-1.2.6.jar:/opt/hbase/lib/hbase-protocol-1.2.6.jar:/opt/hbase/lib/hbase-resource-bundle-1.2.6.jar:/opt/hbase/lib/hbase-rest-1.2.6.jar:/opt/hbase/lib/hbase-server-1.2.6.jar:/opt/hbase/lib/hbase-server-1.2.6-tests.jar:/opt/hbase/lib/hbase-shell-1.2.6.jar:/opt/hbase/lib/hbase-thrift-1.2.6.jar:/opt/hbase/lib/jetty-util-6.1.26.jar:/opt/hbase/lib/ruby/hbase:/opt/hbase/lib/ruby/hbase/hbase.rb:/opt/hbase/lib/ruby/hbase.rb:/opt/hbase/lib/protobuf-java-2.5.0.jar:/opt/hbase/lib/metrics-core-2.2.0.jar:/opt/hbase/lib/htrace-core-3.1.0-incubating.jar:/opt/hbase/lib/guava-12.0.1.jar:/opt/hbase/lib/asm-3.1.jar:/opt/hbase/lib/Cdrpackage.jar:/opt/hbase/lib/commons-daemon-1.0.13.jar:/opt/hbase/lib/commons-el-1.0.jar:/opt/hbase/lib/commons-math-2.2.jar:/opt/hbase/lib/disruptor-3.3.0.jar:/opt/hbase/lib/jamon-runtime-2.4.1.jar:/opt/hbase/lib/jasper-compiler-5.5.23.jar:/opt/hbase/lib/jasper-runtime-5.5.23.jar:/opt/hbase/lib/jaxb-impl-2.2.3-1.jar:/opt/hbase/lib/jcodings-1.0.8.jar:/opt/hbase/lib/jersey-core-1.9.jar:/opt/hbase/lib/jersey-guice-1.9.jar:/opt/hbase/lib/jersey-json-1.9.jar:/opt/hbase/lib/jettison-1.3.3.jar:/opt/hbase/lib/jetty-sslengine-6.1.26.jar:/opt/hbase/lib/joni-2.1.2.jar:/opt/hbase/lib/jruby-complete-1.6.8.jar:/opt/hbase/lib/jsch-0.1.42.jar:/opt/hbase/lib/jsp-2.1-6.1.14.jar:/opt/hbase/lib/junit-4.12.jar:/opt/hbase/lib/servlet-api-2.5-6.1.14.jar:/opt/hbase/lib/servlet-api-2.5.jar:/opt/hbase/lib/spymemcached-2.11.6.jar:/opt/hive-hbase//opt/hive-hbase/hive-hbase-handler-2.0.1.jar}}
>Reporter: Hossein Vatani
>Priority: Major
>  Labels: bulk-closed
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> We tried to read HBase table data with a distributed Spark on three servers.
> OS: ubuntu 14.04
> hadoop 2.7.3
>  hbase 1.2.6
>  First I launch the Spark shell with the +spark-shell --master spark://master:7077+ 
> command and run:
> import org.apache.hadoop.hbase.util.Bytes
> import org.apache.hadoop.hbase.client.{HBaseAdmin, Result, Put, HTable}
> import org.apache.hadoop.hbase.{ HBaseConfiguration, HTableDescriptor, 
>   HColumnDescriptor }
> import org.apache.hadoop.hbase.mapreduce.TableInputFormat
> import org.apache.hadoop.hbase.io.ImmutableBytesWritable
> import org.apache.hadoop.hbase.client.TableDescriptor
> import org.apache.spark._
> import org.apache.spark.rdd.NewHadoopRDD
> import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
> import org.apache.hadoop.hbase.client.HBaseAdmin
> import org.apache.hadoop.hbase.mapreduce.TableInputFormat
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.hbase.HColumnDescriptor
> import org.apache.hadoop.hbase.util.Bytes
> import org.apache.hadoop.hbase.client.Put;
> import org.apache.hadoop.hbase.client.HTable;
> import org.apache.hadoop.conf.Configuration
> import scala.collection.JavaConverters._
> val conf = HBaseConfiguration.create()
> val tablename = "default:Table1"
> conf.set(TableInputFormat.INPUT_TABLE,tablename)
> val admin = new HBaseAdmin(conf)
> admin.isTableAvailable(tablename) <-- it return true, it 
> val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], 
>   classOf[ImmutableBytesWritable], classOf[Result])
> hBaseRDD.count()
> and it generated the exception below:
> java.lang.IllegalStateException: unread block data
> at 
> java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2776)
> at 

[jira] [Resolved] (SPARK-22780) make insert commands have real children to fix UI issues

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-22780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-22780.
--
Resolution: Incomplete

> make insert commands have real children to fix UI issues
> 
>
> Key: SPARK-22780
> URL: https://issues.apache.org/jira/browse/SPARK-22780
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: bulk-closed
>







[jira] [Resolved] (SPARK-24650) GroupingSet

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24650.
--
Resolution: Incomplete

> GroupingSet
> ---
>
> Key: SPARK-24650
> URL: https://issues.apache.org/jira/browse/SPARK-24650
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
> Environment: CDH 5.X, Spark 2.3
>Reporter: Mihir Sahu
>Priority: Major
>  Labels: Grouping, Sets, bulk-closed
>
> If a grouping set is used in Spark SQL, the plan does not perform optimally.
> If the input to a grouping set is x rows and the grouping set has y groups, then 
> the number of rows that are processed is currently x*y.
> Example: let a DataFrame have columns col1, col2, col3, and col4, let the number 
> of rows be rowNo,
> and let the grouping sets consist of: (1) col1, col2, col3 (2) col2, col4 (3) col1, col2.
> The number of rows processed in such a case is 3*(rowNo * size of each row).
> However, is this the optimal way of processing the data?
> If the y groups are derivable from each other, can we reduce the volume 
> processed by removing columns as we progress to the lower dimensions of 
> processing?
> Currently, while processing percentiles, a lot of data seems to be 
> processed, causing performance issues.
> We need to look at whether this can be optimised.
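For reference, a sketch of the query shape in question (hypothetical table t with columns col1..col4):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("grouping-sets").getOrCreate()

// Each input row is currently expanded once per grouping set (3x here)
// before aggregation, which is the x*y cost described above.
spark.sql("""
  SELECT col1, col2, col3, col4, count(*)
  FROM t
  GROUP BY col1, col2, col3, col4
  GROUPING SETS ((col1, col2, col3), (col2, col4), (col1, col2))
""").show()
{code}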






[jira] [Resolved] (SPARK-24623) Hadoop - Spark Cluster - Python XGBoost - Not working in distributed mode

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24623.
--
Resolution: Incomplete

> Hadoop - Spark Cluster - Python XGBoost - Not working in distributed mode
> -
>
> Key: SPARK-24623
> URL: https://issues.apache.org/jira/browse/SPARK-24623
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.1.1
> Environment: Hadoop - Hortonworks Cluster
>  
> Total Nodes - 18
> Worker Nodes - 13
>Reporter: Abhishek Reddy Chamakura
>Priority: Major
>  Labels: bulk-closed
>
> Hi
> We recently installed Python on the Hadoop cluster with a lot of data science 
> Python modules, including xgboost, scipy, scikit-learn, and pandas.
> Using PySpark, the data scientists are able to test their scoring models in 
> distributed mode on the Hadoop cluster. But with Python xgboost the 
> PySpark job is not getting distributed and it tries to run on only one 
> instance.
> We are trying to achieve distributed mode when using Python xgboost via 
> PySpark. 
> It would be a great help if you can direct me on how to achieve this.
> Thanks,
> Abhishek






[jira] [Resolved] (SPARK-24265) lintr checks not failing PR build

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24265.
--
Resolution: Incomplete

> lintr checks not failing PR build
> -
>
> Key: SPARK-24265
> URL: https://issues.apache.org/jira/browse/SPARK-24265
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Felix Cheung
>Priority: Major
>  Labels: bulk-closed
>
> A few lintr violations went through recently; we need to check why they are not 
> flagged by the Jenkins build.






[jira] [Resolved] (SPARK-19498) Discussion: Making MLlib APIs extensible for 3rd party libraries

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-19498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-19498.
--
Resolution: Incomplete

> Discussion: Making MLlib APIs extensible for 3rd party libraries
> 
>
> Key: SPARK-19498
> URL: https://issues.apache.org/jira/browse/SPARK-19498
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>Priority: Critical
>  Labels: bulk-closed
>
> Per the recent discussion on the dev list, this JIRA is for discussing how we 
> can make MLlib DataFrame-based APIs more extensible, especially for the 
> purpose of writing 3rd-party libraries with APIs extended from the MLlib APIs 
> (for custom Transformers, Estimators, etc.).
> * For people who have written such libraries, what issues have you run into?
> * What APIs are not public or extensible enough?  Do they require changes 
> before being made more public?
> * Are APIs for non-Scala languages such as Java and Python friendly or 
> extensive enough?
> The easy answer is to make everything public, but that would be terrible of 
> course in the long-term.  Let's discuss what is needed and how we can present 
> stable, sufficient, and easy-to-use APIs for 3rd-party developers.






[jira] [Resolved] (SPARK-8696) Streaming API for Online LDA

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-8696.
-
Resolution: Incomplete

> Streaming API for Online LDA
> 
>
> Key: SPARK-8696
> URL: https://issues.apache.org/jira/browse/SPARK-8696
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: yuhao yang
>Priority: Major
>  Labels: bulk-closed
>
> Streaming LDA can be a natural extension of online LDA. 
> Yet for now we need to settle the implementation of LDA prediction, in order to 
> support the predictOn method in the streaming version.






[jira] [Resolved] (SPARK-8767) Abstractions for InputColParam, OutputColParam

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-8767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-8767.
-
Resolution: Incomplete

> Abstractions for InputColParam, OutputColParam
> --
>
> Key: SPARK-8767
> URL: https://issues.apache.org/jira/browse/SPARK-8767
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Major
>  Labels: bulk-closed
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> I'd like to create Param subclasses for output and input columns.  These will 
> provide easier schema checking, which could even be done automatically in an 
> abstraction rather than in each class.  That should simplify things for 
> developers.






[jira] [Resolved] (SPARK-24016) Yarn does not update node blacklist in static allocation

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24016.
--
Resolution: Incomplete

> Yarn does not update node blacklist in static allocation
> 
>
> Key: SPARK-24016
> URL: https://issues.apache.org/jira/browse/SPARK-24016
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, YARN
>Affects Versions: 2.3.0
>Reporter: Imran Rashid
>Priority: Major
>  Labels: bulk-closed
>
> Task-based blacklisting keeps track of bad nodes, and updates YARN with that 
> set of nodes so that Spark will not receive more containers on those nodes.  
> However, that only happens with dynamic allocation.  Though it's far more 
> important with dynamic allocation, this matters even with static allocation; 
> if executors die, or if the cluster was too busy at the original resource 
> request to give all the containers, the Spark application will add new 
> containers in the middle.  And we want an updated node blacklist for that.






[jira] [Resolved] (SPARK-25232) Support Full-Text Search in Spark SQL

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25232.
--
Resolution: Incomplete

> Support Full-Text Search in Spark SQL
> -
>
> Key: SPARK-25232
> URL: https://issues.apache.org/jira/browse/SPARK-25232
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Lijie Xu
>Priority: Major
>  Labels: bulk-closed
>
> Full-text search (i.e., keyword search) is widely used in search engines and 
> relational databases such as MATCH() ... AGAINST operator in MySQL 
> (https://dev.mysql.com/doc/en/fulltext-search.html), Text query in Oracle 
> (https://docs.oracle.com/cd/B28359_01/text.111/b28303/query.htm#g1016054), 
> and text search in PostgreSQL 
> (https://www.postgresql.org/docs/9.5/static/textsearch.html). However, it is 
> not natively supported in Spark SQL. We propose an approach to implement this 
> full-text search in Spark SQL.
> Our proposed approach is detailed  at 
> [https://github.com/JerryLead/Misc/blob/master/FullTextSearch/Full-text-issue-2018.pdf]
> and the prototype is available at 
> [https://github.com/bigdata-iscas/SparkFullTextQuery/tree/like_explorer]






[jira] [Resolved] (SPARK-24106) Spark Structure Streaming with RF model taking long time in processing probability for each mini batch

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24106.
--
Resolution: Incomplete

> Spark Structure Streaming with RF model taking long time in processing 
> probability for each mini batch
> --
>
> Key: SPARK-24106
> URL: https://issues.apache.org/jira/browse/SPARK-24106
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0, 2.2.1, 2.3.0
> Environment: Spark yarn / Standalone cluster
> 2 master nodes - 32 cores - 124 GB
> 9 worker nodes - 32 cores - 124 GB
> Kafka input and output topic with 6 partition
>Reporter: Tamilselvan Veeramani
>Priority: Major
>  Labels: bulk-closed, performance
>
> The RandomForestClassificationModel is broadcast to the executors for every 
> mini batch in Spark Streaming when computing the probability column.
> RF model size: 45 MB
> Spark Kafka streaming job jar size: 8 MB (including Kafka dependencies)
> The following log shows the model being broadcast to the executors for every 
> mini batch when we call rf_model.transform(dataset).select("probability").
> Because of this, task deserialization time increases to 6 to 7 seconds for 
> the 45 MB RF model, although processing time is just 400 to 600 ms per mini 
> batch.
> 18/04/15 03:21:23 INFO KafkaSource: GetBatch generating RDD of offset range: 
> KafkaSourceRDDOffsetRange(Kafka_input_topic-0,242,242,Some(executor_xx.xxx.xx.110_2)),
>  
> KafkaSourceRDDOffsetRange(Kafka_input_topic-1,239,239,Some(executor_xx.xxx.xx.107_0)),
>  
> KafkaSourceRDDOffsetRange(Kafka_input_topic-2,241,241,Some(executor_xx.xxx.xx.102_3)),
>  
> KafkaSourceRDDOffsetRange(Kafka_input_topic-3,238,239,Some(executor_xx.xxx.xx.138_4)),
>  
> KafkaSourceRDDOffsetRange(Kafka_input_topic-4,240,240,Some(executor_xx.xxx.xx.137_1)),
>  
> KafkaSourceRDDOffsetRange(Kafka_input_topic-5,242,242,Some(executor_xx.xxx.xx.111_5))
>  18/04/15 03:21:24 INFO SparkContext: Starting job: start at App.java:106
> 18/04/15 03:21:31 INFO BlockManagerInfo: Added broadcast_92_piece0 in memory 
> on xx.xxx.xx.137:44682 (size: 62.6 MB, free: 1942.0 MB)
> After 2 to 3 weeks of struggle, I found a potential solution which will help 
> many people who are looking to use an RF model for "probability" in a 
> real-time streaming context.
> The transformImpl method of the RandomForestClassificationModel class 
> implements only "prediction" in the current version of Spark. The same 
> mechanism can be leveraged to implement "probability" in 
> RandomForestClassificationModel's transformImpl method as well.
> I have modified the code accordingly on our server and it now runs as fast as 
> 400 ms to 500 ms for every mini batch.
> I see many people out there facing this issue with no solution provided in 
> any of the forums. Can you please review and include this fix in the next 
> release? Thanks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24568) Code refactoring for DataType equalsXXX methods

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24568.
--
Resolution: Incomplete

> Code refactoring for DataType equalsXXX methods
> ---
>
> Key: SPARK-24568
> URL: https://issues.apache.org/jira/browse/SPARK-24568
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wei Xue
>Priority: Major
>  Labels: bulk-closed
>
> Right now there is a lot of code duplication between the DataType equalsXXX 
> methods: {{equalsIgnoreNullability}}, {{equalsIgnoreCaseAndNullability}}, 
> {{equalsIgnoreCompatibleNullability}}, {{equalsStructurally}}. We can replace 
> the duplicated code with a helper function.
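> A minimal sketch of what such a helper could look like (the method name and 
> flags below are illustrative assumptions, not the actual refactoring):
> {code:java}
> import org.apache.spark.sql.types._
>
> // Sketch only: one recursive comparison with flags for nullability and
> // field-name case, which the public equalsXXX methods could delegate to.
> def equalsWith(
>     left: DataType,
>     right: DataType,
>     ignoreNullability: Boolean,
>     ignoreCase: Boolean): Boolean = (left, right) match {
>   case (ArrayType(le, ln), ArrayType(re, rn)) =>
>     (ignoreNullability || ln == rn) &&
>       equalsWith(le, re, ignoreNullability, ignoreCase)
>   case (MapType(lk, lv, ln), MapType(rk, rv, rn)) =>
>     (ignoreNullability || ln == rn) &&
>       equalsWith(lk, rk, ignoreNullability, ignoreCase) &&
>       equalsWith(lv, rv, ignoreNullability, ignoreCase)
>   case (StructType(lf), StructType(rf)) =>
>     lf.length == rf.length && lf.zip(rf).forall { case (l, r) =>
>       (if (ignoreCase) l.name.equalsIgnoreCase(r.name) else l.name == r.name) &&
>         (ignoreNullability || l.nullable == r.nullable) &&
>         equalsWith(l.dataType, r.dataType, ignoreNullability, ignoreCase)
>     }
>   case _ => left == right
> }
> {code}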



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24862) Spark Encoder is not consistent to scala case class semantic for multiple argument lists

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24862.
--
Resolution: Incomplete

> Spark Encoder is not consistent to scala case class semantic for multiple 
> argument lists
> 
>
> Key: SPARK-24862
> URL: https://issues.apache.org/jira/browse/SPARK-24862
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Antonio Murgia
>Priority: Major
>  Labels: bulk-closed
>
> Spark Encoders are not consistent with Scala case class semantics for 
> multiple argument lists.
> For example, if I create a case class with multiple constructor argument lists:
> {code:java}
> case class Multi(x: String)(y: Int){code}
> Scala creates a Product with arity 1, but if I apply 
> {code:java}
> Encoders.product[Multi].schema.printTreeString{code}
> I get
> {code:java}
> root
> |-- x: string (nullable = true)
> |-- y: integer (nullable = false){code}
> That is not consistent and leads to:
> {code:java}
> Error while encoding: java.lang.RuntimeException: Couldn't find y on class 
> it.enel.next.platform.service.events.common.massive.immutable.Multi
> staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
> fromString, assertnotnull(assertnotnull(input[0, 
> it.enel.next.platform.service.events.common.massive.immutable.Multi, 
> true])).x, true) AS x#0
> assertnotnull(assertnotnull(input[0, 
> it.enel.next.platform.service.events.common.massive.immutable.Multi, 
> true])).y AS y#1
> java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: 
> Couldn't find y on class 
> it.enel.next.platform.service.events.common.massive.immutable.Multi
> staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
> fromString, assertnotnull(assertnotnull(input[0, 
> it.enel.next.platform.service.events.common.massive.immutable.Multi, 
> true])).x, true) AS x#0
> assertnotnull(assertnotnull(input[0, 
> it.enel.next.platform.service.events.common.massive.immutable.Multi, 
> true])).y AS y#1
> at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:290)
> at org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:464)
> at org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:464)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.immutable.List.foreach(List.scala:392)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.immutable.List.map(List.scala:296)
> at org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:464)
> at 
> it.enel.next.platform.service.events.common.massive.immutable.ParquetQueueSuite$$anonfun$1.apply$mcV$sp(ParquetQueueSuite.scala:48)
> at 
> it.enel.next.platform.service.events.common.massive.immutable.ParquetQueueSuite$$anonfun$1.apply(ParquetQueueSuite.scala:46)
> at 
> it.enel.next.platform.service.events.common.massive.immutable.ParquetQueueSuite$$anonfun$1.apply(ParquetQueueSuite.scala:46)
> at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
> at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> at org.scalatest.Transformer.apply(Transformer.scala:22)
> at org.scalatest.Transformer.apply(Transformer.scala:20)
> at org.scalatest.FlatSpecLike$$anon$1.apply(FlatSpecLike.scala:1682)
> at org.scalatest.TestSuite$class.withFixture(TestSuite.scala:196)
> at org.scalatest.FlatSpec.withFixture(FlatSpec.scala:1685)
> at 
> org.scalatest.FlatSpecLike$class.invokeWithFixture$1(FlatSpecLike.scala:1679)
> at 
> org.scalatest.FlatSpecLike$$anonfun$runTest$1.apply(FlatSpecLike.scala:1692)
> at 
> org.scalatest.FlatSpecLike$$anonfun$runTest$1.apply(FlatSpecLike.scala:1692)
> at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
> at org.scalatest.FlatSpecLike$class.runTest(FlatSpecLike.scala:1692)
> at org.scalatest.FlatSpec.runTest(FlatSpec.scala:1685)
> at 
> org.scalatest.FlatSpecLike$$anonfun$runTests$1.apply(FlatSpecLike.scala:1750)
> at 
> org.scalatest.FlatSpecLike$$anonfun$runTests$1.apply(FlatSpecLike.scala:1750)
> at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396)
> at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
> at scala.collection.immutable.List.foreach(List.scala:392)
> at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
> at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:373)
> at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:410)
> at 
> 

[jira] [Resolved] (SPARK-24461) Snapshot Cache

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24461.
--
Resolution: Incomplete

> Snapshot Cache
> --
>
> Key: SPARK-24461
> URL: https://issues.apache.org/jira/browse/SPARK-24461
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Wei Xue
>Priority: Major
>  Labels: bulk-closed
>
> In some usage scenarios, data staleness is not critical. We can introduce a 
> snapshot cache of the original data to achieve much better performance. 
> Unlike the current cache, it is resolved by name instead of by plan matching. 
> The cache can be rebuilt manually or by events. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20782) Dataset's isCached operator

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-20782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-20782.
--
Resolution: Incomplete

> Dataset's isCached operator
> ---
>
> Key: SPARK-20782
> URL: https://issues.apache.org/jira/browse/SPARK-20782
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jacek Laskowski
>Priority: Trivial
>  Labels: bulk-closed
>
> It'd be very convenient to have an {{isCached}} operator that would say 
> whether a query is cached in memory or not.
> It'd be as simple as the following snippet:
> {code}
> // val q2: DataFrame
> spark.sharedState.cacheManager.lookupCachedData(q2.queryExecution.logical).isDefined
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23068) Jekyll doc build error does not fail build

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23068.
--
Resolution: Incomplete

> Jekyll doc build error does not fail build
> --
>
> Key: SPARK-23068
> URL: https://issues.apache.org/jira/browse/SPARK-23068
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SparkR
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Priority: Major
>  Labels: bulk-closed
>
> +++ /usr/local/bin/Rscript -e ' if("devtools" %in% 
> rownames(installed.packages())) { library(devtools); 
> devtools::document(pkg="./pkg", roclets=c("rd")) }'
> Error: 'roxygen2' >= 5.0.0 must be installed for this functionality.
> Execution halted
> jekyll 3.7.0 | Error:  R doc generation failed
> See SPARK-23065



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20885) JDBC predicate pushdown uses hardcoded date format

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-20885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-20885.
--
Resolution: Incomplete

> JDBC predicate pushdown uses hardcoded date format
> --
>
> Key: SPARK-20885
> URL: https://issues.apache.org/jira/browse/SPARK-20885
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Peter Halverson
>Priority: Minor
>  Labels: bulk-closed
>
> If a date literal is used in a pushed-down filter expression, e.g.
> {code}
> val postingDate = java.sql.Date.valueOf("2016-06-03")
> val count = jdbcDF.filter($"POSTINGDATE" === postingDate).count
> {code}
> where the {{POSTINGDATE}} column is of JDBC type Date, the resulting 
> pushed-down SQL query looks like the following:
> {code}
> SELECT ..  ... FROM  WHERE POSTINGDATE = '2016-06-03'
> {code}
> Specifically, the date is compiled into a string literal using the hardcoded 
> yyyy-MM-dd format that {{java.sql.Date.toString}} emits. Note the implied 
> string conversion for date (and timestamp) values in {{JDBCRDD.compileValue}}
> {code}
>   /**
>* Converts value to SQL expression.
>*/
>   private def compileValue(value: Any): Any = value match {
> case stringValue: String => s"'${escapeSql(stringValue)}'"
> case timestampValue: Timestamp => "'" + timestampValue + "'"
> case dateValue: Date => "'" + dateValue + "'"
> case arrayValue: Array[Any] => arrayValue.map(compileValue).mkString(", ")
> case _ => value
>   }
> {code}
> The resulting query fails if the database is expecting a different format for 
> date string literals. For example, the default format for Oracle is 
> 'dd-MMM-yy', so when the relation query is executed, it fails with a syntax 
> error.
> {code}
> ORA-01861: literal does not match format string
> 01861. 0 -  "literal does not match format string"
> {code}
> In some situations it may be possible to change the database's expected date 
> format to match the Java format, but this is not always possible (e.g. 
> reading from an external database server)
> Shouldn't this kind of conversion be going through some kind of vendor 
> specific translation (e.g. through a {{JDBCDialect}})?
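> A rough sketch of such a dialect hook ({{compileDateLiteral}} is a 
> hypothetical method, not part of the existing {{JdbcDialect}} API):
> {code:java}
> import java.sql.Date
> import org.apache.spark.sql.jdbc.JdbcDialect
>
> // Sketch only: lets each dialect decide how a java.sql.Date is rendered as a
> // SQL literal instead of relying on Date.toString everywhere.
> abstract class DateAwareDialect extends JdbcDialect {
>   // Default keeps today's behaviour: 'yyyy-MM-dd' via Date.toString.
>   def compileDateLiteral(d: Date): String = s"'$d'"
> }
>
> // An Oracle-flavoured dialect could instead emit an explicit conversion so it
> // does not depend on the session's NLS_DATE_FORMAT, e.g.:
> //   override def compileDateLiteral(d: Date): String =
> //     s"TO_DATE('$d', 'YYYY-MM-DD')"
> {code}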



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24597) Spark ML Pipeline Should support non-linear models => DAGPipeline

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24597.
--
Resolution: Incomplete

> Spark ML Pipeline Should support non-linear models => DAGPipeline
> -
>
> Key: SPARK-24597
> URL: https://issues.apache.org/jira/browse/SPARK-24597
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.3.1
>Reporter: Michael Dreibelbis
>Priority: Minor
>  Labels: bulk-closed
>
> Currently, SparkML Pipeline/PipelineModel only supports a single linear chain 
> of dataset transformations, despite the documentation stating otherwise:
> [reference 
> documentation|https://spark.apache.org/docs/2.3.0/ml-pipeline.html#details] 
> I'm proposing to implement a DAGPipeline that supports multiple datasets as 
> input.
> The code could look something like this:
>  
> {code:java}
> val ds1 = /*dataset 1 creation*/
> val ds2 = /*dataset 2 creation*/
> // nodes take on uid from estimator/transformer
> val i1 = IdentityNode(new IdentityTransformer("i1"))
> val i2 = IdentityNode(new IdentityTransformer("i2"))
> val bi = TransformerNode(new Binarizer("bi"))
> val cv = EstimatorNode(new CountVectorizer("cv"))
> val idf = EstimatorNode(new IDF("idf"))
> val j1 = JoinerNode(new Joiner("j1"))
> val nodes = Array(i1, i2, bi, cv, idf, j1)
> val edges = Array(
> ("i1", "cv"), ("cv", "idf"), ("idf", "j1"), 
> ("i2", "bi"), ("bi", "j1"))
> val p = new DAGPipeline(nodes, edges)
> .setIdentity("i1", ds1)
> .setIdentity("i2", ds2)
> val m = p.fit(spark.emptyDataFrame)
> m.setIdentity("i1", ds1).setIdentity("i2", ds2)
> m.transform(spark.emptyDataFrame)
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17602) PySpark - Performance Optimization Large Size of Broadcast Variable

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-17602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-17602.
--
Resolution: Incomplete

> PySpark - Performance Optimization Large Size of Broadcast Variable
> ---
>
> Key: SPARK-17602
> URL: https://issues.apache.org/jira/browse/SPARK-17602
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.6.2, 2.0.0
> Environment: Linux
>Reporter: Xiao Ming Bao
>Priority: Major
>  Labels: bulk-closed
> Attachments: PySpark – Performance Optimization for Large Size of 
> Broadcast variable.pdf
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> Problem: currently, on the executor side, the broadcast variable is written 
> to disk as a file, and each Python worker process reads it from local disk 
> and deserializes it into a Python object before executing a task. When the 
> broadcast variable is large, this read/deserialization takes a lot of time, 
> and when the Python worker is NOT reused and the number of tasks is large, 
> performance is very bad since the Python worker needs to read/deserialize it 
> for each task. 
> Brief description of the solution:
> transfer the broadcast variable to the daemon Python process via a file (or 
> socket/mmap) and deserialize it into an object in the daemon Python process. 
> After a worker Python process is forked by the daemon Python process, the 
> worker automatically has the deserialized object and can use it directly 
> thanks to the copy-on-write memory semantics of Linux.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20007) Make SparkR apply() functions robust to workers that return empty data.frame

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-20007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-20007.
--
Resolution: Incomplete

> Make SparkR apply() functions robust to workers that return empty data.frame
> 
>
> Key: SPARK-20007
> URL: https://issues.apache.org/jira/browse/SPARK-20007
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Hossein Falaki
>Priority: Major
>  Labels: bulk-closed
>
> When using {{gapply()}} (or other members of the {{apply()}} family) with a 
> schema, Spark will try to parse data returned from the R process on each 
> worker as Spark DataFrame Rows based on the schema. In this case our provided 
> schema says that we have six columns. When an R worker returns results to the 
> JVM, SparkSQL will try to access its columns one by one and cast them to the 
> proper types. If an R worker returns nothing, the JVM will throw an 
> {{ArrayIndexOutOfBoundsException}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17129) Support statistics collection and cardinality estimation for partitioned tables

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-17129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-17129.
--
Resolution: Incomplete

> Support statistics collection and cardinality estimation for partitioned 
> tables
> ---
>
> Key: SPARK-17129
> URL: https://issues.apache.org/jira/browse/SPARK-17129
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
>Priority: Major
>  Labels: bulk-closed
>
> Support statistics collection and cardinality estimation for partitioned 
> tables.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22054) Allow release managers to inject their keys

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-22054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-22054.
--
Resolution: Incomplete

> Allow release managers to inject their keys
> ---
>
> Key: SPARK-22054
> URL: https://issues.apache.org/jira/browse/SPARK-22054
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Holden Karau
>Priority: Major
>  Labels: bulk-closed
>
> Right now the release process signs with Patrick's keys; let's update the 
> scripts to allow the release manager to sign the release as part of the job.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23632) sparkR.session() error with spark packages - JVM is not ready after 10 seconds

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23632.
--
Resolution: Incomplete

> sparkR.session() error with spark packages - JVM is not ready after 10 seconds
> --
>
> Key: SPARK-23632
> URL: https://issues.apache.org/jira/browse/SPARK-23632
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0, 2.2.1, 2.3.0
>Reporter: Jaehyeon Kim
>Priority: Minor
>  Labels: bulk-closed
>
> Hi
> When I execute _sparkR.session()_ with _org.apache.hadoop:hadoop-aws:2.8.2_ 
> as following,
> {code:java}
> library(SparkR, lib.loc=file.path(Sys.getenv('SPARK_HOME'),'R', 'lib'))
> ext_opts <- '-Dhttp.proxyHost=10.74.1.25 -Dhttp.proxyPort=8080 
> -Dhttps.proxyHost=10.74.1.25 -Dhttps.proxyPort=8080'
> sparkR.session(master = "spark://master:7077",
>appName = 'ml demo',
>sparkConfig = list(spark.driver.memory = '2g'), 
>sparkPackages = 'org.apache.hadoop:hadoop-aws:2.8.2',
>spark.driver.extraJavaOptions = ext_opts)
> {code}
> I see a *JVM is not ready after 10 seconds* error. Below are some of the log 
> messages.
> {code:java}
> Ivy Default Cache set to: /home/rstudio/.ivy2/cache
> The jars for the packages stored in: /home/rstudio/.ivy2/jars
> :: loading settings :: url = 
> jar:file:/usr/local/spark-2.2.1/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
> org.apache.hadoop#hadoop-aws added as a dependency
> :: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
>   confs: [default]
>   found org.apache.hadoop#hadoop-aws;2.8.2 in central
> ...
> ...
>   found javax.servlet.jsp#jsp-api;2.1 in central
> Error in sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap,  : 
>   JVM is not ready after 10 seconds
> ...
> ...
>   found joda-time#joda-time;2.9.4 in central
> downloading 
> https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.8.2/hadoop-aws-2.8.2.jar
>  ...
> ...
> ...
>   xmlenc#xmlenc;0.52 from central in [default]
>   -
>   |  |modules||   artifacts   |
>   |   conf   | number| search|dwnlded|evicted|| number|dwnlded|
>   -
>   |  default |   76  |   76  |   76  |   0   ||   76  |   76  |
>   -
> :: retrieving :: org.apache.spark#spark-submit-parent
>   confs: [default]
>   76 artifacts copied, 0 already retrieved (27334kB/56ms)
> {code}
> It's fine if I re-execute it after the package and its dependencies are 
> downloaded.
> I believe it's because of this part - 
> https://github.com/apache/spark/blob/master/R/pkg/R/sparkR.R#L181
> {code:java}
> if (!file.exists(path)) {
>   stop("JVM is not ready after 10 seconds")
> }
> {code}
> I just wonder if it may be possible to make this configurable so that a user 
> can determine how long to wait?
> Thanks.
> Regards
> Jaehyeon



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23452) Extend test coverage to all ORC readers

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23452.
--
Resolution: Incomplete

> Extend test coverage to all ORC readers
> ---
>
> Key: SPARK-23452
> URL: https://issues.apache.org/jira/browse/SPARK-23452
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 2.3.1
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>  Labels: bulk-closed
>
> We have five ORC readers. We should have test coverage for all of the ORC 
> readers.
> - Hive Serde
> - Hive OrcFileFormat
> - Apache ORC Vectorized Wrapper
> - Apache ORC Vectorized Copy
> - Apache ORC MR



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22658) SPIP: TensorFlowOnSpark as a Scalable Deep Learning Lib of Apache Spark

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-22658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-22658.
--
Resolution: Incomplete

> SPIP: TensorFlowOnSpark as a Scalable Deep Learning Lib of Apache Spark
> 
>
> Key: SPARK-22658
> URL: https://issues.apache.org/jira/browse/SPARK-22658
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Andy Feng
>Priority: Major
>  Labels: bulk-closed
> Attachments: SPIP_ TensorFlowOnSpark.pdf
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> TensorFlowOnSpark (TFoS) was released on GitHub for distributed TensorFlow 
> training and inference on Apache Spark clusters. TFoS is designed to:
> * Easily migrate all existing TensorFlow programs with minimum code change;
> * Support all TensorFlow functionalities: synchronous/asynchronous training, 
> model/data parallelism, inference and TensorBoard;
> * Easily integrate with your existing data processing pipelines (ex. Spark 
> SQL) and machine learning algorithms (ex. MLlib);
> * Be easily deployed on cloud or on-premise: CPU & GPU, Ethernet and 
> Infiniband.
> We propose to merge TFoS into Apache Spark as a scalable deep learning 
> library to:
> * Make deep learning easy for Apache Spark community: Familiar pipeline API 
> for training and inference; Enable TensorFlow training/inference on existing 
> Spark clusters.
> * Further simplify data scientist experience: Ensure compatibility b/w Apache 
> Spark and TFoS; Reduce steps for installation.
> * Help Apache Spark evolve on deep learning: establish a design pattern for 
> additional frameworks (e.g. Caffe, CNTK); Structured Streaming for DL 
> training/inference.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15694) Implement ScriptTransformation in sql/core

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-15694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-15694.
--
Resolution: Incomplete

> Implement ScriptTransformation in sql/core
> --
>
> Key: SPARK-15694
> URL: https://issues.apache.org/jira/browse/SPARK-15694
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Major
>  Labels: bulk-closed
>
> ScriptTransformation currently relies on Hive internals. It'd be great if we 
> can implement a native ScriptTransformation in sql/core module to remove the 
> extra Hive dependency here.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21166) Automated ML persistence

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-21166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-21166.
--
Resolution: Incomplete

> Automated ML persistence
> 
>
> Key: SPARK-21166
> URL: https://issues.apache.org/jira/browse/SPARK-21166
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>Priority: Major
>  Labels: bulk-closed
>
> This JIRA is for discussing the possibility of automating ML persistence.  
> Currently, custom save/load methods are written for every Model.  However, we 
> could design a mixin which provides automated persistence, inspecting model 
> data and Params and reading/writing (known types) automatically.  This was 
> brought up in discussions with developers behind 
> https://github.com/azure/mmlspark
> Some issues we will need to consider:
> * Providing generic mixin usable in most or all cases
> * Handling corner cases (strange Param types, etc.)
> * Backwards compatibility (loading models saved by old Spark versions)
> Because of backwards compatibility in particular, it may make sense to 
> implement testing for that first, before we try to address automated 
> persistence: [SPARK-15573]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23777) Missing DAG arrows between stages

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23777.
--
Resolution: Incomplete

> Missing DAG arrows between stages
> -
>
> Key: SPARK-23777
> URL: https://issues.apache.org/jira/browse/SPARK-23777
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.3.0, 2.3.0
>Reporter: Stefano Pettini
>Priority: Trivial
>  Labels: bulk-closed
> Attachments: Screenshot-2018-3-23 RDDTestApp - Details for Job 0.png
>
>
> In the Spark UI DAGs, sometimes there are missing arrows between stages. It 
> seems to happen when the same RDD is shuffled twice.
> For example in this case:
> {{val a = context.parallelize(List(10, 20, 30, 40, 50)).keyBy(_ / 10)}}
> {{val b = a join a}}
> {{b.collect()}}
> There's a missing arrow from stage 1 to 2.
> _This is an old one, since 1.3.0 at least, still reproducible in 2.3.0._



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23227) Add user guide entry for collecting sub models for cross-validation classes

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23227.
--
Resolution: Incomplete

> Add user guide entry for collecting sub models for cross-validation classes
> ---
>
> Key: SPARK-23227
> URL: https://issues.apache.org/jira/browse/SPARK-23227
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Nicholas Pentreath
>Priority: Minor
>  Labels: bulk-closed
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24406) Exposing custom spark scala ml transformers in pyspark

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24406.
--
Resolution: Incomplete

> Exposing custom spark scala ml transformers in pyspark 
> ---
>
> Key: SPARK-24406
> URL: https://issues.apache.org/jira/browse/SPARK-24406
> Project: Spark
>  Issue Type: Question
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Pratyush Sharma
>Priority: Minor
>  Labels: bulk-closed
>
> How can I use a custom transformer written in Scala in a PySpark pipeline?
> {code:java}
> class UpperTransformer(override val uid: String)
>   extends UnaryTransformer[String, String, UpperTransformer] {
>
>   def this() = this(Identifiable.randomUID("upper"))
>
>   override protected def validateInputType(inputType: DataType): Unit = {
>     require(inputType == StringType)
>   }
>
>   protected def createTransformFunc: String => String = {
>     _.toUpperCase
>   }
>
>   protected def outputDataType: DataType = StringType
> }{code}
>  
> I want to use this transformer in a PySpark pipeline.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25340) Pushes down Sample beneath deterministic Project

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25340.
--
Resolution: Incomplete

> Pushes down Sample beneath deterministic Project
> 
>
> Key: SPARK-25340
> URL: https://issues.apache.org/jira/browse/SPARK-25340
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Takeshi Yamamuro
>Priority: Minor
>  Labels: bulk-closed
>
> If the computations in a Project are heavy (e.g., UDFs), it is useful to push 
> Sample nodes down beneath deterministic Projects;
> {code}
> scala> spark.range(10).selectExpr("id + 3").sample(0.5).explain(true)
> // without this proposal
> == Analyzed Logical Plan ==
> (id + 3): bigint
> Sample 0.0, 0.5, false, 3370873312340343855
> +- Project [(id#0L + cast(3 as bigint)) AS (id + 3)#2L]
>+- Range (0, 10, step=1, splits=Some(4))
> == Optimized Logical Plan ==
> Sample 0.0, 0.5, false, 3370873312340343855
> +- Project [(id#0L + 3) AS (id + 3)#2L]
>+- Range (0, 10, step=1, splits=Some(4))
> // with this proposal
> == Optimized Logical Plan ==
> Project [(id#0L + 3) AS (id + 3)#2L]
> +- Sample 0.0, 0.5, false, -6519017078291024113
>+- Range (0, 10, step=1, splits=Some(4))
> {code}
> POC: https://github.com/apache/spark/compare/master...maropu:SamplePushdown
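> A rough sketch of what such an optimizer rule could look like (the {{Sample}} 
> field order is assumed from Spark 2.3; this is not the POC code):
> {code:java}
> import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project, Sample}
> import org.apache.spark.sql.catalyst.rules.Rule
>
> // Sketch only: swap Sample and Project when every project expression is
> // deterministic, so heavy projections run on the sampled (smaller) input.
> object PushSampleBeneathProject extends Rule[LogicalPlan] {
>   override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
>     case s @ Sample(_, _, _, _, p @ Project(projectList, child))
>         if projectList.forall(_.deterministic) =>
>       p.copy(child = s.copy(child = child))
>   }
> }
> {code}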



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23744) Memory leak in ReadableChannelFileRegion

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23744.
--
Resolution: Incomplete

> Memory leak in ReadableChannelFileRegion
> 
>
> Key: SPARK-23744
> URL: https://issues.apache.org/jira/browse/SPARK-23744
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Priority: Major
>  Labels: bulk-closed
>
> In the class _ReadableChannelFileRegion_, the _buffer_ is direct memory; we 
> should modify _deallocate_ to free it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23073) Fix incorrect R doc page header for generated sql functions

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23073.
--
Resolution: Incomplete

> Fix incorrect R doc page header for generated sql functions
> ---
>
> Key: SPARK-23073
> URL: https://issues.apache.org/jira/browse/SPARK-23073
> Project: Spark
>  Issue Type: Documentation
>  Components: SparkR
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Felix Cheung
>Priority: Minor
>  Labels: bulk-closed
> Attachments: Screen Shot 2018-01-14 at 11.11.05 AM.png
>
>
> See: the title says 
> {code}
> asc {SparkR}
> {code}
> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc1-docs/_site/api/R/columnfunctions.html
> http://spark.apache.org/docs/latest/api/R/columnfunctions.html
> asc, contains, etc. are functions generated at runtime. Because of that, 
> their doc entries depend on the Generics.R file. Unfortunately, ROxygen2 
> picks the doc page title from the first function name by default, in the 
> presence of any function it can parse.
> An attempt to fix here 
> https://github.com/apache/spark/pull/20263/commits/d433dc930021de85aa338c5017a223bae3526df3#diff-8e3d61ff66c9ffcd6ffb7a8eedc08409R824
> {code}
> #' @rdname columnfunctions
>   #' @export
>  +#' @name NULL
>   setGeneric("asc", function(x) { standardGeneric("asc") })
> {code}
> But it causes a more severe issue that fails CRAN checks
> {code}
> * checking for missing documentation entries ... WARNING
> Undocumented code objects:
>   'asc' 'contains' 'desc' 'getField' 'getItem' 'isNaN' 'isNotNull'
>   'isNull' 'like' 'rlike'
> All user-level objects in a package should have documentation entries.
> See the chapter 'Writing R documentation files' in the 'Writing R
> Extensions' manual.
> * checking for code/documentation mismatches ... OK
> * checking Rd \usage sections ... WARNING
> Objects in \usage without \alias in documentation object 'columnfunctions':
>   'asc' 'contains' 'desc' 'getField' 'getItem' 'isNaN' 'isNull'
>   'isNotNull' 'like' 'rlike'
> {code}
> To follow up we should
> - look for a way to set the doc page title
> - http://spark.apache.org/docs/latest/api/R/columnfunctions.html is really 
> barebone and we should explicitly add a doc page with content (which could 
> also address the first point)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16707) TransportClientFactory.createClient may throw NPE

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-16707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-16707.
--
Resolution: Incomplete

> TransportClientFactory.createClient may throw NPE
> -
>
> Key: SPARK-16707
> URL: https://issues.apache.org/jira/browse/SPARK-16707
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.0
>Reporter: Hong Shen
>Priority: Major
>  Labels: bulk-closed
>
> I have encountered some NullPointerExceptions when calling 
> TransportClientFactory.createClient in my cluster; here is the 
> stack trace.
> {code}
> org.apache.spark.shuffle.FetchFailedException
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:326)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:303)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:53)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:511)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.(TungstenAggregationIterator.scala:686)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:741)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:741)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:337)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:301)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:337)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:301)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:337)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:301)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:337)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:301)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:215)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:744)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:144)
>   at 
> org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:107)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:146)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:126)
>   at 
> org.apache.spark.network.shuffle.ExternalShuffleClient.fetchBlocks(ExternalShuffleClient.java:116)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:155)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchUpToMaxBytes(ShuffleBlockFetcherIterator.scala:319)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:299)
>   ... 32 more
> {code}
> The code is at 
> 

[jira] [Resolved] (SPARK-25244) [Python] Setting `spark.sql.session.timeZone` only partially respected

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25244.
--
Resolution: Incomplete

> [Python] Setting `spark.sql.session.timeZone` only partially respected
> --
>
> Key: SPARK-25244
> URL: https://issues.apache.org/jira/browse/SPARK-25244
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Anton Daitche
>Priority: Major
>  Labels: bulk-closed
>
> The setting `spark.sql.session.timeZone` is respected by PySpark when 
> converting from and to Pandas, as described 
> [here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics].
>  However, when timestamps are converted directly to Python's `datetime` 
> objects, it is ignored and the system's timezone is used.
> This can be checked by the following code snippet
> {code:java}
> import pyspark.sql
> spark = (pyspark
>  .sql
>  .SparkSession
>  .builder
>  .master('local[1]')
>  .config("spark.sql.session.timeZone", "UTC")
>  .getOrCreate()
> )
> df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"])
> df = df.withColumn("ts", df["ts"].astype("timestamp"))
> print(df.toPandas().iloc[0,0])
> print(df.collect()[0][0])
> {code}
> Which for me prints (the exact result depends on the timezone of your system, 
> mine is Europe/Berlin)
> {code:java}
> 2018-06-01 01:00:00
> 2018-06-01 03:00:00
> {code}
> Hence, the method `toPandas` respected the timezone setting (UTC), but the 
> method `collect` ignored it and converted the timestamp to my system's 
> timezone.
> The cause of this behaviour is that the methods `toInternal` and 
> `fromInternal` of PySpark's `TimestampType` class don't take into account the 
> setting `spark.sql.session.timeZone` and use the system timezone.
> If the maintainers agree that this should be fixed, I would try to come up 
> with a patch. 
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23983) Disable X-Frame-Options from Spark UI response headers if explicitly configured

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23983.
--
Resolution: Incomplete

> Disable X-Frame-Options from Spark UI response headers if explicitly 
> configured
> ---
>
> Key: SPARK-23983
> URL: https://issues.apache.org/jira/browse/SPARK-23983
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Taylor Cressy
>Priority: Minor
>  Labels: UI, bulk-closed
>
> We should introduce a configuration for the Spark UI to omit X-Frame-Options 
> from the response headers if explicitly configured to do so.
> The X-Frame-Options header was introduced in *org.apache.spark.ui.JettyUtils* 
> to prevent frame-related click-jacking vulnerabilities. This was addressed 
> in: SPARK-10589
>  
> {code:java}
> val allowFramingFrom = conf.getOption("spark.ui.allowFramingFrom")
> val xFrameOptionsValue =
>allowFramingFrom.map(uri => s"ALLOW-FROM $uri").getOrElse("SAMEORIGIN")
> ...
> // In doGet
> response.setHeader("X-Frame-Options", xFrameOptionsValue)
> {code}
>  
> The problem with this is that we only allow the same origin or a single host 
> to embed the UI in iframes. I propose we add a configuration that turns this 
> off.
>  
> Use Case: We are currently building a "portal UI" for all things related to a 
> cluster. Embedding the Spark UI in the portal is necessary because the 
> cluster is in the cloud and can only be accessed via an SSH tunnel - as 
> intended. (The reverse proxy configuration {{spark.ui.reverseProxy}} could be 
> used to simplify connecting to all the workers, but this doesn't solve 
> handling multiple, unrelated UIs through a single tunnel.)
>  
> Moreover, the host that our "portal UI" would reside on is not assigned a 
> hostname and has an ephemeral IP address, so the *ALLOW-FROM* directive isn't 
> useful in this case.
>  
> Lastly, the current design does not allow for different hosts to be 
> configured, i.e. *_spark.ui.allowFramingFrom_* _*hostname1,hostname2*_ is not 
> a valid config.
>  
> An alternative option would be to explore Content-Security-Policy: 
> [https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Security-Policy#frame-ancestors]
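> A minimal sketch of what the opt-out could look like 
> ({{spark.ui.xFrameOptions.disabled}} is a hypothetical config name, not an 
> existing one):
> {code:java}
> import javax.servlet.http.HttpServletResponse
> import org.apache.spark.SparkConf
>
> // Sketch only: skip the X-Frame-Options header entirely when the hypothetical
> // opt-out flag is set; otherwise keep the current SAMEORIGIN/ALLOW-FROM logic.
> def setFrameOptionsHeader(conf: SparkConf, response: HttpServletResponse): Unit = {
>   val disabled = conf.getBoolean("spark.ui.xFrameOptions.disabled", false)
>   if (!disabled) {
>     val allowFramingFrom = conf.getOption("spark.ui.allowFramingFrom")
>     val xFrameOptionsValue =
>       allowFramingFrom.map(uri => s"ALLOW-FROM $uri").getOrElse("SAMEORIGIN")
>     response.setHeader("X-Frame-Options", xFrameOptionsValue)
>   }
> }
> {code}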



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12449) Pushing down arbitrary logical plans to data sources

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-12449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-12449.
--
Resolution: Incomplete

> Pushing down arbitrary logical plans to data sources
> 
>
> Key: SPARK-12449
> URL: https://issues.apache.org/jira/browse/SPARK-12449
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Stephan Kessler
>Priority: Major
>  Labels: bulk-closed
> Attachments: pushingDownLogicalPlans.pdf
>
>
> With the help of the DataSource API we can pull data from external sources 
> for processing. Implementing interfaces such as {{PrunedFilteredScan}} allows 
> to push down filters and projects pruning unnecessary fields and rows 
> directly in the data source.
> However, data sources such as SQL Engines are capable of doing even more 
> preprocessing, e.g., evaluating aggregates. This is beneficial because it 
> would reduce the amount of data transferred from the source to Spark. The 
> existing interfaces do not allow such kind of processing in the source.
> We would propose to add a new interface {{CatalystSource}} that allows to 
> defer the processing of arbitrary logical plans to the data source. We have 
> already shown the details at the Spark Summit 2015 Europe 
> [https://spark-summit.org/eu-2015/events/the-pushdown-of-everything/]
> I will add a design document explaining details. 
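> For illustration, a minimal sketch of what such an interface could look like 
> (the names and signatures are illustrative; the design document contains the 
> real proposal):
> {code:java}
> import org.apache.spark.rdd.RDD
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
>
> // Sketch only: a relation mixing this in tells Spark which (sub)plans it can
> // evaluate itself, and executes the pushed-down plan on the source side.
> trait CatalystSource {
>   /** Whether the source can evaluate the given logical plan (or subtree). */
>   def supportsLogicalPlan(plan: LogicalPlan): Boolean
>
>   /** Runs the pushed-down plan in the source and returns the resulting rows. */
>   def logicalPlanToRDD(plan: LogicalPlan): RDD[Row]
> }
> {code}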



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21972) Allow users to control input data persistence in ML Estimators via a handlePersistence ml.Param

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-21972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-21972.
--
Resolution: Incomplete

> Allow users to control input data persistence in ML Estimators via a 
> handlePersistence ml.Param
> ---
>
> Key: SPARK-21972
> URL: https://issues.apache.org/jira/browse/SPARK-21972
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.2.0
>Reporter: Siddharth Murching
>Priority: Major
>  Labels: bulk-closed
>
> Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, 
> etc) call {{cache()}} on uncached input datasets to improve performance.
> Unfortunately, these algorithms a) check input persistence inaccurately 
> ([SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) 
> check the persistence level of the input dataset but not any of its parents. 
> These issues can result in unwanted double-caching of input data & degraded 
> performance (see 
> [SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]).
> This ticket proposes adding a boolean {{handlePersistence}} param 
> (org.apache.spark.ml.param) so that users can specify whether an ML algorithm 
> should try to cache un-cached input data. {{handlePersistence}} will be 
> {{true}} by default, corresponding to existing behavior (always persisting 
> uncached input), but users can achieve finer-grained control over input 
> persistence by setting {{handlePersistence}} to {{false}}.
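> A minimal sketch of what the shared param and its use inside {{fit()}} could 
> look like (the trait and param handling below are illustrative, not the final 
> API):
> {code:java}
> import org.apache.spark.ml.param.{BooleanParam, Params}
>
> // Sketch only: a shared param trait plus the caching pattern an Estimator
> // could follow.
> trait HasHandlePersistence extends Params {
>   final val handlePersistence: BooleanParam = new BooleanParam(this,
>     "handlePersistence", "whether to cache un-cached input data before fitting")
>   setDefault(handlePersistence -> true)
>   final def getHandlePersistence: Boolean = $(handlePersistence)
> }
>
> // Inside an Estimator's fit(), roughly:
> //   val needsCaching = $(handlePersistence) &&
> //     dataset.storageLevel == StorageLevel.NONE
> //   val input = if (needsCaching) dataset.cache() else dataset
> //   ... train on `input` ...
> //   if (needsCaching) input.unpersist()
> {code}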



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20598) Iterative checkpoints do not get removed from HDFS

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-20598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-20598.
--
Resolution: Incomplete

> Iterative checkpoints do not get removed from HDFS
> --
>
> Key: SPARK-20598
> URL: https://issues.apache.org/jira/browse/SPARK-20598
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core, YARN
>Affects Versions: 2.1.0
>Reporter: Guillem Palou
>Priority: Major
>  Labels: bulk-closed
>
> I am running a PySpark application that makes use of dataframe.checkpoint() 
> because Spark otherwise needs exponential time to compute the plan, and 
> eventually I had to stop it. Using {{checkpoint}} allowed the application to 
> proceed with the computation, but I noticed that the HDFS cluster was filling 
> up with RDD files. Spark is running in YARN client mode. 
> I managed to reproduce the problem in a toy example as below:
> {code}
> df = spark.createDataFrame([T.Row(a=1, b=2)]).checkpoint()
> for i in range(4):
> # either line of the following 2 will produce the error   
> df = df.select('*', F.concat(*df.columns)).cache().checkpoint()
> df = df.join(df, on='a').cache().checkpoint()
> # the following two lines do not seem to have an effect
> gc.collect()
> sc._jvm.System.gc()
> {code}
> After running the code and {{sc.stop()}}, I can still see the RDDs 
> checkpointed in HDFS:
> {quote}
> guillem@ip-10-9-94-0:~$ hdfs dfs -du -h $CHECKPOINT_PATH
> 5.2 K  $CHECKPOINT_PATH/53e54099-3f50-4aeb-aee2-d817bfe57d77/rdd-12
> 5.2 K  $CHECKPOINT_PATH/53e54099-3f50-4aeb-aee2-d817bfe57d77/rdd-18
> 5.2 K $CHECKPOINT_PATH/53e54099-3f50-4aeb-aee2-d817bfe57d77/rdd-24
> 5.2 K  $CHECKPOINT_PATH/53e54099-3f50-4aeb-aee2-d817bfe57d77/rdd-30
> 5.2 K  $CHECKPOINT_PATH/53e54099-3f50-4aeb-aee2-d817bfe57d77/rdd-6
> {quote}
> The config flag {{spark.cleaner.referenceTracking.cleanCheckpoints}} is set 
> to {{true}}. I would expect Spark to clean up all RDDs that can't be 
> accessed. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24260) Support for multi-statement SQL in SparkSession.sql API

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24260.
--
Resolution: Incomplete

> Support for multi-statement SQL in SparkSession.sql API
> ---
>
> Key: SPARK-24260
> URL: https://issues.apache.org/jira/browse/SPARK-24260
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ravindra Nath Kakarla
>Priority: Minor
>  Labels: bulk-closed
>
> The sparkSession.sql API only supports executing a single SQL statement per 
> call. A multi-statement SQL string cannot be executed in a single call. For 
> example,
> {code:java}
> SparkSession sparkSession = SparkSession.builder()
>     .appName("MultiStatementSQL").master("local").config("", "").getOrCreate()
> sparkSession.sql("DROP TABLE IF EXISTS count_employees; CACHE TABLE 
> employees; CREATE TEMPORARY VIEW count_employees AS SELECT count(*) as cnt 
> FROM employees; SELECT * FROM count_employees") 
> {code}
> The above code fails with the error: 
> {code:java}
> org.apache.spark.sql.catalyst.parser.ParseException: mismatched input ';' 
> expecting {code}
> The solution to this problem is to call the .sql API multiple times in a 
> specific order.
> {code:java}
> sparkSession.sql("DROP TABLE IF EXISTS count_employees")
> sparkSession.sql("CACHE TABLE employees")
> sparkSession.sql("CREATE TEMPORARY VIEW count_employees AS SELECT count(*) as 
> cnt FROM employees;")
> sparkSession.sql("SELECT * FROM count_employees")
> {code}
> If these SQL statements come from a string / file, users have to implement 
> their own parsers to execute them, like:
> {code:java}
> val sqlFromFile = """DROP TABLE IF EXISTS count_employees;
>  |CACHE TABLE employees;
>  |CREATE TEMPORARY VIEW count_employees AS SELECT count(*) as cnt FROM 
> employees; SELECT * FROM count_employees""".stripMargin{code}
> {code:java}
> sqlFromFile.split(";")
> .forEach(line => sparkSession.sql(line))
> {code}
> This naive parser can fail for many edge cases (like ";" inside a string). 
> Even if users use the same grammar used by Spark and implement their own 
> parsing, it can go out of sync with the way Spark parses the statements.
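> For illustration, a slightly safer splitter sketch that at least keeps ";" 
> inside single-quoted strings intact (it still ignores comments and escaped 
> quotes):
> {code:java}
> // Sketch only: split on ';' characters that are outside single quotes.
> def splitStatements(sql: String): Seq[String] = {
>   val statements = scala.collection.mutable.ArrayBuffer.empty[String]
>   val current = new StringBuilder
>   var inQuote = false
>   sql.foreach {
>     case '\'' => inQuote = !inQuote; current += '\''
>     case ';' if !inQuote => statements += current.toString; current.clear()
>     case c => current += c
>   }
>   if (current.nonEmpty) statements += current.toString
>   statements.map(_.trim).filter(_.nonEmpty)
> }
>
> // splitStatements(sqlFromFile).foreach(stmt => sparkSession.sql(stmt))
> {code}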
> Can support for multiple SQL statements be built into SparkSession.sql API 
> itself?
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21707) Improvement a special case for non-deterministic filters in optimizer

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-21707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-21707.
--
Resolution: Incomplete

> Improvement a special case for non-deterministic filters in optimizer
> -
>
> Key: SPARK-21707
> URL: https://issues.apache.org/jira/browse/SPARK-21707
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: caoxuewen
>Priority: Major
>  Labels: bulk-closed
>
> Currently the optimizer does a lot of special handling for non-deterministic 
> projects and filters, but it is not good enough. This patch adds a new special 
> case for non-deterministic filters, so that only the fields the user actually 
> needs are read when the filter is non-deterministic.
> For example, when the filter condition is non-deterministic (e.g. it contains 
> the rand function), the optimizer generates the following HiveTableScans plans:
> ```
> HiveTableScans plan:Aggregate [k#2L], [k#2L, k#2L, sum(cast(id#1 as bigint)) 
> AS sum(id)#395L]
> +- Project [d004#205 AS id#1, CEIL(c010#214) AS k#2L]
>+- Filter ((isnotnull(d004#205) && (rand(-4530215890880734772) <= 0.5)) && 
> NOT (cast(cast(d004#205 as decimal(10,0)) as decimal(11,1)) = 0.0))
>   +- MetastoreRelation XXX_database, XXX_table
> HiveTableScans plan:Project [d004#205 AS id#1, CEIL(c010#214) AS k#2L]
> +- Filter ((isnotnull(d004#205) && (rand(-4530215890880734772) <= 0.5)) && 
> NOT (cast(cast(d004#205 as decimal(10,0)) as decimal(11,1)) = 0.0))
>+- MetastoreRelation XXX_database, XXX_table
> HiveTableScans plan:Filter ((isnotnull(d004#205) && 
> (rand(-4530215890880734772) <= 0.5)) && NOT (cast(cast(d004#205 as 
> decimal(10,0)) as decimal(11,1)) = 0.0))
> +- MetastoreRelation XXX_database, XXX_table
> HiveTableScans plan:MetastoreRelation XXX_database, XXX_table
> ```
> So HiveTableScan will read all the fields from the table, even though we only 
> need ‘d004’ and 'c010'. This hurts the performance of the task.
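> Until the optimizer handles this, a manual workaround is sketched below (not 
> from the original report): select only the needed columns before applying the 
> non-deterministic filter, so the scan does not have to read every field. Table 
> and column names are taken from the obfuscated plans above.
> {code:scala}
> import org.apache.spark.sql.functions._
> 
> val pruned = spark.table("XXX_database.XXX_table")
>   .select(col("d004"), col("c010"))                 // prune columns up front
>   .filter(col("d004").isNotNull && rand() <= 0.5)   // non-deterministic filter
>   .select(col("d004").as("id"), ceil(col("c010")).as("k"))
> {code}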



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24731) java.io.IOException: s3n://bucketname: 400 : Bad Request

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24731.
--
Resolution: Incomplete

>  java.io.IOException: s3n://bucketname: 400 : Bad Request
> -
>
> Key: SPARK-24731
> URL: https://issues.apache.org/jira/browse/SPARK-24731
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.3.1
>Reporter: sivakphani
>Priority: Major
>  Labels: bulk-closed
>
> I wrote code to connect to an AWS S3 bucket and read a JSON file through 
> PySpark. When I submit it locally, I get this error:
>  File "PYSPARK_examples/Pyspark11.py", line 105, in 
>      df=sqlContext.read.json(path)
>    File 
> "/usr/local/spark-2.3.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/readwriter.py",
>  line 261, in json
>    File 
> "/usr/local/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1257, in __call__
>    File 
> "/usr/local/spark-2.3.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/utils.py",
>  line 63, in deco
>    File 
> "/usr/local/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py",
>  line 328, in get_return_value
>  py4j.protocol.Py4JJavaError: An error occurred while calling o29.json.
>  : java.io.IOException: s3n://bucketname: 400 : Bad Request
>      at 
> org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.processException(Jets3tNativeFileSystemStore.java:453)
>      at 
> org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.processException(Jets3tNativeFileSystemStore.java:427)
>      at 
> org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.handleException(Jets3tNativeFileSystemStore.java:411)
>      at 
> org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:181)
>      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>      at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>      at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>      at java.lang.reflect.Method.invoke(Method.java:498)
>      at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
>      at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>      at org.apache.hadoop.fs.s3native.$Proxy12.retrieveMetadata(Unknown 
> Source)
>      at 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:476)
>      at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426)
>      at 
> org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:714)
>      at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:389)
>      at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:389)
>      at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>      at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>      at scala.collection.immutable.List.foreach(List.scala:381)
>      at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>      at scala.collection.immutable.List.flatMap(List.scala:344)
>      at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:388)
>      at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
>      at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
>      at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:397)
>      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>      at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>      at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>      at java.lang.reflect.Method.invoke(Method.java:498)
>      at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>      at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>      at py4j.Gateway.invoke(Gateway.java:282)
>      at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>      at py4j.commands.CallCommand.execute(CallCommand.java:79)
>      at py4j.GatewayConnection.run(GatewayConnection.java:238)
>      at java.lang.Thread.run(Thread.java:748)
>  Caused by: org.jets3t.service.impl.rest.HttpException: 400 Bad Request
>      at 
> org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:425)
>      at 
> 

[jira] [Resolved] (SPARK-21405) Add LBFGS solver for GeneralizedLinearRegression

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-21405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-21405.
--
Resolution: Incomplete

> Add LBFGS solver for GeneralizedLinearRegression
> 
>
> Key: SPARK-21405
> URL: https://issues.apache.org/jira/browse/SPARK-21405
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Seth Hendrickson
>Priority: Major
>  Labels: bulk-closed
>
> GeneralizedLinearRegression in Spark ML currently only allows 4096 features 
> because it uses IRLS, and hence WLS, as an optimizer which relies on 
> collecting the covariance matrix to the driver. GLMs can also be fit by 
> simple gradient based methods like LBFGS.
> The new API from 
> [SPARK-19762|https://issues.apache.org/jira/browse/SPARK-19762] makes this 
> easy to add. I've already prototyped it, and it works pretty well. This 
> change would allow an arbitrary number of features (up to what can fit on a 
> single node) as in Linear/Logistic regression.
> For reference, other GLM packages also support this - e.g. statsmodels, H2O.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22868) 64KB JVM bytecode limit problem with aggregation

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-22868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-22868.
--
Resolution: Incomplete

> 64KB JVM bytecode limit problem with aggregation
> 
>
> Key: SPARK-22868
> URL: https://issues.apache.org/jira/browse/SPARK-22868
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Kazuaki Ishizaki
>Priority: Major
>  Labels: bulk-closed
>
> The following programs can throw an exception due to the 64KB JVM bytecode 
> limit
> {code}
> val df = spark.sparkContext.parallelize(
>   Seq((1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9, 10.0, 11.1, 12.2, 
> 13.3, 14.4, 15.5, 16.6, 17.7, 18.8, 19.9, 20.0, 21.1, 22.2)),
>   1).toDF()
> df.agg(
>   kurtosis('_1), kurtosis('_2), kurtosis('_3), kurtosis('_4), 
> kurtosis('_5),
>   kurtosis('_6), kurtosis('_7), kurtosis('_8), kurtosis('_9), 
> kurtosis('_10),
>   kurtosis('_11), kurtosis('_12), kurtosis('_13), kurtosis('_14), 
> kurtosis('_15)
> ).collect
> df.groupBy('_22)
>   .agg(
> kurtosis('_1), kurtosis('_2), kurtosis('_3), kurtosis('_4), 
> kurtosis('_5),
> kurtosis('_6), kurtosis('_7), kurtosis('_8), kurtosis('_9), 
> kurtosis('_10),
> kurtosis('_11), kurtosis('_12), kurtosis('_13), kurtosis('_14), 
> kurtosis('_15)
> ).collect
> df.groupBy(
> round('_1, 0), round('_2, 0), round('_3, 0), round('_4, 0), 
> round('_5, 0),
> round('_6, 0), round('_7, 0), round('_8, 0), round('_9, 0), 
> round('_10, 0))
>   .agg(
> kurtosis('_1), kurtosis('_2), kurtosis('_3), kurtosis('_4), 
> kurtosis('_5),
> kurtosis('_6), kurtosis('_7)
> ).collect
> {code}
> */



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24447) Pyspark RowMatrix.columnSimilarities() loses spark context

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24447.
--
Resolution: Incomplete

> Pyspark RowMatrix.columnSimilarities() loses spark context
> --
>
> Key: SPARK-24447
> URL: https://issues.apache.org/jira/browse/SPARK-24447
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 2.3.0
>Reporter: Perry Chu
>Priority: Minor
>  Labels: bulk-closed
>
> The RDD behind the CoordinateMatrix returned by 
> RowMatrix.columnSimilarities() appears to be losing track of the spark 
> context if spark is stopped and restarted in pyspark.
> I'm pretty new to spark - not sure if the problem is on the python side or 
> the scala side - would appreciate someone more experienced taking a look.
> This snippet should reproduce the error:
> {code:java}
> import pyspark
> from pyspark.mllib.linalg.distributed import RowMatrix
> spark.stop()
> spark = pyspark.sql.SparkSession.builder.getOrCreate()
> rows = spark.sparkContext.parallelize([[0,1,2],[1,1,1]])
> matrix = RowMatrix(rows)
> sims = matrix.columnSimilarities()
> ## This works, prints "3 3" as expected (3 columns = 3x3 matrix)
> print(sims.numRows(),sims.numCols())
> ## This throws an error (stack trace below)
> print(sims.entries.first())
> ## Later I tried this
> print(rows.context) #
> print(sims.entries.context) # PySparkShell>, then throws an error{code}
> Error stack trace
> {code:java}
> ---
> AttributeError Traceback (most recent call last)
>  in ()
> > 1 sims.entries.first()
> /usr/lib/spark/python/pyspark/rdd.py in first(self)
> 1374 ValueError: RDD is empty
> 1375 """
> -> 1376 rs = self.take(1)
> 1377 if rs:
> 1378 return rs[0]
> /usr/lib/spark/python/pyspark/rdd.py in take(self, num)
> 1356
> 1357 p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts))
> -> 1358 res = self.context.runJob(self, takeUpToNumLeft, p)
> 1359
> 1360 items += res
> /usr/lib/spark/python/pyspark/context.py in runJob(self, rdd, partitionFunc, 
> partitions, allowLocal)
> 999 # SparkContext#runJob.
> 1000 mappedRDD = rdd.mapPartitions(partitionFunc)
> -> 1001 port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, 
> partitions)
> 1002 return list(_load_from_socket(port, mappedRDD._jrdd_deserializer))
> 1003
> AttributeError: 'NoneType' object has no attribute 'sc'
> {code}
> PySpark columnSimilarities documentation
> [http://spark.apache.org/docs/latest/api/python/_modules/pyspark/mllib/linalg/distributed.html#RowMatrix.columnSimilarities]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23612) Specify formats for individual DateType and TimestampType columns in schemas

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23612.
--
Resolution: Incomplete

> Specify formats for individual DateType and TimestampType columns in schemas
> 
>
> Key: SPARK-23612
> URL: https://issues.apache.org/jira/browse/SPARK-23612
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Patrick Young
>Priority: Minor
>  Labels: DataType, bulk-closed, date, spree, sql
>
> [https://github.com/apache/spark/blob/407f67249639709c40c46917700ed6dd736daa7d/python/pyspark/sql/types.py#L162-L200]
> It would be very helpful if it were possible to specify the format for 
> individual columns in a schema when reading csv files, rather than one format:
> {code:java|title=Bar.python|borderStyle=solid}
> # Currently can only do something like:
> spark.read.option("dateFormat", "MMdd").csv(...) 
> # Would like to be able to do something like:
> schema = StructType([
>     StructField("date1", DateType(format="MM/dd/"), True),
>     StructField("date2", DateType(format="MMdd"), True)
> ]
> read.schema(schema).csv(...)
> {code}
> Thanks for any help, input!
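> A hedged workaround with the existing API is to read the date columns as 
> strings and convert each one with its own pattern via {{to_date}}; the column 
> names follow the example above and the patterns are only illustrative:
> {code:scala}
> import org.apache.spark.sql.functions.{col, to_date}
> import org.apache.spark.sql.types._
> 
> // Read both date columns as plain strings first.
> val rawSchema = StructType(Seq(
>   StructField("date1", StringType, nullable = true),
>   StructField("date2", StringType, nullable = true)))
> 
> val df = spark.read.schema(rawSchema).csv("/path/to/data.csv")
>   .withColumn("date1", to_date(col("date1"), "MM/dd/yyyy"))   // per-column format
>   .withColumn("date2", to_date(col("date2"), "yyyyMMdd"))
> {code}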



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23442) Reading from partitioned and bucketed table uses only bucketSpec.numBuckets partitions in all cases

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23442.
--
Resolution: Incomplete

> Reading from partitioned and bucketed table uses only bucketSpec.numBuckets 
> partitions in all cases
> ---
>
> Key: SPARK-23442
> URL: https://issues.apache.org/jira/browse/SPARK-23442
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.1
>Reporter: Pranav Rao
>Priority: Major
>  Labels: bulk-closed
>
> Through the DataFrameWriter[T] interface I have created an external Hive table 
> with 5000 (horizontal) partitions and 50 buckets in each partition. Overall 
> the dataset is 600GB and the provider is Parquet.
> Now this works great when joining with a similarly bucketed dataset - it's 
> able to avoid a shuffle. 
> But any action on this Dataframe (from _spark.table("tablename")_) works with 
> only 50 RDD partitions. This is happening because of 
> [createBucketedReadRDD|https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala].
>  So the 600GB dataset is only read through 50 tasks, which makes this 
> partitioning + bucketing scheme not useful.
> I cannot expose the base directory of the parquet folder for reading the 
> dataset, because the partition locations don't follow a (basePath + partSpec) 
> format.
> Meanwhile, are there workarounds to use higher parallelism while reading such 
> a table? 
>  Let me know if I can help in any way.
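> One possible workaround (a sketch, not a recommendation from the Spark docs) is 
> to repartition right after the scan so that non-join actions get more 
> parallelism; this adds a shuffle and therefore gives up the bucketing benefit 
> for that particular query. The table name follows the report above and the 
> partition count is arbitrary:
> {code:scala}
> val df = spark.table("tablename")
> 
> // Spread the work across many more tasks than the 50 buckets.
> val widened = df.repartition(5000)
> widened.count()
> {code}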



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21730) Consider officially dropping PyPy pre-2.5 support

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-21730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-21730.
--
Resolution: Incomplete

> Consider officially dropping PyPy pre-2.5 support
> -
>
> Key: SPARK-21730
> URL: https://issues.apache.org/jira/browse/SPARK-21730
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Holden Karau
>Priority: Major
>  Labels: bulk-closed
>
> Jenkins currently tests with PyPy 2.5+; should we consider dropping PyPy 2.3 
> support from the documentation?
> cc [~davies] [~shaneknapp]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23485) Kubernetes should support node blacklist

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23485.
--
Resolution: Incomplete

> Kubernetes should support node blacklist
> 
>
> Key: SPARK-23485
> URL: https://issues.apache.org/jira/browse/SPARK-23485
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes, Scheduler
>Affects Versions: 2.3.0
>Reporter: Imran Rashid
>Priority: Major
>  Labels: bulk-closed
>
> Spark's BlacklistTracker maintains a list of "bad nodes" which it will not 
> use for running tasks (eg., because of bad hardware).  When running in yarn, 
> this blacklist is used to avoid ever allocating resources on blacklisted 
> nodes: 
> https://github.com/apache/spark/blob/e836c27ce011ca9aef822bef6320b4a7059ec343/resource-managers/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala#L128
> I'm just beginning to poke around the kubernetes code, so apologies if this 
> is incorrect -- but I didn't see any references to 
> {{scheduler.nodeBlacklist()}} in {{KubernetesClusterSchedulerBackend}} so it 
> seems this is missing.  Thought of this while looking at SPARK-19755, a 
> similar issue on mesos.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21443) Very long planning duration for queries with lots of operations

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-21443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-21443.
--
Resolution: Incomplete

> Very long planning duration for queries with lots of operations
> ---
>
> Key: SPARK-21443
> URL: https://issues.apache.org/jira/browse/SPARK-21443
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Eyal Zituny
>Priority: Minor
>  Labels: bulk-closed
>
> Creating a streaming query with a large number of operations and fields (100+) 
> results in a very long query planning phase. In the example below, the planning 
> phase took 35 seconds while the actual batch execution took only 1.3 seconds.
>  After some investigation, I found that the root causes are two optimizer 
> rules which seem to take most of the planning time: 
> InferFiltersFromConstraints and PruneFilters.
> I would suggest the following:
>  # fix the inefficient optimizer rules
>  # add warn-level logging if a rule takes more than xx ms
>  # allow custom removal of optimizer rules (the opposite of 
> spark.experimental.extraOptimizations; see the sketch after the query progress 
> output below)
>  # reuse query plans (optional) where possible
> Reproducing this issue can be done with the script below, which simulates the 
> scenario:
> {code:java}
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.execution.streaming.MemoryStream
> import 
> org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, 
> QueryStartedEvent, QueryTerminatedEvent}
> import org.apache.spark.sql.streaming.{ProcessingTime, StreamingQueryListener}
> case class Product(pid: Long, name: String, price: Long, ts: Long = 
> System.currentTimeMillis())
> case class Events (eventId: Long, eventName: String, productId: Long) {
>   def this(id: Long) = this(id, s"event$id", id%100)
> }
> object SparkTestFlow {
>   def main(args: Array[String]): Unit = {
>   val spark = SparkSession
> .builder
> .appName("TestFlow")
> .master("local[8]")
> .getOrCreate()
>   spark.sqlContext.streams.addListener(new StreamingQueryListener 
> {
>   override def onQueryTerminated(event: 
> QueryTerminatedEvent): Unit = {}
>   override def onQueryProgress(event: 
> QueryProgressEvent): Unit = {
>   if (event.progress.numInputRows>0) {
>   println(event.progress.toString())
>   }
>   }
>   override def onQueryStarted(event: QueryStartedEvent): 
> Unit = {}
>   })
>   
>   import spark.implicits._
>   implicit val  sclContext = spark.sqlContext
>   import org.apache.spark.sql.functions.expr
>   val seq = (1L to 100L).map(i => Product(i, s"name$i", 10L*i))
>   val lookupTable = spark.createDataFrame(seq)
>   val inputData = MemoryStream[Events]
>   inputData.addData((1L to 100L).map(i => new Events(i)))
>   val events = inputData.toDF()
> .withColumn("w1", expr("0"))
> .withColumn("x1", expr("0"))
> .withColumn("y1", expr("0"))
> .withColumn("z1", expr("0"))
>   val numberOfSelects = 40 // set to 100+ and the planning takes 
> forever
>   val dfWithSelectsExpr = (2 to 
> numberOfSelects).foldLeft(events)((df,i) =>{
>   val arr = df.columns.++(Array(s"w${i-1} + rand() as 
> w$i", s"x${i-1} + rand() as x$i", s"y${i-1} + 2 as y$i", s"z${i-1} +1 as 
> z$i"))
>   df.selectExpr(arr:_*)
>   })
>   val withJoinAndFilter = dfWithSelectsExpr
> .join(lookupTable, expr("productId = pid"))
> .filter("productId < 50")
>   val query = withJoinAndFilter.writeStream
> .outputMode("append")
> .format("console")
> .trigger(ProcessingTime(2000))
> .start()
>   query.processAllAvailable()
>   spark.stop()
>   }
> }
> {code}
> the query progress output will show:
> {code:java}
> "durationMs" : {
> "addBatch" : 1310,
> "getBatch" : 6,
> "getOffset" : 0,
> "*queryPlanning*" : 36924,
> "triggerExecution" : 38297,
> "walCommit" : 33
>   }
> {code}
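> For reference, a small sketch (not from the original report) of the 
> {{spark.experimental.extraOptimizations}} hook mentioned in suggestion #3: 
> today it can only add rules, not remove built-in ones, which is why the 
> opposite facility is requested above.
> {code:scala}
> import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
> import org.apache.spark.sql.catalyst.rules.Rule
> 
> // A do-nothing rule, purely to show where custom optimizer rules are registered.
> object NoopRule extends Rule[LogicalPlan] {
>   override def apply(plan: LogicalPlan): LogicalPlan = plan
> }
> 
> spark.experimental.extraOptimizations = Seq(NoopRule)
> {code}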



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-24946) PySpark - Allow np.Arrays and pd.Series in df.approxQuantile

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24946.
--
Resolution: Incomplete

> PySpark - Allow np.Arrays and pd.Series in df.approxQuantile
> 
>
> Key: SPARK-24946
> URL: https://issues.apache.org/jira/browse/SPARK-24946
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Paul Westenthanner
>Priority: Minor
>  Labels: DataFrame, beginner, bulk-closed, pyspark
>
> As a Python user it would be convenient to pass a numpy array or pandas Series 
> to `{{approxQuantile}}(_col_, _probabilities_, _relativeError_)` for the 
> probabilities parameter. 
>  
> Especially for creating cumulative plots (say in 1% steps) it is handy to use 
> `approxQuantile(col, np.arange(0, 1.0, 0.01), relativeError)`.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23833) Incorrect primitive type check for input arguments of udf

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23833.
--
Resolution: Incomplete

> Incorrect primitive type check for input arguments of udf
> -
>
> Key: SPARK-23833
> URL: https://issues.apache.org/jira/browse/SPARK-23833
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Valentin Nikotin
>Priority: Major
>  Labels: bulk-closed
>
> There is a documented behavior for Scala UDFs with primitive-type arguments:
> {quote}Note that if you use primitive parameters, you are not able to check 
> if it is null or not, and the UDF will return null for you if the primitive 
> input is null.
> {quote}
> The initial issue is SPARK-11725;
>  the corresponding PR is 
> [PR|https://github.com/apache/spark/pull/9770/commits/a8a30674ce531c9cd10107200a3f72f9539cd8f6].
> The problem is that {{ScalaReflection.getParameterTypes}} doesn't work 
> correctly due to type erasure. 
> The correct check for "is this type primitive" should be based on a typeTag, 
> something like this:
> {code:java}
> typeTag[T].tpe.typeSymbol.asClass.isPrimitive
> {code}
>  
> The problem appears if we have high order functions:
> {code:java}
> val f = (x: Long) => x
> def identity[T, U](f: T => U): T => U = (t: T) => f(t)
> val udf0 = udf(f)
> val udf1 = udf(identity(f))
> val getNull = udf(() => null.asInstanceOf[java.lang.Long])
> spark.range(5).toDF().
>   withColumn("udf0", udf0(getNull())).
>   withColumn("udf1", udf1(getNull())).
>   show()
> spark.range(5).toDF().
>   withColumn("udf0", udf0(getNull())).
>   withColumn("udf1", udf1(getNull())).
>   explain()
> {code}
> Test execution on Spark 2.2 spark-shell:
> {code:java}
> scala> val f = (x: Long) => x
> f: Long => Long = 
> scala> def identity[T, U](f: T => U): T => U = (t: T) => f(t)
> identity: [T, U](f: T => U)T => U
> scala> val udf0 = udf(f)
> udf0: org.apache.spark.sql.expressions.UserDefinedFunction = 
> UserDefinedFunction(,LongType,Some(List(LongType)))
> scala> val udf1 = udf(identity(f))
> udf1: org.apache.spark.sql.expressions.UserDefinedFunction = 
> UserDefinedFunction(,LongType,Some(List(LongType)))
> scala> val getNull = udf(() => null.asInstanceOf[java.lang.Long])
> getNull: org.apache.spark.sql.expressions.UserDefinedFunction = 
> UserDefinedFunction(,LongType,Some(List()))
> scala> spark.range(5).toDF().
>  |   withColumn("udf0", udf0(getNull())).
>  |   withColumn("udf1", udf1(getNull())).
>  |   show()
> +---+----+----+
> | id|udf0|udf1|
> +---+----+----+
> |  0|null|   0|
> |  1|null|   0|
> |  2|null|   0|
> |  3|null|   0|
> |  4|null|   0|
> +---+----+----+
> scala> spark.range(5).toDF().
>  |   withColumn("udf0", udf0(getNull())).
>  |   withColumn("udf1", udf1(getNull())).
>  |   explain()
> == Physical Plan ==
> *Project [id#19L, if (isnull(UDF())) null else UDF(UDF()) AS udf0#24L, 
> UDF(UDF()) AS udf1#28L]
> +- *Range (0, 5, step=1, splits=6)
> {code}
>  
> The typeTag information about input parameters is available in the udf function 
> but is only used to get the schema; it should be added to ScalaUDF too so that we 
> can use it later:
> {code:java}
> def udf[RT: TypeTag, A1: TypeTag, A2: TypeTag](f: Function2[A1, A2, RT]): 
> UserDefinedFunction = {
>   val inputTypes = Try(ScalaReflection.schemaFor(typeTag[A1]).dataType :: 
> ScalaReflection.schemaFor(typeTag[A2]).dataType :: Nil).toOption
>   UserDefinedFunction(f, ScalaReflection.schemaFor(typeTag[RT]).dataType, 
> inputTypes)
> }
> {code}
>  
> Here is current vs desired version:
> {code:java}
> scala> import org.apache.spark.sql.catalyst.ScalaReflection
> import org.apache.spark.sql.catalyst.ScalaReflection
> scala> ScalaReflection.getParameterTypes(identity(f))
> res2: Seq[Class[_]] = WrappedArray(class java.lang.Object)
> scala> ScalaReflection.getParameterTypes(identity(f)).map(_.isPrimitive)
> res7: Seq[Boolean] = ArrayBuffer(false)
> {code}
> versus
> {code:java}
> scala> import scala.reflect.runtime.universe.{typeTag, TypeTag}
> import scala.reflect.runtime.universe.{typeTag, TypeTag}
> scala> def myGetParameterTypes[T : TypeTag, U](func: T => U) = {
>  |   typeTag[T].tpe.typeSymbol.asClass
>  | }
> myGetParameterTypes: [T, U](func: T => U)(implicit evidence$1: 
> reflect.runtime.universe.TypeTag[T])reflect.runtime.universe.ClassSymbol
> scala> myGetParameterTypes(f)
> res3: reflect.runtime.universe.ClassSymbol = class Long
> scala> myGetParameterTypes(f).isPrimitive
> res4: Boolean = true
> {code}
> Although for this case there is a workaround using {{@specialized(Long)}}:
> {code:scala}
> scala> def identity2[@specialized(Long) T, U](f: T => U): T => U = (t: T) => 
> 

[jira] [Resolved] (SPARK-25215) Make PipelineModel public

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25215.
--
Resolution: Incomplete

> Make PipelineModel public
> -
>
> Key: SPARK-25215
> URL: https://issues.apache.org/jira/browse/SPARK-25215
> Project: Spark
>  Issue Type: Wish
>  Components: ML
>Affects Versions: 2.3.1
>Reporter: Nicholas Resnick
>Priority: Minor
>  Labels: ML, bulk-closed
>
> Can PipelineModel be made public? I can't think of a reason for having it be 
> private to the ml package.
> My specific use-case is that I'm creating a feature for 
> serializing/deserializing PipelineModels in a production environment, and am 
> trying to write specs for this feature, but I can't easily create 
> PipelineModels for the specs.
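> For the spec use case, a hedged sketch of the only public route today is to fit 
> a (trivial) Pipeline on a tiny DataFrame to obtain a PipelineModel; the stage 
> and column choices below are purely illustrative:
> {code:scala}
> import org.apache.spark.ml.{Pipeline, PipelineModel, PipelineStage}
> import org.apache.spark.ml.feature.Binarizer
> 
> val df = spark.createDataFrame(Seq((0, 0.1), (1, 0.8))).toDF("id", "score")
> 
> val binarizer = new Binarizer()
>   .setInputCol("score")
>   .setOutputCol("label")
>   .setThreshold(0.5)
> 
> val stages: Array[PipelineStage] = Array(binarizer)
> val model: PipelineModel = new Pipeline().setStages(stages).fit(df)
> 
> // The fitted model can then be round-tripped for serialization specs.
> model.write.overwrite().save("/tmp/pipeline-model-spec")
> {code}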



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23322) Launcher handles can miss application updates if application finishes too quickly

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23322.
--
Resolution: Incomplete

> Launcher handles can miss application updates if application finishes too 
> quickly
> -
>
> Key: SPARK-23322
> URL: https://issues.apache.org/jira/browse/SPARK-23322
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Marcelo Masiero Vanzin
>Priority: Minor
>  Labels: bulk-closed
>
> This is the underlying issue in SPARK-23020, which was worked around in our 
> tests but still exists in the code.
> If a child application finishes too quickly, the launcher code may clean up 
> the handle's state before the connection from the child has been 
> acknowledged. This means that the application handle will have a final state of 
> LOST instead of whatever final state the application sent.
> This doesn't seem to affect child processes as much as the new in-process 
> launch mode, and it requires the child application to finish very quickly, 
> which should be rare for the kind of use case the launcher library covers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24560) Fix some getTimeAsMs as getTimeAsSeconds

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24560.
--
Resolution: Incomplete

> Fix some getTimeAsMs as getTimeAsSeconds
> 
>
> Key: SPARK-24560
> URL: https://issues.apache.org/jira/browse/SPARK-24560
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos, Spark Core
>Affects Versions: 2.3.1
>Reporter: xueyu
>Priority: Major
>  Labels: bulk-closed
>
> There are some places using "getTimeAsMs" rather than "getTimeAsSeconds". 
> This will return a wrong value when the user specifies a value without a time 
> unit.
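> A small illustration (with a made-up config key) of the difference: the same 
> unit-less value is interpreted as milliseconds by one accessor and as seconds 
> by the other, so picking the wrong one silently changes the timeout by a 
> factor of 1000.
> {code:scala}
> import org.apache.spark.SparkConf
> 
> val conf = new SparkConf().set("spark.example.timeout", "30")
> 
> conf.getTimeAsMs("spark.example.timeout")       // 30, i.e. treated as 30 ms
> conf.getTimeAsSeconds("spark.example.timeout")  // 30, i.e. treated as 30 s
> {code}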



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22202) Release tgz content differences for python and R

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-22202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-22202.
--
Resolution: Incomplete

> Release tgz content differences for python and R
> 
>
> Key: SPARK-22202
> URL: https://issues.apache.org/jira/browse/SPARK-22202
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SparkR
>Affects Versions: 2.1.2, 2.2.0, 2.3.0
>Reporter: Felix Cheung
>Priority: Minor
>  Labels: bulk-closed
>
> As a follow-up to SPARK-22167: we are currently running different 
> profiles/steps in make-release.sh for hadoop2.7 vs hadoop2.6 (and others), so we 
> should consider whether these differences are significant and whether they should 
> be addressed.
> A couple of things:
> - R.../doc directory is not in any release jar except hadoop 2.6
> - python/dist, python.egg-info are not in any release jar except hadoop 2.7
> - R DESCRIPTION has a few additions
> I've checked to confirm these are the same in 2.1.1 release so this isn't a 
> regression.
> {code}
> spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/doc:
> sparkr-vignettes.Rmd
> sparkr-vignettes.R
> sparkr-vignettes.html
> index.html
> Only in spark-2.1.2-bin-hadoop2.7/python: dist
> Only in spark-2.1.2-bin-hadoop2.7/python/pyspark: python
> Only in spark-2.1.2-bin-hadoop2.7/python: pyspark.egg-info
> diff -r spark-2.1.2-bin-hadoop2.7/R/lib/SparkR/DESCRIPTION 
> spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/DESCRIPTION
> 25a26,27
> > NeedsCompilation: no
> > Packaged: 2017-10-03 00:42:30 UTC; holden
> 31c33
> < Built: R 3.4.1; ; 2017-10-02 23:18:21 UTC; unix
> ---
> > Built: R 3.4.1; ; 2017-10-03 00:45:27 UTC; unix
> Only in spark-2.1.2-bin-hadoop2.6/R/lib/SparkR: doc
> diff -r spark-2.1.2-bin-hadoop2.7/R/lib/SparkR/html/00Index.html 
> spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/html/00Index.html
> 16a17
> > User guides, package vignettes and other 
> > documentation.
> Only in spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/Meta: vignette.rds
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20153) Support Multiple aws credentials in order to access multiple Hive on S3 table in spark application

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-20153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-20153.
--
Resolution: Incomplete

> Support Multiple aws credentials in order to access multiple Hive on S3 table 
> in spark application 
> ---
>
> Key: SPARK-20153
> URL: https://issues.apache.org/jira/browse/SPARK-20153
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Franck Tago
>Priority: Minor
>  Labels: bulk-closed
>
> I need to access multiple Hive tables in my Spark application, where each Hive 
> table is 
> 1- an external table with data sitting on S3, and
> 2- owned by a different AWS user, so I need to provide different 
> AWS credentials. 
> I am familiar with setting the AWS credentials in the Hadoop configuration 
> object, but that does not really help me because I can only set one pair of 
> (fs.s3a.awsAccessKeyId, fs.s3a.awsSecretAccessKey).
> From my research, there is no easy or elegant way to do this in Spark.
> Why is that?
> How do I address this use case?
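> One possible approach (a sketch that assumes an S3A connector new enough for 
> per-bucket configuration, i.e. Hadoop 2.8+; bucket names and keys are 
> placeholders) is to scope each credential pair to its bucket on the Hadoop 
> configuration:
> {code:scala}
> val hadoopConf = spark.sparkContext.hadoopConfiguration
> 
> // credentials for the bucket backing the first Hive table
> hadoopConf.set("fs.s3a.bucket.bucket-a.access.key", "<ACCESS_KEY_A>")
> hadoopConf.set("fs.s3a.bucket.bucket-a.secret.key", "<SECRET_KEY_A>")
> 
> // credentials for the bucket backing the second Hive table
> hadoopConf.set("fs.s3a.bucket.bucket-b.access.key", "<ACCESS_KEY_B>")
> hadoopConf.set("fs.s3a.bucket.bucket-b.secret.key", "<SECRET_KEY_B>")
> {code}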



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24456) Spark submit - server environment variables are overwritten by client environment variables

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24456.
--
Resolution: Incomplete

> Spark submit - server environment variables are overwritten by client 
> environment variables 
> 
>
> Key: SPARK-24456
> URL: https://issues.apache.org/jira/browse/SPARK-24456
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.3.0
>Reporter: Alon Shoham
>Priority: Minor
>  Labels: bulk-closed
>
> When submitting a spark application in --deploy-mode cluster + spark 
> standalone cluster, environment variables from the client machine overwrite 
> server environment variables. 
>  
> We use *SPARK_DIST_CLASSPATH* environment variable to add extra required 
> dependencies to the application. We observed that client machine 
> SPARK_DIST_CLASSPATH overwrite remote server machine value, resulting in 
> application submission failure. 
>  
> We have inspected the code and found:
> 1. In org.apache.spark.deploy.Client line 86:
> {code:java}
> val command = new Command(mainClass,
>  Seq("{{WORKER_URL}}", "{{USER_JAR}}", driverArgs.mainClass) ++ 
> driverArgs.driverOptions,
>  sys.env, classPathEntries, libraryPathEntries, javaOpts){code}
> 2. In org.apache.spark.launcher.WorkerCommandBuilder line 35:
> {code:java}
> childEnv.putAll(command.environment.asJava)
> childEnv.put(CommandBuilderUtils.ENV_SPARK_HOME, sparkHome){code}
> What line 35 shows is that the environment is overwritten on the server machine, 
> but in line 36 SPARK_HOME is restored to the server value.
> We think the bug can be fixed by adding a line that restores 
> SPARK_DIST_CLASSPATH to its server value, similar to SPARK_HOME.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17877) Can not checkpoint connectedComponents resulting graph

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-17877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-17877.
--
Resolution: Incomplete

> Can not checkpoint connectedComponents resulting graph
> --
>
> Key: SPARK-17877
> URL: https://issues.apache.org/jira/browse/SPARK-17877
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.5.2, 1.6.2, 2.0.1, 2.3.0
>Reporter: Alexander Pivovarov
>Priority: Minor
>  Labels: bulk-closed
>
> The following code demonstrates the issue
> {code}
> import org.apache.spark.graphx._
> val users = sc.parallelize(List(3L -> "lucas", 7L -> "john", 5L -> "matt", 2L 
> -> "kelly"))
> val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, 
> "advisor"), Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
> sc.setCheckpointDir("/tmp/check")
> val g = Graph(users, rel)
> g.checkpoint   // /tmp/check/b1f46ba5-357a-4d6d-8f4d-411b64b27c2f appears
> val gg = g.connectedComponents
> gg.checkpoint
> gg.vertices.collect
> gg.edges.collect
> gg.isCheckpointed  // res5: Boolean = false,   /tmp/check still contains only 
> 1 folder b1f46ba5-357a-4d6d-8f4d-411b64b27c2f
> {code}
> I think the last line should return true instead of false



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23298) distinct.count on Dataset/DataFrame yields non-deterministic results

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23298.
--
Resolution: Incomplete

> distinct.count on Dataset/DataFrame yields non-deterministic results
> 
>
> Key: SPARK-23298
> URL: https://issues.apache.org/jira/browse/SPARK-23298
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, SQL, YARN
>Affects Versions: 2.1.0, 2.2.0
> Environment: Spark 2.2.0 or 2.1.0
> Java 1.8.0_144
> Yarn version:
> {code:java}
> Hadoop 2.6.0-cdh5.12.1
> Subversion http://github.com/cloudera/hadoop -r 
> 520d8b072e666e9f21d645ca6a5219fc37535a52
> Compiled by jenkins on 2017-08-24T16:43Z
> Compiled with protoc 2.5.0
> From source with checksum de51bf9693ab9426379a1cd28142cea0
> This command was run using 
> /usr/lib/hadoop/hadoop-common-2.6.0-cdh5.12.1.jar{code}
>  
>  
>Reporter: Mateusz Jukiewicz
>Priority: Major
>  Labels: Correctness, CorrectnessBug, bulk-closed, correctness
>
> This is what happens (EDIT - managed to get a reproducible example):
> {code:java}
> /* Exemplary spark-shell starting command 
> /opt/spark/bin/spark-shell \
> --num-executors 269 \
> --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
> --conf spark.kryoserializer.buffer.max=512m 
> // The spark.sql.shuffle.partitions is 2154 here, if that matters
> */
> val df = spark.range(1000).withColumn("col1", (rand() * 
> 1000).cast("long")).withColumn("col2", (rand() * 
> 1000).cast("long")).drop("id")
> df.repartition(5240).write.parquet("/test.parquet")
> // Then, ideally in a new session
> val df = spark.read.parquet("/test.parquet")
> df.distinct.count
> // res1: Long = 1001256                                                       
>      
> df.distinct.count
> // res2: Long = 55   {code}
> -The _text_dataset.out_ file is a dataset with one string per line. The 
> string has alphanumeric characters as well as colons and spaces. The line 
> length does not exceed 1200. I don't think that's important though, as the 
> issue appeared on various other datasets, I just tried to narrow it down to 
> the simplest possible case.- (the case is now fully reproducible with the 
> above code)
> The observations regarding the issue are as follows:
>  * I managed to reproduce it on both spark 2.2 and spark 2.1.
>  * The issue occurs in YARN cluster mode (I haven't tested YARN client mode).
>  * The issue is not reproducible on a single machine (e.g. laptop) in spark 
> local mode.
>  * It seems that once the correct count is computed, it is not possible to 
> reproduce the issue in the same spark session. In other words, I was able to 
> get 2-3 incorrect distinct.count results consecutively, but once it got 
> right, it always returned the correct value. I had to re-run spark-shell to 
> observe the problem again.
>  * The issue appears on both Dataset and DataFrame (i.e. using read.text or 
> read.textFile).
>  * The issue is not reproducible on RDD (i.e. dataset.rdd.distinct.count).
>  * Not a single container has failed in those multiple invalid executions.
>  * YARN doesn't show any warnings or errors in those invalid executions.
>  * The execution plan determined for both valid and invalid executions was 
> always the same (it's shown in the _SQL_ tab of the UI).
>  * The number returned in the invalid executions was always greater than the 
> correct number (24 014 227).
>  * This occurs even though the input is already completely deduplicated (i.e. 
> _distinct.count_ shouldn't change anything).
>  * The input isn't replicated (i.e. there's only one copy of each file block 
> on the HDFS).
>  * The problem is probably not related to reading from HDFS. Spark was always 
> able to correctly read all input records (which was shown in the UI), and 
> that number got malformed after the exchange phase:
>  ** correct execution:
>  Input Size / Records: 3.9 GB / 24014227 _(first stage)_
>  Shuffle Write: 3.3 GB / 24014227 _(first stage)_
>  Shuffle Read: 3.3 GB / 24014227 _(second stage)_
>  ** incorrect execution:
>  Input Size / Records: 3.9 GB / 24014227 _(first stage)_
>  Shuffle Write: 3.3 GB / 24014227 _(first stage)_
>  Shuffle Read: 3.3 GB / 24020150 _(second stage)_
>  * The problem might be related with the internal way of Encoders hashing. 
> The reason might be:
>  ** in a simple `distinct.count` invocation, there are in total three 
> hash-related stages (called `HashAggregate`),
>  ** excerpt from scaladoc for `distinct` method says:
> {code:java}
>* @note Equality checking is performed directly on the encoded 
> representation of the data
>* and thus is not affected by a custom `equals` function defined on 
> `T`.{code}
>  * One of my suspicions was the number of partitions we're using (2154). 

[jira] [Resolved] (SPARK-25059) Exception while executing an action on DataFrame that read Json

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25059.
--
Resolution: Incomplete

> Exception while executing an action on DataFrame that read Json
> ---
>
> Key: SPARK-25059
> URL: https://issues.apache.org/jira/browse/SPARK-25059
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.2.0
> Environment: AWS EMR 5.8.0 
> Spark 2.2.0 
>  
>Reporter: Kunal Goswami
>Priority: Major
>  Labels: Spark-SQL, bulk-closed
>
> When I try to read ~9600 Json files using
> {noformat}
> val test = spark.read.option("header", true).option("inferSchema", 
> true).json(paths: _*) {noformat}
>  
> Any action on the above created data frame results in: 
> {noformat}
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "apply2_1$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V"
>  of class "org.apache.spark.sql.catalyst.expressions.Generat[73/1850]
> pecificUnsafeProjection" grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:949)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:839)
>   at org.codehaus.janino.UnitCompiler.writeOpcode(UnitCompiler.java:11081)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4546)
>   at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762)
>   at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180)
>   at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112)
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:1436)
>   at org.codehaus.janino.UnitCompiler.access$1600(UnitCompiler.java:206)
>   at org.codehaus.janino.UnitCompiler$6.visitBlock(UnitCompiler.java:1376)
>   at org.codehaus.janino.UnitCompiler$6.visitBlock(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$Block.accept(Java.java:2471)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2220)
>   at org.codehaus.janino.UnitCompiler.access$1800(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1378)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$IfStatement.accept(Java.java:2621)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:1436)
>   at org.codehaus.janino.UnitCompiler.access$1600(UnitCompiler.java:206)
>   at org.codehaus.janino.UnitCompiler$6.visitBlock(UnitCompiler.java:1376)
>   at org.codehaus.janino.UnitCompiler$6.visitBlock(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$Block.accept(Java.java:2471)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2220)
>   at org.codehaus.janino.UnitCompiler.access$1800(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1378)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$IfStatement.accept(Java.java:2621)
>   at 

[jira] [Resolved] (SPARK-22105) Dataframe has poor performance when computing on many columns with codegen

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-22105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-22105.
--
Resolution: Incomplete

> Dataframe has poor performance when computing on many columns with codegen
> --
>
> Key: SPARK-22105
> URL: https://issues.apache.org/jira/browse/SPARK-22105
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SQL
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>Priority: Minor
>  Labels: bulk-closed
>
> Suppose we have a dataframe with many columns (e.g. 100 columns), each of 
> DoubleType, 
> and we need to compute avg on each column. We will find that dataframe avg 
> is much slower than using RDD.aggregate.
> I observe this issue from this PR: (One pass imputer)
> https://github.com/apache/spark/pull/18902
> I also write a minimal testing code to reproduce this issue, I use computing 
> sum to reproduce this issue:
> https://github.com/apache/spark/compare/master...WeichenXu123:aggr_test2?expand=1
> When we compute `sum` on 100 `DoubleType` columns, dataframe avg will be 
> about 3x slower than `RDD.aggregate`, but if we only compute one column, 
> dataframe avg will be much faster than `RDD.aggregate`.
> The reason for this issue should be a defect in dataframe codegen: codegen 
> inlines everything and generates a large code block. When the column count 
> is large (e.g. 100 columns), the generated code becomes too large, which causes 
> the JVM to fail to JIT and fall back to bytecode interpretation.
> This PR should address the issue:
> https://github.com/apache/spark/pull/19082
> But we need more performance tests against some ML code after the above PR is 
> merged, to check whether the issue is actually fixed.
> This JIRA is used to track this performance issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22823) Race Condition when reading Broadcast shuffle input. Failed to get broadcast piece

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-22823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-22823.
--
Resolution: Incomplete

> Race Condition when reading Broadcast shuffle input. Failed to get broadcast 
> piece
> --
>
> Key: SPARK-22823
> URL: https://issues.apache.org/jira/browse/SPARK-22823
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 2.0.1, 2.2.1, 2.3.0
>Reporter: Dmitrii Bundin
>Priority: Major
>  Labels: bulk-closed
>
> It seems we have a race condition when trying to read shuffle input which is 
> a broadcast, not direct. To read broadcast MapStatuses at 
> {code:java}
> org.apache.spark.shuffle.BlockStoreShuffleReader::read()
> {code}
> we submit a message of the type GetMapOutputStatuses(shuffleId) to be 
> executed in MapOutputTrackerMaster's pool which in turn ends up in creating a 
> new broadcast at
> {code:java}
> org.apache.spark.MapOutputTracker::serializeMapStatuses
> {code}
> if the received statuses are larger than minBroadcastSize.
> So registering the newly created broadcast in the driver's 
> BlockManagerMasterEndpoint may happen later than the moment some executor asks 
> for the broadcast piece location from the driver.
> In our project we get the following exception on a regular basis:
> {code:java}
> java.io.IOException: org.apache.spark.SparkException: Failed to get 
> broadcast_176_piece0 of broadcast_176
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1280)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:174)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:65)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:65)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:89)
> at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$deserializeMapStatuses$1.apply(MapOutputTracker.scala:661)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$deserializeMapStatuses$1.apply(MapOutputTracker.scala:661)
> at org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)
> at 
> org.apache.spark.MapOutputTracker$.logInfo(MapOutputTracker.scala:598)
> at 
> org.apache.spark.MapOutputTracker$.deserializeMapStatuses(MapOutputTracker.scala:660)
> at 
> org.apache.spark.MapOutputTracker.getStatuses(MapOutputTracker.scala:203)
> at 
> org.apache.spark.MapOutputTracker.getMapSizesByExecutorId(MapOutputTracker.scala:142)
> at 
> org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:49)
> at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:109)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: Failed to get 
> broadcast_176_piece0 of broadcast_176
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:146)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:125)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:125)
> at scala.collection.immutable.List.foreach(List.scala:381)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:125)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:186)
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1273)
> {code}
> This exception appears when we try to read a broadcast piece. To do this 
> we need to fetch the broadcast piece location from the driver 
> {code:java}
> org.apache.spark.storage.BlockManagerMaster::getLocations(blockId: BlockId)
> {code}
> . The 
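If the underlying ordering cannot be fixed, one possible mitigation is to raise the threshold above which map output statuses are turned into a broadcast, so that statuses of the affected size are sent directly over RPC and the registration race never comes into play. The sketch below is illustrative only, assumes Spark 2.x, and the chosen threshold must stay below spark.rpc.message.maxSize.

{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Sketch of a mitigation, not a fix for the ordering issue itself: raise the
// size below which serialized map statuses are sent directly instead of being
// broadcast. 100m is an illustrative value; it must remain smaller than
// spark.rpc.message.maxSize (128 MB by default).
val conf = new SparkConf()
  .set("spark.shuffle.mapOutput.minSizeForBroadcast", "100m")

val spark = SparkSession.builder()
  .appName("map-output-broadcast-mitigation")
  .config(conf)
  .getOrCreate()
{code}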

[jira] [Resolved] (SPARK-22911) Migrate structured streaming sources to new DataSourceV2 APIs

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-22911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-22911.
--
Resolution: Incomplete

> Migrate structured streaming sources to new DataSourceV2 APIs
> -
>
> Key: SPARK-22911
> URL: https://issues.apache.org/jira/browse/SPARK-22911
> Project: Spark
>  Issue Type: Umbrella
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Jose Torres
>Priority: Major
>  Labels: bulk-closed
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24608) report number of iteration/progress for ML training

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24608.
--
Resolution: Incomplete

> report number of iteration/progress for ML training
> ---
>
> Key: SPARK-24608
> URL: https://issues.apache.org/jira/browse/SPARK-24608
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.1
>Reporter: R
>Priority: Major
>  Labels: bulk-closed
>
> Debugging big ML models requires careful control of resources (memory, 
> storage, CPU, progress, etc.). Current ML training reports no progress. It 
> would be ideal for training to be more verbose and report its progress.
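There is no live progress reporting today, but as a partial workaround some estimators expose the per-iteration objective after fitting completes. A minimal sketch, assuming a LogisticRegression model and a placeholder input path:

{code:scala}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ml-iteration-history").getOrCreate()

// Placeholder dataset path; any DataFrame with "label" and "features" columns works.
val training = spark.read.format("libsvm").load("data/sample_libsvm_data.txt")

val lr = new LogisticRegression().setMaxIter(100).setRegParam(0.01)
val model = lr.fit(training)

// Inspect how many iterations actually ran and the objective at each one.
println(s"total iterations: ${model.summary.totalIterations}")
model.summary.objectiveHistory.zipWithIndex.foreach { case (loss, i) =>
  println(f"iteration $i%3d: objective = $loss%.6f")
}
{code}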



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24640) size(null) returns null

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24640.
--
Resolution: Incomplete

> size(null) returns null 
> 
>
> Key: SPARK-24640
> URL: https://issues.apache.org/jira/browse/SPARK-24640
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Xiao Li
>Priority: Major
>  Labels: api, bulk-closed
>
> Size(null) should return null instead of -1 in 3.0 release. This is a 
> behavior change. 
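A minimal illustration of the behavior change under discussion, run from spark-shell; the 3.0 line reflects the proposed behavior, not something guaranteed by this sketch:

{code:scala}
// With the 2.x default behavior, size(NULL) evaluates to -1; the proposal is
// for it to evaluate to NULL in the 3.0 release instead.
val result = spark.sql("SELECT size(CAST(NULL AS ARRAY<INT>)) AS s").collect()
// 2.x default:           result(0).getInt(0) == -1
// Proposed 3.0 behavior:  result(0).isNullAt(0) == true
{code}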



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24095) Spark Streaming performance drastically drops when saving dataframes with withColumn

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24095.
--
Resolution: Incomplete

> Spark Streaming performance drastically drops when saving dataframes 
> with withColumn
> -
>
> Key: SPARK-24095
> URL: https://issues.apache.org/jira/browse/SPARK-24095
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: brian wang
>Priority: Major
>  Labels: bulk-closed
>
> We have a Spark Streaming application which streams data from Kafka and 
> ingests the data into HDFS after a series of transformations. We are using 
> Spark SQL to do the transformations and we store the data into HDFS at two 
> stages. The ingestion which we do at the second stage is drastically 
> reducing the performance of the application.
> There are close to 40 million transactions per hour in the incoming data. We 
> have observed a performance bottleneck in the write to HDFS.
> Can you please help us optimize the application performance?
> This is a critical issue since it is holding up our deployment to the 
> production cluster and we are running behind schedule in the production 
> deployment.
>  
> Answer: First Stage Save
> test_Transformed_DOW.cache().withColumn("test_class_map", udf(test_class_map, 
> StringType())(array(test_class))).write.mode("append").option("header","true").csv("/hive/warehouse/test")
> Second Stage Save
> test_Data_Final=spark.sql("select test1,test2,test3.. when int(seats)>=2 
> then 1 when int(seats) < 2 then 0 end as seats from 
> test_Data_Unpivoted").write.format("parquet").mode("append").saveAsTable("test_Data_Output")
> It is the first save stage which is slowing our Spark application's 
> performance if we enable it. If we disable it, the application seems to catch 
> up with the incoming data flow (see the caching sketch below).
>  
>  
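One general mitigation to try, sketched below as an illustration rather than a confirmed fix: persist the transformed batch once so that the Kafka read and the transformations are not recomputed for each of the two writes. The table and path names are placeholders taken loosely from the snippet above.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("double-write-sketch").getOrCreate()

// Stand-in for the transformed output of a micro-batch.
val transformed = spark.table("test_Data_Unpivoted")
  .persist(StorageLevel.MEMORY_AND_DISK)

// First sink: CSV copy.
transformed.write.mode("append").option("header", "true").csv("/hive/warehouse/test")

// Second sink: parquet table.
transformed.write.format("parquet").mode("append").saveAsTable("test_Data_Output")

transformed.unpersist()
{code}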



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23563) make the size of the cache in CodeGenerator configurable

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23563.
--
Resolution: Incomplete

> make the size of the cache in CodeGenerator configurable
> --
>
> Key: SPARK-23563
> URL: https://issues.apache.org/jira/browse/SPARK-23563
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: kejiqing
>Priority: Minor
>  Labels: bulk-closed
>
> the cache in class 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator has a 
> hard-coded maximumSize of 100; the current code is:
>  
> {code:java}
> // scala
> private val cache = CacheBuilder.newBuilder()
>   .maximumSize(100)
>   .build(
> new CacheLoader[CodeAndComment, (GeneratedClass, Int)]() {
>   override def load(code: CodeAndComment): (GeneratedClass, Int) = {
> val startTime = System.nanoTime()
> val result = doCompile(code)
> val endTime = System.nanoTime()
> def timeMs: Double = (endTime - startTime).toDouble / 1000000
> CodegenMetrics.METRIC_SOURCE_CODE_SIZE.update(code.body.length)
> CodegenMetrics.METRIC_COMPILATION_TIME.update(timeMs.toLong)
> logInfo(s"Code generated in $timeMs ms")
> result
>   }
> })
> {code}
>  In some specific situations, for example a long-running application whose 
> Spark tasks are unchanged, making the cache's maximumSize configurable would 
> be a better fit, as sketched below.
>  
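A minimal sketch of what the requested change could look like, assuming a hypothetical configuration key spark.sql.codegen.cache.maxEntries; this is not Spark's actual code:

{code:scala}
import com.google.common.cache.{CacheBuilder, CacheLoader}
import org.apache.spark.SparkConf

// Read the cache size from configuration instead of hard-coding 100.
val conf = new SparkConf()
val maxEntries = conf.getInt("spark.sql.codegen.cache.maxEntries", 100)

val cache = CacheBuilder.newBuilder()
  .maximumSize(maxEntries.toLong)
  .build(
    new CacheLoader[String, String]() {
      // Placeholder loader; Spark's real cache compiles the generated source code.
      override def load(code: String): String = code
    })
{code}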



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24342) Large Task prior scheduling to Reduce overall execution time

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24342.
--
Resolution: Incomplete

> Large Task prior scheduling to Reduce overall execution time
> 
>
> Key: SPARK-24342
> URL: https://issues.apache.org/jira/browse/SPARK-24342
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: gao
>Priority: Major
>  Labels: bulk-closed
> Attachments: tasktimespan.PNG
>
>
> When executing a set of concurrent tasks, scheduling the relatively large 
> (long-running) tasks before the small ones can shorten the overall 
> execution time.
> Therefore, Spark should add a new capability that schedules the large tasks 
> of a task set first and the small tasks afterwards (see the sketch below).
>    The duration of a task can be estimated from its historical execution 
> time: a task that ran for a long time in the past is likely to run for a long 
> time again. Record the last execution time together with the task's key in a 
> log file, and load that log file on the next run. Using RangePartitioner-style 
> partitioning to prioritize large tasks can then achieve this concurrent task 
> optimization.
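A minimal sketch of the idea, using hypothetical names (TaskInfo, the history map) purely for illustration; this is not an existing Spark API:

{code:scala}
// Order the tasks of a stage by their recorded historical duration, longest
// first, so long-running tasks start as early as possible.
final case class TaskInfo(key: String, partitionId: Int)

def scheduleLongestFirst(
    tasks: Seq[TaskInfo],
    historicalMillis: Map[String, Long]): Seq[TaskInfo] = {
  // Tasks with no recorded history are treated as short and scheduled last.
  tasks.sortBy(t => -historicalMillis.getOrElse(t.key, 0L))
}

// Example: the task that took 90s on the last run is scheduled before the 5s task.
val history = Map("taskA" -> 90000L, "taskB" -> 5000L)
val ordered = scheduleLongestFirst(
  Seq(TaskInfo("taskB", 0), TaskInfo("taskA", 1)), history)
// ordered == Seq(TaskInfo("taskA", 1), TaskInfo("taskB", 0))
{code}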



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24457) Performance improvement while converting stringToTimestamp in DateTimeUtils

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24457.
--
Resolution: Incomplete

> Performance improvement while converting stringToTimestamp in DateTimeUtils
> ---
>
> Key: SPARK-24457
> URL: https://issues.apache.org/jira/browse/SPARK-24457
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Sharad Sonker
>Priority: Minor
>  Labels: bulk-closed
>
> stringToTimestamp in DateTimeUtils creates a Calendar instance for each input 
> row even if the input timezone is the same. This can be improved by caching 
> the Calendar instance for each input timezone, as sketched below.
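A minimal sketch of such a cache; since java.util.Calendar is mutable and not thread-safe, the sketch keeps one cache per thread. Names are illustrative, not the actual DateTimeUtils implementation:

{code:scala}
import java.util.{Calendar, TimeZone}
import scala.collection.mutable

object CalendarCache {
  // One small map per thread: timezone id -> reusable Calendar instance.
  private val cache = new ThreadLocal[mutable.Map[String, Calendar]] {
    override def initialValue(): mutable.Map[String, Calendar] = mutable.Map.empty
  }

  // Callers should clear() the returned Calendar before setting fields for a new row.
  def get(timeZoneId: String): Calendar = {
    cache.get().getOrElseUpdate(timeZoneId,
      Calendar.getInstance(TimeZone.getTimeZone(timeZoneId)))
  }
}

// Repeated rows in the same timezone reuse the same instance on a given thread.
val c1 = CalendarCache.get("UTC")
val c2 = CalendarCache.get("UTC")
assert(c1 eq c2)
{code}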



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20074) Make buffer size in unsafe external sorter configurable

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-20074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-20074.
--
Resolution: Incomplete

> Make buffer size in unsafe external sorter configurable
> ---
>
> Key: SPARK-20074
> URL: https://issues.apache.org/jira/browse/SPARK-20074
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.1
>Reporter: Sital Kedia
>Priority: Major
>  Labels: bulk-closed
>
> Currently, it is hard-coded to 32 KB; see 
> https://github.com/sitalkedia/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java#L123
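A minimal sketch of how the buffer size could be made configurable, using a hypothetical key spark.unsafe.sorter.file.buffer; this is an illustration of the request, not existing Spark code:

{code:scala}
import org.apache.spark.SparkConf

// Read the spill-writer buffer size from configuration instead of the
// hard-coded 32 KB constant; the default preserves the current behaviour.
val conf = new SparkConf()
val fileBufferSizeBytes: Long =
  conf.getSizeAsKb("spark.unsafe.sorter.file.buffer", "32k") * 1024

// UnsafeExternalSorter would then pass fileBufferSizeBytes to its spill
// writers in place of the constant.
println(s"spill writer buffer: $fileBufferSizeBytes bytes")
{code}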



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25480) Dynamic partitioning + saveAsTable with multiple partition columns create empty directory

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25480.
--
Resolution: Incomplete

> Dynamic partitioning + saveAsTable with multiple partition columns create 
> empty directory
> -
>
> Key: SPARK-25480
> URL: https://issues.apache.org/jira/browse/SPARK-25480
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Daniel Mateus Pires
>Priority: Minor
>  Labels: bulk-closed
> Attachments: dynamic_partitioning.json
>
>
> We use .saveAsTable and dynamic partitioning as our only way to write data to 
> S3 from Spark.
> When only 1 partition column is defined for a table, .saveAsTable behaves as 
> expected:
>  - with Overwrite mode it will create a table if it doesn't exist and write 
> the data
>  - with Append mode it will append to a given partition
>  - with Overwrite mode if the table exists it will overwrite the partition
> If 2 partition columns are used, however, the directory is created on S3 with 
> the SUCCESS file, but no data is actually written.
> Our solution is to check whether the table exists and, if it does not, set 
> the partitioning mode back to static before running saveAsTable:
> {code}
> spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
> df.write.mode("overwrite").partitionBy("year", "month").option("path", 
> "s3://hbc-data-warehouse/integration/users_test").saveAsTable("users_test")
> {code}
>  
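A sketch of that workaround, reusing the table and path names from the snippet above; the source DataFrame is a placeholder:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("save-as-table-workaround").getOrCreate()
val df = spark.table("some_source")   // placeholder for the DataFrame being written

// Fall back to static partition overwrite for the very first write, when the
// table does not exist yet; keep dynamic mode for subsequent overwrites.
val overwriteMode =
  if (spark.catalog.tableExists("users_test")) "dynamic" else "static"
spark.conf.set("spark.sql.sources.partitionOverwriteMode", overwriteMode)

df.write
  .mode("overwrite")
  .partitionBy("year", "month")
  .option("path", "s3://hbc-data-warehouse/integration/users_test")
  .saveAsTable("users_test")
{code}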



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23994) Add Host To Blacklist If Shuffle Cannot Complete

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23994.
--
Resolution: Incomplete

> Add Host To Blacklist If Shuffle Cannot Complete
> 
>
> Key: SPARK-23994
> URL: https://issues.apache.org/jira/browse/SPARK-23994
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Shuffle
>Affects Versions: 2.3.0
>Reporter: David Mollitor
>Priority: Major
>  Labels: bulk-closed
>
> If a node cannot be reached for shuffling data, add the node to the blacklist 
> and retry the current stage (see the configuration sketch after the stack 
> trace below).
> {code:java}
> 2018-04-10 20:25:55,065 ERROR [Block Fetch Retry-3] 
> shuffle.RetryingBlockFetcher 
> (RetryingBlockFetcher.java:fetchAllOutstanding(142)) - Exception while 
> beginning fetch of 711 outstanding blocks (after 3 retries)
> java.io.IOException: Failed to connect to host.local/10.11.12.13:7337
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
>   at 
> org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:105)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.net.ConnectException: Connection refused: 
> host.local/10.11.12.13:7337
>   at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>   at 
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
>   at 
> io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
>   at 
> io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>   at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>   at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>   ... 1 more
> {code}
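For reference, Spark's existing blacklisting options come close to the requested behavior; a minimal configuration sketch, assuming Spark 2.3+ and with property names to be double-checked against the deployed version:

{code:scala}
import org.apache.spark.SparkConf

// Sketch only: enable blacklisting and, where the version supports it,
// application-level blacklisting of hosts on shuffle fetch failures.
val conf = new SparkConf()
  .set("spark.blacklist.enabled", "true")
  // Blacklist a node for a stage once enough executors on it have failed.
  .set("spark.blacklist.stage.maxFailedExecutorsPerNode", "2")
  // Blacklist a host for the whole application when a fetch failure occurs on it.
  .set("spark.blacklist.application.fetchFailure.enabled", "true")
{code}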



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24474) Cores are left idle when there are a lot of tasks to run

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24474.
--
Resolution: Incomplete

> Cores are left idle when there are a lot of tasks to run
> 
>
> Key: SPARK-24474
> URL: https://issues.apache.org/jira/browse/SPARK-24474
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.2.0
>Reporter: Al M
>Priority: Major
>  Labels: bulk-closed
>
> I've observed an issue happening consistently in the following situation (a 
> minimal reproduction sketch follows the observations below):
>  * A job contains a join of two datasets
>  * One dataset is much larger than the other
>  * Both datasets require some processing before they are joined
> What I have observed is:
>  * 2 stages are initially active to run processing on the two datasets
>  ** These stages are run in parallel
>  ** One stage has significantly more tasks than the other (e.g. one has 30k 
> tasks and the other has 2k tasks)
>  ** Spark allocates a similar (though not exactly equal) number of cores to 
> each stage
>  * First stage completes (for the smaller dataset)
>  ** Now there is only one stage running
>  ** It still has many tasks left (usually > 20k tasks)
>  ** Around half the cores are idle (e.g. Total Cores = 200, active tasks = 
> 103)
>  ** This continues until the second stage completes
>  * Second stage completes, and third begins (the stage that actually joins 
> the data)
>  ** This stage works fine, no cores are idle (e.g. Total Cores = 200, active 
> tasks = 200)
> Other interesting things about this:
>  * It seems that when we have multiple stages active, and one of them 
> finishes, it does not actually release any cores to existing stages
>  * Once all active stages are done, we release all cores to new stages
>  * I can't reproduce this locally on my machine, only on a cluster with YARN 
> enabled
>  * It happens when dynamic allocation is enabled, and when it is disabled
>  * The stage that hangs (referred to as "Second stage" above) has a lower 
> 'Stage Id' than the first one that completes
>  * This happens with spark.shuffle.service.enabled set to true and false
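A minimal reproduction sketch under the assumptions above; the sizes, key cardinalities, and column names are placeholders and are not taken from the original report:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("idle-cores-repro").getOrCreate()

// Small input: 2M rows with a little processing before the join.
val small = spark.range(2L * 1000 * 1000)
  .withColumn("k", col("id"))
  .withColumn("v1", (col("id") * 31 + 7) % 97)

// Large input: 300M rows, also processed before the join.
val large = spark.range(300L * 1000 * 1000)
  .withColumn("k", col("id") % (2L * 1000 * 1000))
  .withColumn("v2", (col("id") * 17 + 3) % 89)

// The join forces both upstream stages to run; watch active tasks vs. total
// cores in the UI after the smaller stage finishes.
println(small.join(large, "k").count())
{code}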



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25125) Spark SQL percentile_approx takes longer than Hive version for large datasets

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25125.
--
Resolution: Incomplete

> Spark SQL percentile_approx takes longer than Hive version for large datasets
> -
>
> Key: SPARK-25125
> URL: https://issues.apache.org/jira/browse/SPARK-25125
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Mir Ali
>Priority: Major
>  Labels: bulk-closed
>
> The percentile_approx function in Spark SQL takes much longer than the 
> previous Hive implementation for large data sets (7B rows grouped into 200k 
> buckets, with the percentile computed per bucket). Tested with Spark 2.3.1 vs 
> Spark 2.1.0.
> The code below finishes in around 24 minutes on Spark 2.1.0; on Spark 2.3.1 
> it does not finish at all in more than 2 hours. We also tried this with 
> different accuracy values (5000, 1000, 500): the timing does get better with 
> smaller datasets on the new version, but the speed difference is 
> insignificant (see also the accuracy sketch after the code below).
>  
> Infrastructure used:
> AWS EMR -> Spark 2.1.0
> vs
> AWS EMR  -> Spark 2.3.1
>  
> spark-shell --conf spark.driver.memory=12g --conf spark.executor.memory=10g 
> --conf spark.sql.shuffle.partitions=2000 --conf 
> spark.default.parallelism=2000 --num-executors=75 --executor-cores=2
> {code:java}
> import org.apache.spark.sql.functions._ 
> import org.apache.spark.sql.types._ 
> val df=spark.range(70L).withColumn("some_grouping_id", 
> round(rand()*20L).cast(LongType)) 
> df.createOrReplaceTempView("tab")   
> val percentile_query = """ select some_grouping_id, percentile_approx(id, 
> array(0,0.25,0.5,0.75,1)) from tab group by some_grouping_id """ 
> spark.sql(percentile_query).collect()
> {code}
>  
>  
>  
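For completeness, the accuracy value the reporter varied can also be passed directly as the optional third argument of percentile_approx; a small sketch continuing from the snippet above, with an illustrative accuracy of 1000:

{code:scala}
// Lower accuracy trades precision for speed; 1000 is an illustrative value.
val percentile_query_low_accuracy = """
  select some_grouping_id,
         percentile_approx(id, array(0, 0.25, 0.5, 0.75, 1), 1000)
  from tab
  group by some_grouping_id
"""
spark.sql(percentile_query_low_accuracy).collect()
{code}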



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


