[jira] [Created] (SPARK-29379) SHOW FUNCTIONS doesn't show '!=', '<>', 'between', 'case'
angerszhu created SPARK-29379: - Summary: SHOW FUNCTIONS doesn't show '!=', '<>', 'between', 'case' Key: SPARK-29379 URL: https://issues.apache.org/jira/browse/SPARK-29379 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0, 3.0.0 Reporter: angerszhu SHOW FUNCTIONS doesn't show '!=', '<>', 'between', 'case' -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
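For illustration, a minimal sketch of how the gap described in SPARK-29379 can be observed, assuming a spark-shell session where `spark` is in scope (the check itself is not from the ticket):

{code:scala}
// Collect the registered function names and look for the operators named in the summary.
val registered = spark.sql("SHOW FUNCTIONS").collect().map(_.getString(0)).toSet
Seq("!=", "<>", "between", "case").foreach { op =>
  println(s"$op listed: ${registered.contains(op)}")   // prints false on affected versions
}
{code}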
[jira] [Commented] (SPARK-24640) size(null) returns null
[ https://issues.apache.org/jira/browse/SPARK-24640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16946503#comment-16946503 ] Maxim Gekk commented on SPARK-24640: As far as I remember we planned to remove spark.sql.legacy.sizeOfNull in 3.0. [~hyukjin.kwon] [~smilegator] This ticket is a reminder of that. > size(null) returns null > > > Key: SPARK-24640 > URL: https://issues.apache.org/jira/browse/SPARK-24640 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Xiao Li >Priority: Major > Labels: api, bulk-closed > > Size(null) should return null instead of -1 in 3.0 release. This is a > behavior change. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
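For context, a minimal sketch of how the legacy flag mentioned in the comment changes the result of size(null), assuming a spark-shell session; per-version defaults are not shown here:

{code:scala}
// With the legacy behaviour size(null) returns -1; with it disabled it returns null.
spark.conf.set("spark.sql.legacy.sizeOfNull", "true")
spark.sql("SELECT size(CAST(NULL AS ARRAY<INT>))").show()   // -1
spark.conf.set("spark.sql.legacy.sizeOfNull", "false")
spark.sql("SELECT size(CAST(NULL AS ARRAY<INT>))").show()   // null
{code}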
[jira] [Resolved] (SPARK-25008) Add memory mode info to showMemoryUsage in TaskMemoryManager
[ https://issues.apache.org/jira/browse/SPARK-25008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-25008. -- Resolution: Incomplete > Add memory mode info to showMemoryUsage in TaskMemoryManager > > > Key: SPARK-25008 > URL: https://issues.apache.org/jira/browse/SPARK-25008 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Ankur Gupta >Priority: Major > Labels: bulk-closed > > TaskMemoryManager prints the current memory usage information before throwing > an OOM exception which is helpful in debugging issues. This log does not have > the memory mode information which can be also useful to quickly determine > which memory users need to increase. > This JIRA is to add that information to showMemoryUsage method of > TaskMemoryManager. > Current logs: > {code} > 18/07/03 17:57:16 INFO memory.TaskMemoryManager: Memory used in task 318 > 18/07/03 17:57:16 INFO memory.TaskMemoryManager: Acquired by > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@7f084d1b: > 1024.0 KB > 18/07/03 17:57:16 INFO memory.TaskMemoryManager: Acquired by > org.apache.spark.shuffle.sort.ShuffleExternalSorter@713d50f2: 32.0 KB > 18/07/03 17:57:16 INFO memory.TaskMemoryManager: 0 bytes of memory were used > by task 318 but are not associated with specific consumers > 18/07/03 17:57:16 INFO memory.TaskMemoryManager: 1081344 bytes of memory are > used for execution and 306201016 bytes of memory are used for storage > 18/07/03 17:57:16 ERROR executor.Executor: Exception in task 86.0 in stage > 49.0 (TID 318) > java.lang.OutOfMemoryError: Unable to acquire 326284160 bytes of memory, got > 3112960 > at > org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:127) > at > org.apache.spark.shuffle.sort.ShuffleExternalSorter.acquireNewPageIfNecessary(ShuffleExternalSorter.java:359) > at > org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:382) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:246) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:167) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25230) Upper behavior incorrect for strings containing "ß"
[ https://issues.apache.org/jira/browse/SPARK-25230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-25230. -- Resolution: Incomplete > Upper behavior incorrect for strings containing "ß" > > > Key: SPARK-25230 > URL: https://issues.apache.org/jira/browse/SPARK-25230 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Yuming Wang >Priority: Major > Labels: bulk-closed > Attachments: MySQL.png, Oracle.png, Teradata.jpeg > > > How to reproduce: > {code:sql} > spark-sql> SELECT upper('Haßler'); > HASSLER > {code} > Mainstream databases return {{HAßLER}}. > !MySQL.png! > > This behavior may lead to data inconsistency: > {code:sql} > create temporary view SPARK_25230 as select * from values > ("Hassler"), > ("Haßler") > as EMPLOYEE(name); > select UPPER(name) from SPARK_25230 group by 1; > -- result > HASSLER{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
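For background on SPARK-25230, a small plain-Scala sketch of the JVM Unicode case mapping that produces this result; whether Spark should follow it is exactly what the ticket questions:

{code:scala}
// The JVM's full Unicode case mapping expands ß to SS, so the two names collapse to one value.
println("Haßler".toUpperCase)                            // HASSLER
println("Haßler".toUpperCase == "Hassler".toUpperCase)   // true, hence the single group in the repro
{code}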
[jira] [Resolved] (SPARK-24074) Maven package resolver downloads javadoc instead of jar
[ https://issues.apache.org/jira/browse/SPARK-24074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24074. -- Resolution: Incomplete > Maven package resolver downloads javadoc instead of jar > --- > > Key: SPARK-24074 > URL: https://issues.apache.org/jira/browse/SPARK-24074 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.3.0 >Reporter: Nadav Samet >Priority: Major > Labels: bulk-closed > > {code:java} > // code placeholder > {code} > For some reason spark downloads a javadoc artifact of a package instead of > the jar. > Steps to reproduce: > # Delete (or move) your local ~/.ivy2 cache to force Spark to resolve and > fetch artifacts from central: > {code:java} > rm -rf ~/.ivy2 > {code} > 1. Run: > {code:java} > ~/dev/spark-2.3.0-bin-hadoop2.7/bin/spark-shell --packages > org.scalanlp:breeze_2.11:0.13.2{code} > 2. Spark would download the javadoc instead of the jar: > {code:java} > downloading > https://repo1.maven.org/maven2/net/sourceforge/f2j/arpack_combined_all/0.1/arpack_combined_all-0.1-javadoc.jar > ... > [SUCCESSFUL ] > net.sourceforge.f2j#arpack_combined_all;0.1!arpack_combined_all.jar > (610ms){code} > 3. Later spark would complain that it couldn't find the jar: > {code:java} > Warning: Local jar > /Users/thesamet/.ivy2/jars/net.sourceforge.f2j_arpack_combined_all-0.1.jar > does not exist, skipping. > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel).{code} > 4. The dependency of breeze on f2j_arpack_combined seems fine: > [http://central.maven.org/maven2/org/scalanlp/breeze_2.11/0.13.2/breeze_2.11-0.13.2.pom] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25165) Cannot parse Hive Struct
[ https://issues.apache.org/jira/browse/SPARK-25165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-25165. -- Resolution: Incomplete > Cannot parse Hive Struct > > > Key: SPARK-25165 > URL: https://issues.apache.org/jira/browse/SPARK-25165 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.3.1 >Reporter: Frank Yin >Priority: Major > Labels: bulk-closed > > org.apache.spark.SparkException: Cannot recognize hive type string: > struct,view.b:array> > > My guess is that the dot (.) is causing issues for parsing. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22132) Document the Dispatcher REST API
[ https://issues.apache.org/jira/browse/SPARK-22132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-22132. -- Resolution: Incomplete > Document the Dispatcher REST API > > > Key: SPARK-22132 > URL: https://issues.apache.org/jira/browse/SPARK-22132 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 2.3.0 >Reporter: Arthur Rand >Priority: Minor > Labels: bulk-closed > > The Dispatcher has a REST API for managing jobs in a Mesos cluster but it's > currently undocumented, meaning that users have to reference the source code > for programmatic access. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24838) Support uncorrelated IN/EXISTS subqueries for more operators
[ https://issues.apache.org/jira/browse/SPARK-24838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24838. -- Resolution: Incomplete > Support uncorrelated IN/EXISTS subqueries for more operators > - > > Key: SPARK-24838 > URL: https://issues.apache.org/jira/browse/SPARK-24838 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Qifan Pu >Priority: Major > Labels: bulk-closed > > Currently, CheckAnalysis allows IN/EXISTS subquery only for filter operators. > Running a query: > {{select name in (select * from valid_names)}} > {{from all_names}} > returns error: > {code:java} > Error in SQL statement: AnalysisException: IN/EXISTS predicate sub-queries > can only be used in a Filter > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24524) Improve aggregateMetrics: less memory usage and loops
[ https://issues.apache.org/jira/browse/SPARK-24524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24524. -- Resolution: Incomplete > Improve aggregateMetrics: less memory usage and loops > - > > Key: SPARK-24524 > URL: https://issues.apache.org/jira/browse/SPARK-24524 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Gengliang Wang >Priority: Major > Labels: bulk-closed > > The function `aggregateMetrics` processes metrics from both executors and the > driver. The data can be large. > This PR is to improve the implementation with one loop (before converting to > string) and one dynamic data structure. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10413) ML models should support prediction on single instances
[ https://issues.apache.org/jira/browse/SPARK-10413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-10413. -- Resolution: Incomplete > ML models should support prediction on single instances > --- > > Key: SPARK-10413 > URL: https://issues.apache.org/jira/browse/SPARK-10413 > Project: Spark > Issue Type: Umbrella > Components: ML >Reporter: Xiangrui Meng >Priority: Critical > Labels: bulk-closed > > Currently models in the pipeline API only implement transform(DataFrame). It > would be quite useful to support prediction on a single instance. > UPDATE: This issue is for making predictions with single models. We can make > methods like {{def predict(features: Vector): Double}} public. > * This issue is *not* for single-instance prediction for full Pipelines, > which would require making predictions on {{Row}}s. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
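For clarity, a sketch of the API shape SPARK-10413 asks to expose; the trait and its name are hypothetical, only the quoted signature comes from the issue:

{code:scala}
import org.apache.spark.ml.linalg.Vector

// Hypothetical shape of the single-instance prediction API discussed above.
trait SingleInstancePredictor {
  def predict(features: Vector): Double
}
{code}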
[jira] [Resolved] (SPARK-16203) regexp_extract to return an ArrayType(StringType())
[ https://issues.apache.org/jira/browse/SPARK-16203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-16203. -- Resolution: Incomplete > regexp_extract to return an ArrayType(StringType()) > --- > > Key: SPARK-16203 > URL: https://issues.apache.org/jira/browse/SPARK-16203 > Project: Spark > Issue Type: Improvement >Affects Versions: 2.0.0 >Reporter: Max Moroz >Priority: Minor > Labels: bulk-closed > > regexp_extract only returns a single matched group. If (as is often the case > - e.g., web log parsing) we need to parse the entire line and get all the > groups, we'll need to call it as many times as there are groups. > It's only a minor annoyance syntactically. > But unless I misunderstand something, it would be very inefficient. (How > would Spark know not to do multiple pattern matching operations, when only > one is needed? Or does the optimizer actually check whether the patterns are > identical, and if they are, avoid the repeated regex matching operations??) > Would it be possible to have it return an array when the index is not > specified (defaulting to None)? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
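For reference, a sketch of the per-group workaround SPARK-16203 wants to avoid, assuming a spark-shell session; the sample data, column name, and pattern are made up for illustration:

{code:scala}
import org.apache.spark.sql.functions.{col, regexp_extract}
import spark.implicits._

// Hypothetical one-column log DataFrame; today every group needs its own regexp_extract call.
val logs = Seq("host1 alice /index.html").toDF("line")
val pattern = "^(\\S+) (\\S+) (\\S+)$"
logs.select(
  regexp_extract(col("line"), pattern, 1).as("host"),
  regexp_extract(col("line"), pattern, 2).as("user"),
  regexp_extract(col("line"), pattern, 3).as("path")).show()
{code}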
[jira] [Resolved] (SPARK-23790) proxy-user failed connecting to a kerberos configured metastore
[ https://issues.apache.org/jira/browse/SPARK-23790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23790. -- Resolution: Incomplete > proxy-user failed connecting to a kerberos configured metastore > --- > > Key: SPARK-23790 > URL: https://issues.apache.org/jira/browse/SPARK-23790 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.3.0 >Reporter: Stavros Kontopoulos >Priority: Major > Labels: bulk-closed > > This appeared at a customer trying to integrate with a kerberized hdfs > cluster. > This can be easily fixed with the proposed fix > [here|https://github.com/apache/spark/pull/17333] and the problem was > reported first [here|https://issues.apache.org/jira/browse/SPARK-19995] for > yarn. > The other option is to add the delegation tokens to the current user's UGI as > in [here|https://github.com/apache/spark/pull/17335]. The latter approach fixes the > problem but leads to a failure when someone uses a HadoopRDD, because HadoopRDD > uses FileInputFormat to get the splits, which calls the local ticket > cache by using TokenCache.obtainTokensForNamenodes. Eventually this will fail > with: > {quote}Exception in thread "main" > org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token > can be issued only with kerberos or web authentication at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:5896) > {quote} > This implies that the security mode is SIMPLE and the hadoop libs there are not aware > of kerberos. > This is related to this issue; the workaround decided on was to > [trick|https://github.com/apache/spark/blob/a33655348c4066d9c1d8ad2055aadfbc892ba7fd/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L795-L804] > hadoop. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19609) Broadcast joins should pushdown join constraints as Filter to the larger relation
[ https://issues.apache.org/jira/browse/SPARK-19609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-19609. -- Resolution: Incomplete > Broadcast joins should pushdown join constraints as Filter to the larger > relation > - > > Key: SPARK-19609 > URL: https://issues.apache.org/jira/browse/SPARK-19609 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Nick Dimiduk >Priority: Major > Labels: bulk-closed > > For broadcast inner-joins, where the smaller relation is known to be small > enough to materialize on a worker, the set of values for all join columns is > known and fits in memory. Spark should translate these values into a > {{Filter}} pushed down to the datasource. The common join condition of > equality, i.e. {{lhs.a == rhs.a}}, can be written as an {{a in ...}} clause. > An example of pushing such filters is already present in the form of > {{IsNotNull}} filters via [~sameerag]'s work on SPARK-12957 subtasks. > This optimization could even work when the smaller relation does not fit > entirely in memory. This could be done by partitioning the smaller relation > into N pieces, applying this predicate pushdown for each piece, and unioning > the results. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
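For illustration, a hand-rolled sketch of the pushdown SPARK-19609 proposes the optimizer do automatically, assuming a spark-shell session; the relations and the join column `a` are hypothetical:

{code:scala}
import org.apache.spark.sql.functions.col
import spark.implicits._

// Hypothetical small and large relations joined on column "a".
val small = Seq(1, 2, 3).toDF("a")
val large = Seq((1, "x"), (4, "y")).toDF("a", "b")

// Collect the small side's join keys and push them down as an IN filter before the join.
val keys = small.select("a").distinct().collect().map(_.get(0))
large.filter(col("a").isin(keys: _*)).join(small, "a").show()
{code}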
[jira] [Resolved] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-5556. - Resolution: Incomplete > Latent Dirichlet Allocation (LDA) using Gibbs sampler > -- > > Key: SPARK-5556 > URL: https://issues.apache.org/jira/browse/SPARK-5556 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Guoqiang Li >Assignee: Pedro Rodriguez >Priority: Major > Labels: bulk-closed > Attachments: LDA_test.xlsx, spark-summit.pptx > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23498) Accuracy problem in comparison with string and integer
[ https://issues.apache.org/jira/browse/SPARK-23498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23498. -- Resolution: Incomplete > Accuracy problem in comparison with string and integer > -- > > Key: SPARK-23498 > URL: https://issues.apache.org/jira/browse/SPARK-23498 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0, 2.2.1, 2.3.0 >Reporter: Kevin Zhang >Priority: Major > Labels: bulk-closed > > While comparing a string column with an integer value, spark sql will > automatically cast the string operand to int; the following sql will return > true in hive but false in spark > > {code:java} > select '1000.1'>1000 > {code} > > from the physical plan we can see the string operand was cast to int, which > caused the accuracy loss > {code:java} > *Project [false AS (CAST(1000.1 AS INT) > 1000)#4] > +- Scan OneRowRelation[] > {code} > To solve it, using a wider common type like double to cast both operands of > a binary operator may be safe. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
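For reference, a minimal repro of SPARK-23498 plus the wider-type workaround the description suggests, assuming a spark-shell session:

{code:scala}
// On affected versions the string operand is cast to int, so the fractional part is lost.
spark.sql("SELECT '1000.1' > 1000").show()                 // false
// Casting explicitly to a wider common type keeps the comparison accurate.
spark.sql("SELECT CAST('1000.1' AS DOUBLE) > 1000").show() // true
{code}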
[jira] [Resolved] (SPARK-24118) Support lineSep format independent from encoding
[ https://issues.apache.org/jira/browse/SPARK-24118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24118. -- Resolution: Incomplete > Support lineSep format independent from encoding > > > Key: SPARK-24118 > URL: https://issues.apache.org/jira/browse/SPARK-24118 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Maxim Gekk >Priority: Major > Labels: bulk-closed > > Currently, the lineSep option of the JSON datasource depends on the encoding. It is > impossible to define a correct lineSep for JSON files with a BOM in UTF-16 and > UTF-32 encoding, for example. Need to propose a format of lineSep which will > represent a sequence of octets (bytes) and will be independent of the encoding. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21016) Improve code fault tolerance for converting string to number
[ https://issues.apache.org/jira/browse/SPARK-21016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-21016. -- Resolution: Incomplete > Improve code fault tolerance for converting string to number > > > Key: SPARK-21016 > URL: https://issues.apache.org/jira/browse/SPARK-21016 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: liuxian >Priority: Minor > Labels: bulk-closed > > When converting a string to a number (int, long or double), a space before or > after the value is not easy to detect, especially in user configuration. > For example > conf.setConfString(key, " 20") // has a space in the string " 20" > conf.getConf(confEntry, 5) // This statement fails -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
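For illustration of SPARK-21016's failure mode, a plain-Scala sketch of the underlying string conversion (the config API itself is not shown):

{code:scala}
import scala.util.Try

// A leading space makes the conversion fail instead of parsing the number.
println(Try(" 20".toInt))       // Failure(java.lang.NumberFormatException: For input string: " 20")
println(Try(" 20".trim.toInt))  // Success(20); trimming first avoids the failure
{code}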
[jira] [Resolved] (SPARK-24221) Retry spark app submission to k8 in KubernetesClientApplication
[ https://issues.apache.org/jira/browse/SPARK-24221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24221. -- Resolution: Incomplete > Retry spark app submission to k8 in KubernetesClientApplication > --- > > Key: SPARK-24221 > URL: https://issues.apache.org/jira/browse/SPARK-24221 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Yifei Huang >Priority: Major > Labels: bulk-closed > > Following from https://issues.apache.org/jira/browse/SPARK-24135, drivers, in > addition to executors, could suffer from init-container failures in > Kubernetes. Currently, we fail the entire application if that's the case, so > it's up to the client to detect those errors and retry. However, since both > driver and executor initialization have the same failure case, it seems like > we're repeating logic in two places. Would it be better to consolidate this > retry logic in `KubernetesClientApplication`? > We could still count executor pod initialization failures in > `KubernetesClusterSchedulerBackend` and decide what to do with the > application if there are too many failures there, but we'd be guaranteed a > set number of retries for each executor before giving up. Or would this be > too confusing and obfuscate the true number of retries? We could also > configure the number of driver and executor retries separately. It just seems > like if we're tackling init-container failure retries for executors, we > should also provide support for drivers as well since they suffer from the > same problem. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12014) Spark SQL query containing semicolon is broken in Beeline (related to HIVE-11100)
[ https://issues.apache.org/jira/browse/SPARK-12014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-12014. -- Resolution: Incomplete > Spark SQL query containing semicolon is broken in Beeline (related to > HIVE-11100) > - > > Key: SPARK-12014 > URL: https://issues.apache.org/jira/browse/SPARK-12014 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: Teng Qiu >Priority: Minor > Labels: bulk-closed > > Actually it is a known Hive issue: > https://issues.apache.org/jira/browse/HIVE-11100 > patch available: https://reviews.apache.org/r/35907/diff/1 > but since Spark uses its own maven dependencies for hive (org.spark-project.hive), > we cannot use this patch to fix the problem; it would be better if you could > fix this in spark's hive package. > In spark's beeline, the error message will be: > {code} > 0: jdbc:hive2://host:1/> CREATE TABLE beeline_tb (c1 int, c2 string) ROW > FORMAT DELIMITED FIELDS TERMINATED BY ';' LINES TERMINATED BY '\n'; > Error: org.apache.spark.sql.AnalysisException: mismatched input '' > expecting StringLiteral near 'BY' in table row format's field separator; line > 1 pos 87 (state=,code=0) > 0: jdbc:hive2://host:1/> CREATE TABLE beeline_tb (c1 int, c2 string) ROW > FORMAT DELIMITED FIELDS TERMINATED BY '\;' LINES TERMINATED BY '\n'; > Error: org.apache.spark.sql.AnalysisException: mismatched input '' > expecting StringLiteral near 'BY' in table row format's field separator; line > 1 pos 88 (state=,code=0) > 0: jdbc:hive2://host:1/> SELECT > str_to_map(other_data,';','=')['key_name'] FROM some_logs WHERE log_date = > '20151125' limit 5; > Error: org.apache.spark.sql.AnalysisException: cannot recognize input near > '' '' '' in select expression; line 1 pos 30 (state=,code=0) > 0: jdbc:hive2://host:1/> SELECT > str_to_map(other_data,'\;','=')['key_name'] FROM some_logs WHERE log_date = > '20151125' limit 5; > Error: org.apache.spark.sql.AnalysisException: cannot recognize input near > '' '' '' in select expression; line 1 pos 31 (state=,code=0) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24051) Incorrect results for certain queries using Java and Python APIs on Spark 2.3.0
[ https://issues.apache.org/jira/browse/SPARK-24051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24051. -- Resolution: Incomplete > Incorrect results for certain queries using Java and Python APIs on Spark > 2.3.0 > --- > > Key: SPARK-24051 > URL: https://issues.apache.org/jira/browse/SPARK-24051 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Emlyn Corrin >Priority: Major > Labels: bulk-closed > > I'm seeing Spark 2.3.0 return incorrect results for a certain (very specific) > query, demonstrated by the Java program below. It was simplified from a much > more complex query, but I'm having trouble simplifying it further without > removing the erroneous behaviour. > {code:java} > package sparktest; > import org.apache.spark.SparkConf; > import org.apache.spark.sql.*; > import org.apache.spark.sql.expressions.Window; > import org.apache.spark.sql.types.DataTypes; > import org.apache.spark.sql.types.Metadata; > import org.apache.spark.sql.types.StructField; > import org.apache.spark.sql.types.StructType; > import java.util.Arrays; > public class Main { > public static void main(String[] args) { > SparkConf conf = new SparkConf() > .setAppName("SparkTest") > .setMaster("local[*]"); > SparkSession session = > SparkSession.builder().config(conf).getOrCreate(); > Row[] arr1 = new Row[]{ > RowFactory.create(1, 42), > RowFactory.create(2, 99)}; > StructType sch1 = new StructType(new StructField[]{ > new StructField("a", DataTypes.IntegerType, true, > Metadata.empty()), > new StructField("b", DataTypes.IntegerType, true, > Metadata.empty())}); > Dataset ds1 = session.createDataFrame(Arrays.asList(arr1), sch1); > ds1.show(); > Row[] arr2 = new Row[]{ > RowFactory.create(3)}; > StructType sch2 = new StructType(new StructField[]{ > new StructField("a", DataTypes.IntegerType, true, > Metadata.empty())}); > Dataset ds2 = session.createDataFrame(Arrays.asList(arr2), sch2) > .withColumn("b", functions.lit(0)); > ds2.show(); > Column[] cols = new Column[]{ > new Column("a"), > new Column("b").as("b"), > functions.count(functions.lit(1)) > .over(Window.partitionBy()) > .as("n")}; > Dataset ds = ds1 > .select(cols) > .union(ds2.select(cols)) > .where(new Column("n").geq(1)) > .drop("n"); > ds.show(); > //ds.explain(true); > } > } > {code} > It just calculates the union of 2 datasets, > {code:java} > +---+---+ > | a| b| > +---+---+ > | 1| 42| > | 2| 99| > +---+---+ > {code} > with > {code:java} > +---+---+ > | a| b| > +---+---+ > | 3| 0| > +---+---+ > {code} > The expected result is: > {code:java} > +---+---+ > | a| b| > +---+---+ > | 1| 42| > | 2| 99| > | 3| 0| > +---+---+ > {code} > but instead it prints: > {code:java} > +---+---+ > | a| b| > +---+---+ > | 1| 0| > | 2| 0| > | 3| 0| > +---+---+ > {code} > notice how the value in column c is always zero, overriding the original > values in rows 1 and 2. > Making seemingly trivial changes, such as replacing {{new > Column("b").as("b"),}} with just {{new Column("b"),}} or removing the > {{where}} clause after the union, make it behave correctly again. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21122) Address starvation issues when dynamic allocation is enabled
[ https://issues.apache.org/jira/browse/SPARK-21122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-21122. -- Resolution: Incomplete > Address starvation issues when dynamic allocation is enabled > > > Key: SPARK-21122 > URL: https://issues.apache.org/jira/browse/SPARK-21122 > Project: Spark > Issue Type: Improvement > Components: Spark Core, YARN >Affects Versions: 2.2.0, 2.3.0 >Reporter: Craig Ingram >Priority: Major > Labels: bulk-closed > Attachments: Preventing Starvation with Dynamic Allocation Enabled.pdf > > > When dynamic resource allocation is enabled on a cluster, it’s currently > possible for one application to consume all the cluster’s resources, > effectively starving any other application trying to start. This is > particularly painful in a notebook environment where notebooks may be idle > for tens of minutes while the user is figuring out what to do next (or eating > their lunch). Ideally the application should give resources back to the > cluster when monitoring indicates other applications are pending. > Before delving into the specifics of the solution, there are some workarounds > to this problem that are worth mentioning: > * Set spark.dynamicAllocation.maxExecutors to a small value, so that users > are unlikely to use the entire cluster even when many of them are doing work. > This approach will hurt cluster utilization. > * If using YARN, enable preemption and have each application (or > organization) run in a separate queue. The downside of this is that when YARN > preempts, it doesn't know anything about which executor it's killing. It > would just as likely kill a long running executor with cached data as one > that just spun up. Moreover, given a feature like > https://issues.apache.org/jira/browse/SPARK-21097 (preserving cached data on > executor decommission), YARN may not wait long enough between trying to > gracefully and forcefully shut down the executor. This would mean the blocks > that belonged to that executor would be lost and have to be recomputed. > * Configure YARN to use the capacity scheduler with multiple scheduler > queues. Put high-priority notebook users into a high-priority queue. Prevents > high-priority users from being starved out by low-priority notebook users. > Does not prevent users in the same priority class from starving each other. > Obviously any solution to this problem that depends on YARN would leave other > resource managers out in the cold. The solution proposed in this ticket will > afford spark clusters the flexibility to hook in different resource allocation > policies to fulfill their users' needs regardless of resource manager choice. > Initially the focus will be on users in a notebook environment. When > operating in a notebook environment with many users, the goal is fair > resource allocation. Given that all users will be using the same memory > configuration, this solution will focus primarily on fair sharing of cores. > The fair resource allocation policy should pick executors to remove based on > three factors initially: idleness, presence of cached data, and uptime. The > policy will favor removing executors that are idle, short-lived, and have no > cached data. The policy will only preemptively remove executors if there are > pending applications or cores (otherwise the default dynamic allocation > timeout/removal process is followed). The policy will also allow an > application's resource consumption to expand based on cluster utilization.
> For example, if there are 3 applications running but 2 of them are idle, the > policy will allow a busy application with pending tasks to consume more than > 1/3rd of the cluster's resources. > More complexity could be added to take advantage of task/stage metrics, > histograms, and heuristics (e.g. favor removing executors running tasks that > are quick). The important thing here is to benchmark effectively before > adding complexity so we can measure the impact of the changes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20624) Add better handling for node shutdown
[ https://issues.apache.org/jira/browse/SPARK-20624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-20624. -- Resolution: Incomplete > Add better handling for node shutdown > - > > Key: SPARK-20624 > URL: https://issues.apache.org/jira/browse/SPARK-20624 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0, 2.3.0 >Reporter: Holden Karau >Priority: Minor > Labels: bulk-closed > > While we've done some good work with better handling when Spark is choosing > to decommission nodes (SPARK-7955), it might make sense in environments where > we get preempted without our own choice (e.g. YARN over-commit, EC2 spot > instances, GCE preemptible instances, etc.) to do something for the data on > the node (or at least not schedule any new tasks). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22963) Make failure recovery global and automatic for continuous processing.
[ https://issues.apache.org/jira/browse/SPARK-22963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-22963. -- Resolution: Incomplete > Make failure recovery global and automatic for continuous processing. > - > > Key: SPARK-22963 > URL: https://issues.apache.org/jira/browse/SPARK-22963 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Jose Torres >Priority: Major > Labels: bulk-closed > > Spark native task restarts don't work well for continuous processing. They > will process all data from the task's original start offset - even data which > has already been committed. This is not semantically incorrect under at least > once semantics, but it's awkward and bad. > Fortunately, they're also not necessary; the central coordinator can restart > every task from the checkpointed offsets without losing much. So we should > make that happen automatically on task failures instead. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24733) Dataframe saved to parquet can have different metadata than the resulting parquet file
[ https://issues.apache.org/jira/browse/SPARK-24733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24733. -- Resolution: Incomplete > Dataframe saved to parquet can have different metadata than the resulting > parquet file > -- > > Key: SPARK-24733 > URL: https://issues.apache.org/jira/browse/SPARK-24733 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: David Herskovics >Priority: Minor > Labels: bulk-closed > > See the repro using spark-shell below: > Let's say that we have a dataframe called *df_with_metadata* which has a column > *name* with metadata. > > {code:scala} > scala> df_with_metadata.schema.json // Check that we have the metadata here. > scala> df_with_metadata.createOrReplaceTempView("input") > scala> val df2 = spark.sql("select case when true then name else null end as > name from input") > scala> df2.schema.json // We don't have the metadata anymore. > scala> df2.write.parquet("no_metadata_expected") > scala> val df3 = spark.read.parquet("no_metadata_expected") > scala> df3.schema.json // And the metadata is there again so the > no_metadata_expected does have metadata. > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21812) PySpark ML Models should not depend on transferring params from Java
[ https://issues.apache.org/jira/browse/SPARK-21812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-21812. -- Resolution: Incomplete > PySpark ML Models should not depend on transferring params from Java > > > Key: SPARK-21812 > URL: https://issues.apache.org/jira/browse/SPARK-21812 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: Holden Karau >Priority: Major > Labels: bulk-closed > > After SPARK-10931 we should fix this so that the Python parameters are > explicitly defined instead of relying on copying them from Java. This can be > done in batches of models as sub issues since the number of params to be > updated could be quite large. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24845) spark distribution generate exception while locally worked correctly
[ https://issues.apache.org/jira/browse/SPARK-24845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24845. -- Resolution: Incomplete > spark distribution generate exception while locally worked correctly > > > Key: SPARK-24845 > URL: https://issues.apache.org/jira/browse/SPARK-24845 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 2.1.3 > Environment: _I set spark.driver.extraClassPath_ and > _spark.executor.extraClassPath_ environment per machine in > spark-defaults.conf file as: > {{/opt/spark/jars/*:/opt/hbase/lib/commons-collections-3.2.2.jar:/opt/hbase/lib/commons-httpclient-3.1.jar:/opt/hbase/lib/findbugs-annotations-1.3.9-1.jar:/opt/hbase/lib/hbase-annotations-1.2.6.jar:/opt/hbase/lib/hbase-annotations-1.2.6-tests.jar:/opt/hbase/lib/hbase-client-1.2.6.jar:/opt/hbase/lib/hbase-common-1.2.6.jar:/opt/hbase/lib/hbase-common-1.2.6-tests.jar:/opt/hbase/lib/hbase-examples-1.2.6.jar:/opt/hbase/lib/hbase-external-blockcache-1.2.6.jar:/opt/hbase/lib/hbase-hadoop2-compat-1.2.6.jar:/opt/hbase/lib/hbase-hadoop-compat-1.2.6.jar:/opt/hbase/lib/hbase-it-1.2.6.jar:/opt/hbase/lib/hbase-it-1.2.6-tests.jar:/opt/hbase/lib/hbase-prefix-tree-1.2.6.jar:/opt/hbase/lib/hbase-procedure-1.2.6.jar:/opt/hbase/lib/hbase-protocol-1.2.6.jar:/opt/hbase/lib/hbase-resource-bundle-1.2.6.jar:/opt/hbase/lib/hbase-rest-1.2.6.jar:/opt/hbase/lib/hbase-server-1.2.6.jar:/opt/hbase/lib/hbase-server-1.2.6-tests.jar:/opt/hbase/lib/hbase-shell-1.2.6.jar:/opt/hbase/lib/hbase-thrift-1.2.6.jar:/opt/hbase/lib/jetty-util-6.1.26.jar:/opt/hbase/lib/ruby/hbase:/opt/hbase/lib/ruby/hbase/hbase.rb:/opt/hbase/lib/ruby/hbase.rb:/opt/hbase/lib/protobuf-java-2.5.0.jar:/opt/hbase/lib/metrics-core-2.2.0.jar:/opt/hbase/lib/htrace-core-3.1.0-incubating.jar:/opt/hbase/lib/guava-12.0.1.jar:/opt/hbase/lib/asm-3.1.jar:/opt/hbase/lib/Cdrpackage.jar:/opt/hbase/lib/commons-daemon-1.0.13.jar:/opt/hbase/lib/commons-el-1.0.jar:/opt/hbase/lib/commons-math-2.2.jar:/opt/hbase/lib/disruptor-3.3.0.jar:/opt/hbase/lib/jamon-runtime-2.4.1.jar:/opt/hbase/lib/jasper-compiler-5.5.23.jar:/opt/hbase/lib/jasper-runtime-5.5.23.jar:/opt/hbase/lib/jaxb-impl-2.2.3-1.jar:/opt/hbase/lib/jcodings-1.0.8.jar:/opt/hbase/lib/jersey-core-1.9.jar:/opt/hbase/lib/jersey-guice-1.9.jar:/opt/hbase/lib/jersey-json-1.9.jar:/opt/hbase/lib/jettison-1.3.3.jar:/opt/hbase/lib/jetty-sslengine-6.1.26.jar:/opt/hbase/lib/joni-2.1.2.jar:/opt/hbase/lib/jruby-complete-1.6.8.jar:/opt/hbase/lib/jsch-0.1.42.jar:/opt/hbase/lib/jsp-2.1-6.1.14.jar:/opt/hbase/lib/junit-4.12.jar:/opt/hbase/lib/servlet-api-2.5-6.1.14.jar:/opt/hbase/lib/servlet-api-2.5.jar:/opt/hbase/lib/spymemcached-2.11.6.jar:/opt/hive-hbase//opt/hive-hbase/hive-hbase-handler-2.0.1.jar}} >Reporter: Hossein Vatani >Priority: Major > Labels: bulk-closed > Original Estimate: 1h > Remaining Estimate: 1h > > we tried to read HBase table data with a distributed spark on three servers. 
> OS: ubuntu 14.04 > hadoop 2.7.3 > hbase 1.2.6 > first I lunch spark shell with +spark-shell --master spark://master:7077+ > command and run: > _{color:#707070}import org.apache.hadoop.hbase.util.Bytes > import org.apache.hadoop.hbase.client.{HBaseAdmin, Result, Put, HTable} > import org.apache.hadoop.hbase.{ HBaseConfiguration, HTableDescriptor, > HColumnDescriptor } > import org.apache.hadoop.hbase.mapreduce.TableInputFormat > import org.apache.hadoop.hbase.io.ImmutableBytesWritable > import org.apache.hadoop.hbase.client.TableDescriptor > import org.apache.spark._ > import org.apache.spark.rdd.NewHadoopRDD > import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor} > import org.apache.hadoop.hbase.client.HBaseAdmin > import org.apache.hadoop.hbase.mapreduce.TableInputFormat > import org.apache.hadoop.fs.Path; > import org.apache.hadoop.hbase.HColumnDescriptor > import org.apache.hadoop.hbase.util.Bytes > import org.apache.hadoop.hbase.client.Put; > import org.apache.hadoop.hbase.client.HTable; > import org.apache.hadoop.conf.Configuration > import scala.collection.JavaConverters._ > val conf = HBaseConfiguration.create() > val tablename = "default:Table1" > conf.set(TableInputFormat.INPUT_TABLE,tablename) > val admin = new HBaseAdmin(conf) > admin.isTableAvailable(tablename) <-- it return true, it > val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], > classOf[ImmutableBytesWritable], classOf[Result]) > hBaseRDD.count(){color}_ > and it generated below: > *{color:#f79232}java.lang.IllegalStateException: unread block data > at > java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2776) > at
[jira] [Resolved] (SPARK-22780) make insert commands have real children to fix UI issues
[ https://issues.apache.org/jira/browse/SPARK-22780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-22780. -- Resolution: Incomplete > make insert commands have real children to fix UI issues > > > Key: SPARK-22780 > URL: https://issues.apache.org/jira/browse/SPARK-22780 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Labels: bulk-closed > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24650) GroupingSet
[ https://issues.apache.org/jira/browse/SPARK-24650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24650. -- Resolution: Incomplete > GroupingSet > --- > > Key: SPARK-24650 > URL: https://issues.apache.org/jira/browse/SPARK-24650 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 > Environment: CDH 5.X, Spark 2.3 >Reporter: Mihir Sahu >Priority: Major > Labels: Grouping, Sets, bulk-closed > > If a grouping set is used in spark sql, then the plan does not perform > optimally. > If the input to a grouping set is x rows and the grouping sets have y groups, then > the number of rows that are processed is currently x*y rows. > Example: Let a Dataframe have col1, col2, col3 and col4 columns and the number > of rows be rowNo, > and the grouping sets consist of: (1) col1, col2, col3 (2) col2,col4 (3) col1,col2. > The number of rows processed in such a case is 3*(rowNo * size of each row). > However, is this the optimal way of processing data? > If the y groups are derivable from each other, can we reduce the volume of data > processed by removing columns as we progress to the lower dimensions of > processing? > Currently, while processing percentiles, a lot of data seems to be > processed, causing performance issues. > Need to look at whether this can be optimised -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
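For context, a sketch of a query of the shape described in SPARK-24650, assuming a spark-shell session; the table and column names are hypothetical:

{code:scala}
import spark.implicits._

// Hypothetical input with the four columns named in the description.
Seq((1, 2, 3, 4)).toDF("col1", "col2", "col3", "col4").createOrReplaceTempView("t")

// The three grouping sets from the example; each set is evaluated over an expanded copy of the input.
spark.sql("""
  SELECT col1, col2, col3, col4, count(*) AS cnt
  FROM t
  GROUP BY col1, col2, col3, col4
  GROUPING SETS ((col1, col2, col3), (col2, col4), (col1, col2))
""").show()
{code}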
[jira] [Resolved] (SPARK-24623) Hadoop - Spark Cluster - Python XGBoost - Not working in distributed mode
[ https://issues.apache.org/jira/browse/SPARK-24623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24623. -- Resolution: Incomplete > Hadoop - Spark Cluster - Python XGBoost - Not working in distributed mode > - > > Key: SPARK-24623 > URL: https://issues.apache.org/jira/browse/SPARK-24623 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.1.1 > Environment: Hadoop - Hortonworks Cluster > > Total Nodes - 18 > Worker Nodes - 13 >Reporter: Abhishek Reddy Chamakura >Priority: Major > Labels: bulk-closed > > Hi > We recently installed python on the Hadoop cluster with a lot of data science > python modules including xgboost, scipy, scikit-learn, pandas > Using pyspark the data scientists are able to test their scoring models in > the distributed mode on the Hadoop cluster. But with python - xgboost the > pyspark job is not getting distributed and it is trying to run only on one > instance. > We are trying to achieve the distributed mode when using python xgboost via > pyspark. > It would be a great help if you can direct me on how to achieve this. > Thanks, > Abhishek -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24265) lintr checks not failing PR build
[ https://issues.apache.org/jira/browse/SPARK-24265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24265. -- Resolution: Incomplete > lintr checks not failing PR build > - > > Key: SPARK-24265 > URL: https://issues.apache.org/jira/browse/SPARK-24265 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.3.0, 2.3.1 >Reporter: Felix Cheung >Priority: Major > Labels: bulk-closed > > a few lintr violations went through recently; need to check why they are not > flagged by the Jenkins build -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19498) Discussion: Making MLlib APIs extensible for 3rd party libraries
[ https://issues.apache.org/jira/browse/SPARK-19498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-19498. -- Resolution: Incomplete > Discussion: Making MLlib APIs extensible for 3rd party libraries > > > Key: SPARK-19498 > URL: https://issues.apache.org/jira/browse/SPARK-19498 > Project: Spark > Issue Type: Brainstorming > Components: ML >Affects Versions: 2.2.0 >Reporter: Joseph K. Bradley >Priority: Critical > Labels: bulk-closed > > Per the recent discussion on the dev list, this JIRA is for discussing how we > can make MLlib DataFrame-based APIs more extensible, especially for the > purpose of writing 3rd-party libraries with APIs extended from the MLlib APIs > (for custom Transformers, Estimators, etc.). > * For people who have written such libraries, what issues have you run into? > * What APIs are not public or extensible enough? Do they require changes > before being made more public? > * Are APIs for non-Scala languages such as Java and Python friendly or > extensive enough? > The easy answer is to make everything public, but that would be terrible of > course in the long-term. Let's discuss what is needed and how we can present > stable, sufficient, and easy-to-use APIs for 3rd-party developers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8696) Streaming API for Online LDA
[ https://issues.apache.org/jira/browse/SPARK-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-8696. - Resolution: Incomplete > Streaming API for Online LDA > > > Key: SPARK-8696 > URL: https://issues.apache.org/jira/browse/SPARK-8696 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: yuhao yang >Priority: Major > Labels: bulk-closed > > Streaming LDA can be a natural extension of online LDA. > Yet for now we need to settle on the implementation for LDA prediction, to > support the predictOn method in the streaming version. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8767) Abstractions for InputColParam, OutputColParam
[ https://issues.apache.org/jira/browse/SPARK-8767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-8767. - Resolution: Incomplete > Abstractions for InputColParam, OutputColParam > -- > > Key: SPARK-8767 > URL: https://issues.apache.org/jira/browse/SPARK-8767 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Major > Labels: bulk-closed > Original Estimate: 120h > Remaining Estimate: 120h > > I'd like to create Param subclasses for output and input columns. These will > provide easier schema checking, which could even be done automatically in an > abstraction rather than in each class. That should simplify things for > developers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24016) Yarn does not update node blacklist in static allocation
[ https://issues.apache.org/jira/browse/SPARK-24016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24016. -- Resolution: Incomplete > Yarn does not update node blacklist in static allocation > > > Key: SPARK-24016 > URL: https://issues.apache.org/jira/browse/SPARK-24016 > Project: Spark > Issue Type: Improvement > Components: Scheduler, YARN >Affects Versions: 2.3.0 >Reporter: Imran Rashid >Priority: Major > Labels: bulk-closed > > Task-based blacklisting keeps track of bad nodes, and updates YARN with that > set of nodes so that Spark will not receive more containers on those nodes. > However, that only happens with dynamic allocation. Though it's far more > important with dynamic allocation, even with static allocation this matters; > if executors die, or if the cluster was too busy at the original resource > request to give all the containers, the spark application will add new > containers in the middle. And we want an updated node blacklist for that. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25232) Support Full-Text Search in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-25232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-25232. -- Resolution: Incomplete > Support Full-Text Search in Spark SQL > - > > Key: SPARK-25232 > URL: https://issues.apache.org/jira/browse/SPARK-25232 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.1 >Reporter: Lijie Xu >Priority: Major > Labels: bulk-closed > > Full-text search (i.e., keyword search) is widely used in search engines and > relational databases such as MATCH() ... AGAINST operator in MySQL > (https://dev.mysql.com/doc/en/fulltext-search.html), Text query in Oracle > (https://docs.oracle.com/cd/B28359_01/text.111/b28303/query.htm#g1016054), > and text search in PostgreSQL > (https://www.postgresql.org/docs/9.5/static/textsearch.html). However, it is > not natively supported in Spark SQL. We propose an approach to implement this > full-text search in Spark SQL. > Our proposed approach is detailed at > [https://github.com/JerryLead/Misc/blob/master/FullTextSearch/Full-text-issue-2018.pdf] > and the prototype is available at > [https://github.com/bigdata-iscas/SparkFullTextQuery/tree/like_explorer] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24106) Spark Structured Streaming with RF model taking a long time in processing probability for each mini batch
[ https://issues.apache.org/jira/browse/SPARK-24106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24106. -- Resolution: Incomplete > Spark Structured Streaming with RF model taking a long time in processing > probability for each mini batch > -- > > Key: SPARK-24106 > URL: https://issues.apache.org/jira/browse/SPARK-24106 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0, 2.2.1, 2.3.0 > Environment: Spark yarn / Standalone cluster > 2 master nodes - 32 cores - 124 GB > 9 worker nodes - 32 cores - 124 GB > Kafka input and output topic with 6 partitions >Reporter: Tamilselvan Veeramani >Priority: Major > Labels: bulk-closed, performance > > RandomForestClassificationModel is broadcast to executors for every mini batch > in spark streaming while trying to find probability > RF model size 45MB > spark kafka streaming job jar size 8 MB (including kafka dependencies) > The following log shows the model broadcast to executors for every mini batch when we > call rf_model.transform(dataset).select("probability"), > due to which task deserialization time also increases, coming to 6 to 7 seconds > for the 45MB rf model, although processing time is just 400 to 600 ms per mini > batch > 18/04/15 03:21:23 INFO KafkaSource: GetBatch generating RDD of offset range: > KafkaSourceRDDOffsetRange(Kafka_input_topic-0,242,242,Some(executor_xx.xxx.xx.110_2)), > > KafkaSourceRDDOffsetRange(Kafka_input_topic-1,239,239,Some(executor_xx.xxx.xx.107_0)), > > KafkaSourceRDDOffsetRange(Kafka_input_topic-2,241,241,Some(executor_xx.xxx.xx.102_3)), > > KafkaSourceRDDOffsetRange(Kafka_input_topic-3,238,239,Some(executor_xx.xxx.xx.138_4)), > > KafkaSourceRDDOffsetRange(Kafka_input_topic-4,240,240,Some(executor_xx.xxx.xx.137_1)), > > KafkaSourceRDDOffsetRange(Kafka_input_topic-5,242,242,Some(executor_xx.xxx.xx.111_5)) > 18/04/15 03:21:24 INFO SparkContext: Starting job: start at App.java:106 > 18/04/15 03:21:31 INFO BlockManagerInfo: Added broadcast_92_piece0 in memory > on xx.xxx.xx.137:44682 (size: 62.6 MB, free: 1942.0 MB) > After 2 to 3 weeks of struggle, I found a potential solution which will > help many people who are looking to use an RF model for “probability” in a real > time streaming context > Since the transformImpl method of the RandomForestClassificationModel class > implements only “prediction” in the current version of spark, it can be > leveraged to implement “probability” as well in the same method. > I have modified the code and implemented it on our server and it’s working as > fast as 400ms to 500ms for every mini batch > I see many people out there facing this issue and no solution provided in any > of the forums. Can you please review and put this fix in the next release? Thanks -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24568) Code refactoring for DataType equalsXXX methods
[ https://issues.apache.org/jira/browse/SPARK-24568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24568. -- Resolution: Incomplete > Code refactoring for DataType equalsXXX methods > --- > > Key: SPARK-24568 > URL: https://issues.apache.org/jira/browse/SPARK-24568 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wei Xue >Priority: Major > Labels: bulk-closed > > Right now there is a lot of code duplication between all the DataType equalsXXX > methods: {{equalsIgnoreNullability}}, {{equalsIgnoreCaseAndNullability}}, > {{equalsStructurally}}, etc. We can replace > the duplicated code with a helper function. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
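A minimal sketch of the kind of helper the ticket has in mind: one recursive comparison parameterized by the name and nullability policy, with the existing equalsXXX methods becoming thin wrappers. The object and method names here are illustrative, not the eventual Spark API.

{code:scala}
import org.apache.spark.sql.types._

// Illustrative helper: compare two DataTypes recursively, letting the caller
// decide how field names and nullability are compared, so each equalsXXX
// variant only has to supply a policy instead of repeating the recursion.
object DataTypeCompare {
  def equalsWith(from: DataType, to: DataType,
                 compareName: (String, String) => Boolean,
                 ignoreNullability: Boolean): Boolean = (from, to) match {
    case (ArrayType(fe, fn), ArrayType(te, tn)) =>
      (ignoreNullability || fn == tn) &&
        equalsWith(fe, te, compareName, ignoreNullability)
    case (MapType(fk, fv, fn), MapType(tk, tv, tn)) =>
      (ignoreNullability || fn == tn) &&
        equalsWith(fk, tk, compareName, ignoreNullability) &&
        equalsWith(fv, tv, compareName, ignoreNullability)
    case (StructType(fFields), StructType(tFields)) =>
      fFields.length == tFields.length &&
        fFields.zip(tFields).forall { case (f, t) =>
          compareName(f.name, t.name) &&
            (ignoreNullability || f.nullable == t.nullable) &&
            equalsWith(f.dataType, t.dataType, compareName, ignoreNullability)
        }
    case (fromOther, toOther) => fromOther == toOther
  }

  // The former duplicated methods collapse into one-liners:
  def equalsIgnoreNullability(a: DataType, b: DataType): Boolean =
    equalsWith(a, b, _ == _, ignoreNullability = true)

  def equalsIgnoreCaseAndNullability(a: DataType, b: DataType): Boolean =
    equalsWith(a, b, _.equalsIgnoreCase(_), ignoreNullability = true)
}
{code}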
[jira] [Resolved] (SPARK-24862) Spark Encoder is not consistent to scala case class semantic for multiple argument lists
[ https://issues.apache.org/jira/browse/SPARK-24862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24862. -- Resolution: Incomplete > Spark Encoder is not consistent to scala case class semantic for multiple > argument lists > > > Key: SPARK-24862 > URL: https://issues.apache.org/jira/browse/SPARK-24862 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.1 >Reporter: Antonio Murgia >Priority: Major > Labels: bulk-closed > > Spark Encoder is not consistent to scala case class semantic for multiple > argument lists. > For example if I create a case class with multiple constructor argument lists: > {code:java} > case class Multi(x: String)(y: Int){code} > Scala creates a product with arity 1, while if I apply > {code:java} > Encoders.product[Multi].schema.printTreeString{code} > I get > {code:java} > root > |-- x: string (nullable = true) > |-- y: integer (nullable = false){code} > That is not consistent and leads to: > {code:java} > Error while encoding: java.lang.RuntimeException: Couldn't find y on class > it.enel.next.platform.service.events.common.massive.immutable.Multi > staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, > fromString, assertnotnull(assertnotnull(input[0, > it.enel.next.platform.service.events.common.massive.immutable.Multi, > true])).x, true) AS x#0 > assertnotnull(assertnotnull(input[0, > it.enel.next.platform.service.events.common.massive.immutable.Multi, > true])).y AS y#1 > java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: > Couldn't find y on class > it.enel.next.platform.service.events.common.massive.immutable.Multi > staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, > fromString, assertnotnull(assertnotnull(input[0, > it.enel.next.platform.service.events.common.massive.immutable.Multi, > true])).x, true) AS x#0 > assertnotnull(assertnotnull(input[0, > it.enel.next.platform.service.events.common.massive.immutable.Multi, > true])).y AS y#1 > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:290) > at org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:464) > at org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:464) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:392) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.immutable.List.map(List.scala:296) > at org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:464) > at > it.enel.next.platform.service.events.common.massive.immutable.ParquetQueueSuite$$anonfun$1.apply$mcV$sp(ParquetQueueSuite.scala:48) > at > it.enel.next.platform.service.events.common.massive.immutable.ParquetQueueSuite$$anonfun$1.apply(ParquetQueueSuite.scala:46) > at > it.enel.next.platform.service.events.common.massive.immutable.ParquetQueueSuite$$anonfun$1.apply(ParquetQueueSuite.scala:46) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FlatSpecLike$$anon$1.apply(FlatSpecLike.scala:1682) > at org.scalatest.TestSuite$class.withFixture(TestSuite.scala:196) > at 
org.scalatest.FlatSpec.withFixture(FlatSpec.scala:1685) > at > org.scalatest.FlatSpecLike$class.invokeWithFixture$1(FlatSpecLike.scala:1679) > at > org.scalatest.FlatSpecLike$$anonfun$runTest$1.apply(FlatSpecLike.scala:1692) > at > org.scalatest.FlatSpecLike$$anonfun$runTest$1.apply(FlatSpecLike.scala:1692) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289) > at org.scalatest.FlatSpecLike$class.runTest(FlatSpecLike.scala:1692) > at org.scalatest.FlatSpec.runTest(FlatSpec.scala:1685) > at > org.scalatest.FlatSpecLike$$anonfun$runTests$1.apply(FlatSpecLike.scala:1750) > at > org.scalatest.FlatSpecLike$$anonfun$runTests$1.apply(FlatSpecLike.scala:1750) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384) > at scala.collection.immutable.List.foreach(List.scala:392) > at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384) > at > org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:373) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:410) > at >
[jira] [Resolved] (SPARK-24461) Snapshot Cache
[ https://issues.apache.org/jira/browse/SPARK-24461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24461. -- Resolution: Incomplete > Snapshot Cache > -- > > Key: SPARK-24461 > URL: https://issues.apache.org/jira/browse/SPARK-24461 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Wei Xue >Priority: Major > Labels: bulk-closed > > In some usage scenarios, data staleness is not critical. We can introduce a > snapshot cache of the original data to achieve much better performance. > Unlike the current cache, it is resolved by name instead of by plan > matching. Cache rebuilds can be triggered manually or by events. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20782) Dataset's isCached operator
[ https://issues.apache.org/jira/browse/SPARK-20782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-20782. -- Resolution: Incomplete > Dataset's isCached operator > --- > > Key: SPARK-20782 > URL: https://issues.apache.org/jira/browse/SPARK-20782 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Jacek Laskowski >Priority: Trivial > Labels: bulk-closed > > It'd be very convenient to have an {{isCached}} operator that would say whether > a query is cached in memory or not. > It'd be as simple as the following snippet: > {code} > // val q2: DataFrame > spark.sharedState.cacheManager.lookupCachedData(q2.queryExecution.logical).isDefined > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
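A minimal sketch of how that snippet could be wrapped as an extension method today. It relies on the CacheManager/sharedState internals shown in the ticket, which are not stable public API, so this is illustrative rather than a supported implementation.

{code:scala}
import org.apache.spark.sql.Dataset

// Sketch only: wraps the CacheManager lookup from the snippet above as an
// extension method. sharedState/cacheManager are internal and may change
// (or not be accessible) across Spark versions.
object DatasetCacheOps {
  implicit class RichDataset[T](ds: Dataset[T]) {
    def isCached: Boolean =
      ds.sparkSession.sharedState.cacheManager
        .lookupCachedData(ds.queryExecution.logical)
        .isDefined
  }
}

// Usage (assuming df is a Dataset/DataFrame):
//   import DatasetCacheOps._
//   df.cache()
//   df.isCached   // true once the plan is registered in the cache
{code}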
[jira] [Resolved] (SPARK-23068) Jekyll doc build error does not fail build
[ https://issues.apache.org/jira/browse/SPARK-23068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23068. -- Resolution: Incomplete > Jekyll doc build error does not fail build > -- > > Key: SPARK-23068 > URL: https://issues.apache.org/jira/browse/SPARK-23068 > Project: Spark > Issue Type: Bug > Components: Documentation, SparkR >Affects Versions: 2.3.0 >Reporter: Felix Cheung >Priority: Major > Labels: bulk-closed > > +++ /usr/local/bin/Rscript -e ' if("devtools" %in% > rownames(installed.packages())) { library(devtools); > devtools::document(pkg="./pkg", roclets=c("rd")) }' > Error: 'roxygen2' >= 5.0.0 must be installed for this functionality. > Execution halted > jekyll 3.7.0 | Error: R doc generation failed > See SPARK-23065 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20885) JDBC predicate pushdown uses hardcoded date format
[ https://issues.apache.org/jira/browse/SPARK-20885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-20885. -- Resolution: Incomplete > JDBC predicate pushdown uses hardcoded date format > -- > > Key: SPARK-20885 > URL: https://issues.apache.org/jira/browse/SPARK-20885 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Peter Halverson >Priority: Minor > Labels: bulk-closed > > If a date literal is used in a pushed-down filter expression, e.g. > {code} > val postingDate = java.sql.Date.valueOf("2016-06-03") > val count = jdbcDF.filter($"POSTINGDATE" === postingDate).count > {code} > where the {{POSTINGDATE}} column is of JDBC type Date, the resulting > pushed-down SQL query looks like the following: > {code} > SELECT ... FROM ... WHERE POSTINGDATE = '2016-06-03' > {code} > Specifically, the date is compiled into a string literal using the hardcoded > yyyy-MM-dd format that {{java.sql.Date.toString}} emits. Note the implied > string conversion for date (and timestamp) values in {{JDBCRDD.compileValue}} > {code} > /** >* Converts value to SQL expression. >*/ > private def compileValue(value: Any): Any = value match { > case stringValue: String => s"'${escapeSql(stringValue)}'" > case timestampValue: Timestamp => "'" + timestampValue + "'" > case dateValue: Date => "'" + dateValue + "'" > case arrayValue: Array[Any] => arrayValue.map(compileValue).mkString(", ") > case _ => value > } > {code} > The resulting query fails if the database is expecting a different format for > date string literals. For example, the default format for Oracle is > 'dd-MMM-yy', so when the relation query is executed, it fails with a syntax > error. > {code} > ORA-01861: literal does not match format string > 01861. 0 - "literal does not match format string" > {code} > In some situations it may be possible to change the database's expected date > format to match the Java format, but this is not always possible (e.g. > reading from an external database server). > Shouldn't this kind of conversion be going through some kind of vendor > specific translation (e.g. through a {{JDBCDialect}})? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
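A hedged sketch of the dialect-based direction the reporter suggests: let literal rendering go through a per-vendor hook, with an Oracle override that emits TO_DATE/TO_TIMESTAMP instead of a bare quoted string. The trait and method names below are made up for illustration; only compileValue's shape mirrors the snippet above, and this is not Spark's actual JdbcDialect API.

{code:scala}
import java.sql.{Date, Timestamp}

// Sketch only: a vendor hook for rendering date/timestamp literals.
trait LiteralDialect {
  def compileDate(d: Date): String = s"'$d'"          // default: yyyy-MM-dd string literal
  def compileTimestamp(t: Timestamp): String = s"'$t'"
}

// Oracle needs an explicit format, otherwise the session NLS_DATE_FORMAT applies.
object OracleLiteralDialect extends LiteralDialect {
  override def compileDate(d: Date): String =
    s"TO_DATE('$d', 'YYYY-MM-DD')"
  override def compileTimestamp(t: Timestamp): String =
    s"TO_TIMESTAMP('$t', 'YYYY-MM-DD HH24:MI:SS.FF')"
}

object CompileValueSketch {
  // Same structure as the JDBCRDD.compileValue shown above, but literal
  // rendering for Date/Timestamp is delegated to the dialect.
  def compileValue(value: Any, dialect: LiteralDialect): Any = value match {
    case s: String     => s"'${s.replace("'", "''")}'"
    case t: Timestamp  => dialect.compileTimestamp(t)
    case d: Date       => dialect.compileDate(d)
    case a: Array[Any] => a.map(compileValue(_, dialect)).mkString(", ")
    case _             => value
  }
}
{code}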
[jira] [Resolved] (SPARK-24597) Spark ML Pipeline Should support non-linear models => DAGPipeline
[ https://issues.apache.org/jira/browse/SPARK-24597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24597. -- Resolution: Incomplete > Spark ML Pipeline Should support non-linear models => DAGPipeline > - > > Key: SPARK-24597 > URL: https://issues.apache.org/jira/browse/SPARK-24597 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.3.1 >Reporter: Michael Dreibelbis >Priority: Minor > Labels: bulk-closed > > Currently SparkML Pipeline/PipelineModel supports single linear dataset > transformation > despite the documentation stating otherwise: > [reference > documentation|https://spark.apache.org/docs/2.3.0/ml-pipeline.html#details] > I'm proposing implementing a DAGPipeline and supporting multiple datasets as > input > The code could look something like this: > > {code:java} > val ds1 = /*dataset 1 creation*/ > val ds2 = /*dataset 2 creation*/ > // nodes take on uid from estimator/transformer > val i1 = IdentityNode(new IdentityTransformer("i1")) > val i2 = IdentityNode(new IdentityTransformer("i2")) > val bi = TransformerNode(new Binarizer("bi")) > val cv = EstimatorNode(new CountVectorizer("cv")) > val idf = EstimatorNode(new IDF("idf")) > val j1 = JoinerNode(new Joiner("j1")) > val nodes = Array(i1, i2, bi, cv, idf) > val edges = Array( > ("i1", "cv"), ("cv", "idf"), ("idf", "j1"), > ("i2", "bi"), ("bi", "j1")) > val p = new DAGPipeline(nodes, edges) > .setIdentity("i1", ds1) > .setIdentity("i2", ds2) > val m = p.fit(spark.emptyDataFrame) > m.setIdentity("i1", ds1).setIdentity("i2", ds2) > m.transform(spark.emptyDataFrame) > {code} > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17602) PySpark - Performance Optimization Large Size of Broadcast Variable
[ https://issues.apache.org/jira/browse/SPARK-17602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-17602. -- Resolution: Incomplete > PySpark - Performance Optimization Large Size of Broadcast Variable > --- > > Key: SPARK-17602 > URL: https://issues.apache.org/jira/browse/SPARK-17602 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.6.2, 2.0.0 > Environment: Linux >Reporter: Xiao Ming Bao >Priority: Major > Labels: bulk-closed > Attachments: PySpark – Performance Optimization for Large Size of > Broadcast variable.pdf > > Original Estimate: 120h > Remaining Estimate: 120h > > Problem: currently, on the executor side, the broadcast variable is written to > disk as a file and each Python worker process reads it from local disk and > de-serializes it to a Python object before executing a task. When the broadcast > variable is large, this read/de-serialization takes a lot of time, and when the > Python worker is NOT reused and the number of tasks is large, performance is > very bad since the Python worker needs to read/de-serialize it for each task. > Brief of the solution: > transfer the broadcast variable to the daemon Python process via a file (or > socket/mmap) and deserialize it into an object in the daemon Python process; after > a worker Python process is forked by the daemon Python process, the worker > automatically has the deserialized object and can use it directly thanks to > Linux's copy-on-write memory behavior. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20007) Make SparkR apply() functions robust to workers that return empty data.frame
[ https://issues.apache.org/jira/browse/SPARK-20007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-20007. -- Resolution: Incomplete > Make SparkR apply() functions robust to workers that return empty data.frame > > > Key: SPARK-20007 > URL: https://issues.apache.org/jira/browse/SPARK-20007 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Hossein Falaki >Priority: Major > Labels: bulk-closed > > When using {{gapply()}} (or other members of the {{apply()}} family) with a > schema, Spark will try to parse data returned from the R process on each > worker as Spark DataFrame Rows based on the schema. In this case our provided > schema suggests that we have six columns. When an R worker returns results to the > JVM, SparkSQL will try to access its columns one by one and cast them to the > proper types. If the R worker returns nothing, the JVM will throw an > {{ArrayIndexOutOfBoundsException}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17129) Support statistics collection and cardinality estimation for partitioned tables
[ https://issues.apache.org/jira/browse/SPARK-17129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-17129. -- Resolution: Incomplete > Support statistics collection and cardinality estimation for partitioned > tables > --- > > Key: SPARK-17129 > URL: https://issues.apache.org/jira/browse/SPARK-17129 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Zhenhua Wang >Priority: Major > Labels: bulk-closed > > Support statistics collection and cardinality estimation for partitioned > tables. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22054) Allow release managers to inject their keys
[ https://issues.apache.org/jira/browse/SPARK-22054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-22054. -- Resolution: Incomplete > Allow release managers to inject their keys > --- > > Key: SPARK-22054 > URL: https://issues.apache.org/jira/browse/SPARK-22054 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.3.0 >Reporter: Holden Karau >Priority: Major > Labels: bulk-closed > > Right now the release process signs with Patrick's keys; let's update > the scripts to allow the release manager to sign the release as part of the > job. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23632) sparkR.session() error with spark packages - JVM is not ready after 10 seconds
[ https://issues.apache.org/jira/browse/SPARK-23632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23632. -- Resolution: Incomplete > sparkR.session() error with spark packages - JVM is not ready after 10 seconds > -- > > Key: SPARK-23632 > URL: https://issues.apache.org/jira/browse/SPARK-23632 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0, 2.2.1, 2.3.0 >Reporter: Jaehyeon Kim >Priority: Minor > Labels: bulk-closed > > Hi > When I execute _sparkR.session()_ with _org.apache.hadoop:hadoop-aws:2.8.2_ > as following, > {code:java} > library(SparkR, lib.loc=file.path(Sys.getenv('SPARK_HOME'),'R', 'lib')) > ext_opts <- '-Dhttp.proxyHost=10.74.1.25 -Dhttp.proxyPort=8080 > -Dhttps.proxyHost=10.74.1.25 -Dhttps.proxyPort=8080' > sparkR.session(master = "spark://master:7077", >appName = 'ml demo', >sparkConfig = list(spark.driver.memory = '2g'), >sparkPackages = 'org.apache.hadoop:hadoop-aws:2.8.2', >spark.driver.extraJavaOptions = ext_opts) > {code} > I see *JVM is not ready after 10 seconds* error. Below shows some of the log > messages. > {code:java} > Ivy Default Cache set to: /home/rstudio/.ivy2/cache > The jars for the packages stored in: /home/rstudio/.ivy2/jars > :: loading settings :: url = > jar:file:/usr/local/spark-2.2.1/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml > org.apache.hadoop#hadoop-aws added as a dependency > :: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0 > confs: [default] > found org.apache.hadoop#hadoop-aws;2.8.2 in central > ... > ... > found javax.servlet.jsp#jsp-api;2.1 in central > Error in sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, : > JVM is not ready after 10 seconds > ... > ... > found joda-time#joda-time;2.9.4 in central > downloading > https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.8.2/hadoop-aws-2.8.2.jar > ... > ... > ... > xmlenc#xmlenc;0.52 from central in [default] > - > | |modules|| artifacts | > | conf | number| search|dwnlded|evicted|| number|dwnlded| > - > | default | 76 | 76 | 76 | 0 || 76 | 76 | > - > :: retrieving :: org.apache.spark#spark-submit-parent > confs: [default] > 76 artifacts copied, 0 already retrieved (27334kB/56ms) > {code} > It's fine if I re-execute it after the package and its dependencies are > downloaded. > I consider it's because of this part - > https://github.com/apache/spark/blob/master/R/pkg/R/sparkR.R#L181 > {code:java} > if (!file.exists(path)) { > stop("JVM is not ready after 10 seconds") > } > {code} > Just wonder if it may be possible to update so that a user can determine how > much to wait? > Thanks. > Regards > Jaehyeon -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23452) Extend test coverage to all ORC readers
[ https://issues.apache.org/jira/browse/SPARK-23452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23452. -- Resolution: Incomplete > Extend test coverage to all ORC readers > --- > > Key: SPARK-23452 > URL: https://issues.apache.org/jira/browse/SPARK-23452 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 2.3.1 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Labels: bulk-closed > > We have five ORC readers. We had better have a test coverage for all ORC > readers. > - Hive Serde > - Hive OrcFileFormat > - Apache ORC Vectorized Wrapper > - Apache ORC Vectorized Copy > - Apache ORC MR -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22658) SPIP: TensorFlowOnSpark as a Scalable Deep Learning Lib of Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-22658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-22658. -- Resolution: Incomplete > SPIP: TensorFlowOnSpark as a Scalable Deep Learning Lib of Apache Spark > > > Key: SPARK-22658 > URL: https://issues.apache.org/jira/browse/SPARK-22658 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.2.0 >Reporter: Andy Feng >Priority: Major > Labels: bulk-closed > Attachments: SPIP_ TensorFlowOnSpark.pdf > > Original Estimate: 336h > Remaining Estimate: 336h > > TensorFlowOnSpark (TFoS) was released on GitHub for distributed TensorFlow > training and inference on Apache Spark clusters. TFoS is designed to: > * Easily migrate all existing TensorFlow programs with minimum code change; > * Support all TensorFlow functionalities: synchronous/asynchronous training, > model/data parallelism, inference and TensorBoard; > * Easily integrate with your existing data processing pipelines (ex. Spark > SQL) and machine learning algorithms (ex. MLlib); > * Be easily deployed on cloud or on-premise: CPU & GPU, Ethernet and > Infiniband. > We propose to merge TFoS into Apache Spark as a scalable deep learning > library to: > * Make deep learning easy for the Apache Spark community: Familiar pipeline API > for training and inference; Enable TensorFlow training/inference on existing > Spark clusters. > * Further simplify the data scientist experience: Ensure compatibility b/w Apache > Spark and TFoS; Reduce steps for installation. > * Help Apache Spark evolve on deep learning: Establish a design pattern > for additional frameworks (ex. Caffe, CNTK); Structured streaming for DL > training/inference. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15694) Implement ScriptTransformation in sql/core
[ https://issues.apache.org/jira/browse/SPARK-15694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-15694. -- Resolution: Incomplete > Implement ScriptTransformation in sql/core > -- > > Key: SPARK-15694 > URL: https://issues.apache.org/jira/browse/SPARK-15694 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Priority: Major > Labels: bulk-closed > > ScriptTransformation currently relies on Hive internals. It'd be great if we > can implement a native ScriptTransformation in sql/core module to remove the > extra Hive dependency here. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21166) Automated ML persistence
[ https://issues.apache.org/jira/browse/SPARK-21166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-21166. -- Resolution: Incomplete > Automated ML persistence > > > Key: SPARK-21166 > URL: https://issues.apache.org/jira/browse/SPARK-21166 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.2.0 >Reporter: Joseph K. Bradley >Priority: Major > Labels: bulk-closed > > This JIRA is for discussing the possibility of automating ML persistence. > Currently, custom save/load methods are written for every Model. However, we > could design a mixin which provides automated persistence, inspecting model > data and Params and reading/writing (known types) automatically. This was > brought up in discussions with developers behind > https://github.com/azure/mmlspark > Some issues we will need to consider: > * Providing generic mixin usable in most or all cases > * Handling corner cases (strange Param types, etc.) > * Backwards compatibility (loading models saved by old Spark versions) > Because of backwards compatibility in particular, it may make sense to > implement testing for that first, before we try to address automated > persistence: [SPARK-15573] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
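For Transformers whose entire state lives in Params (no model data such as coefficients), recent Spark versions already ship a limited form of this idea via DefaultParamsWritable/DefaultParamsReadable, which is the closest existing building block to the automated persistence discussed above. A minimal sketch, assuming those traits are public in the Spark version at hand; the transformer class itself is made up for illustration:

{code:scala}
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable}
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.types.StructType

// A do-nothing transformer: because all of its state is Params, mixing in
// DefaultParamsWritable gives it save/load for free. Model data (e.g.
// coefficient arrays) would still need custom read/write logic.
class NoopTransformer(override val uid: String)
    extends Transformer with DefaultParamsWritable {

  def this() = this(Identifiable.randomUID("noop"))

  override def transform(ds: Dataset[_]): DataFrame = ds.toDF()
  override def transformSchema(schema: StructType): StructType = schema
  override def copy(extra: ParamMap): NoopTransformer = defaultCopy(extra)
}

// Companion providing NoopTransformer.load(path).
object NoopTransformer extends DefaultParamsReadable[NoopTransformer]
{code}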
[jira] [Resolved] (SPARK-23777) Missing DAG arrows between stages
[ https://issues.apache.org/jira/browse/SPARK-23777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23777. -- Resolution: Incomplete > Missing DAG arrows between stages > - > > Key: SPARK-23777 > URL: https://issues.apache.org/jira/browse/SPARK-23777 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.3.0, 2.3.0 >Reporter: Stefano Pettini >Priority: Trivial > Labels: bulk-closed > Attachments: Screenshot-2018-3-23 RDDTestApp - Details for Job 0.png > > > In the Spark UI DAGs, sometimes there are missing arrows between stages. It > seems to happen when the same RDD is shuffled twice. > For example in this case: > {{val a = context.parallelize(List(10, 20, 30, 40, 50)).keyBy(_ / 10)}} > {{val b = a join a}} > {{b.collect()}} > There's a missing arrow from stage 1 to 2. > _This is an old one, since 1.3.0 at least, still reproducible in 2.3.0._ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23227) Add user guide entry for collecting sub models for cross-validation classes
[ https://issues.apache.org/jira/browse/SPARK-23227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23227. -- Resolution: Incomplete > Add user guide entry for collecting sub models for cross-validation classes > --- > > Key: SPARK-23227 > URL: https://issues.apache.org/jira/browse/SPARK-23227 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, PySpark >Affects Versions: 2.3.0 >Reporter: Nicholas Pentreath >Priority: Minor > Labels: bulk-closed > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24406) Exposing custom spark scala ml transformers in pyspark
[ https://issues.apache.org/jira/browse/SPARK-24406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24406. -- Resolution: Incomplete > Exposing custom spark scala ml transformers in pyspark > --- > > Key: SPARK-24406 > URL: https://issues.apache.org/jira/browse/SPARK-24406 > Project: Spark > Issue Type: Question > Components: ML, MLlib >Affects Versions: 2.3.0 >Reporter: Pratyush Sharma >Priority: Minor > Labels: bulk-closed > > How can I use a custom transformer written in scala in a pyspark pipeline. > {code:java} > class UpperTransformer(override val uid: String) > extends UnaryTransformer[String, String, UpperTransformer] { > > def this() = this(Identifiable.randomUID("upper")) > > override protected def validateInputType(inputType: DataType): Unit = { > require(inputType == StringType) > } > > protected def createTransformFunc: String => String = { > _.toUpperCase > } > > protected def outputDataType: DataType = StringType > }{code} > > Use this transformer in pyspark pipeline. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25340) Pushes down Sample beneath deterministic Project
[ https://issues.apache.org/jira/browse/SPARK-25340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-25340. -- Resolution: Incomplete > Pushes down Sample beneath deterministic Project > > > Key: SPARK-25340 > URL: https://issues.apache.org/jira/browse/SPARK-25340 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.1 >Reporter: Takeshi Yamamuro >Priority: Minor > Labels: bulk-closed > > If computations in Project are heavy (e.g., UDFs), it is useful to push down > sample nodes into deterministic projects; > {code} > scala> spark.range(10).selectExpr("id + 3").sample(0.5).explain(true) > // without this proposal > == Analyzed Logical Plan == > (id + 3): bigint > Sample 0.0, 0.5, false, 3370873312340343855 > +- Project [(id#0L + cast(3 as bigint)) AS (id + 3)#2L] >+- Range (0, 10, step=1, splits=Some(4)) > == Optimized Logical Plan == > Sample 0.0, 0.5, false, 3370873312340343855 > +- Project [(id#0L + 3) AS (id + 3)#2L] >+- Range (0, 10, step=1, splits=Some(4)) > // with this proposal > == Optimized Logical Plan == > Project [(id#0L + 3) AS (id + 3)#2L] > +- Sample 0.0, 0.5, false, -6519017078291024113 >+- Range (0, 10, step=1, splits=Some(4)) > {code} > POC: https://github.com/apache/spark/compare/master...maropu:SamplePushdown -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
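A hedged sketch of what such an optimizer rule could look like, modeled on Catalyst's Rule API; this is not the POC code linked above, and the Sample constructor shape is version-dependent internal API:

{code:scala}
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project, Sample}
import org.apache.spark.sql.catalyst.rules.Rule

// Sketch: if every project expression is deterministic, swap Sample and
// Project so heavy expressions (e.g. UDFs) are evaluated only on sampled rows.
object PushSampleThroughProject extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case s @ Sample(_, _, _, _, p @ Project(projectList, grandChild))
        if projectList.forall(_.deterministic) =>
      // Project(Sample(child)) instead of Sample(Project(child)).
      p.copy(child = s.copy(child = grandChild))
  }
}
{code}

The deterministic check matters: pushing Sample below a non-deterministic projection (e.g. one calling rand()) would change which values the projection produces relative to the sampled rows.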
[jira] [Resolved] (SPARK-23744) Memory leak in ReadableChannelFileRegion
[ https://issues.apache.org/jira/browse/SPARK-23744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23744. -- Resolution: Incomplete > Memory leak in ReadableChannelFileRegion > > > Key: SPARK-23744 > URL: https://issues.apache.org/jira/browse/SPARK-23744 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: liuxian >Priority: Major > Labels: bulk-closed > > In the class _ReadableChannelFileRegion_, the _buffer_ is direct memory; we > should modify _deallocate_ to free it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23073) Fix incorrect R doc page header for generated sql functions
[ https://issues.apache.org/jira/browse/SPARK-23073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23073. -- Resolution: Incomplete > Fix incorrect R doc page header for generated sql functions > --- > > Key: SPARK-23073 > URL: https://issues.apache.org/jira/browse/SPARK-23073 > Project: Spark > Issue Type: Documentation > Components: SparkR >Affects Versions: 2.2.1, 2.3.0 >Reporter: Felix Cheung >Priority: Minor > Labels: bulk-closed > Attachments: Screen Shot 2018-01-14 at 11.11.05 AM.png > > > See title says > {code} > asc {SparkR} > {code} > https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc1-docs/_site/api/R/columnfunctions.html > http://spark.apache.org/docs/latest/api/R/columnfunctions.html > asc, contains etc are functions generated at runtime. Because of that, their > doc entries are dependent on the Generics.R file. Unfortunately, ROxygen2 > picks the doc page title from the first function name by default, in the > presence of any function it can parse. > An attempt to fix here > https://github.com/apache/spark/pull/20263/commits/d433dc930021de85aa338c5017a223bae3526df3#diff-8e3d61ff66c9ffcd6ffb7a8eedc08409R824 > {code} > #' @rdname columnfunctions > #' @export > +#' @name NULL > setGeneric("asc", function(x) { standardGeneric("asc") }) > {code} > But it cause a more severe issue to fail CRAN checks > {code} > * checking for missing documentation entries ... WARNING > Undocumented code objects: > 'asc' 'contains' 'desc' 'getField' 'getItem' 'isNaN' 'isNotNull' > 'isNull' 'like' 'rlike' > All user-level objects in a package should have documentation entries. > See the chapter 'Writing R documentation files' in the 'Writing R > Extensions' manual. > * checking for code/documentation mismatches ... OK > * checking Rd \usage sections ... WARNING > Objects in \usage without \alias in documentation object 'columnfunctions': > 'asc' 'contains' 'desc' 'getField' 'getItem' 'isNaN' 'isNull' > 'isNotNull' 'like' 'rlike' > {code} > To follow up we should > - look for a way to set the doc page title > - http://spark.apache.org/docs/latest/api/R/columnfunctions.html is really > barebone and we should explicitly add a doc page with content (which could > also address the first point) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16707) TransportClientFactory.createClient may throw NPE
[ https://issues.apache.org/jira/browse/SPARK-16707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-16707. -- Resolution: Incomplete > TransportClientFactory.createClient may throw NPE > - > > Key: SPARK-16707 > URL: https://issues.apache.org/jira/browse/SPARK-16707 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.0 >Reporter: Hong Shen >Priority: Major > Labels: bulk-closed > > I have encounter some NullPointerException when > TransportClientFactory.createClient in my cluster, here is the following > stack trace. > {code} > org.apache.spark.shuffle.FetchFailedException > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:326) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:303) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:53) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:511) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.(TungstenAggregationIterator.scala:686) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:741) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:741) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:337) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:301) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:337) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:301) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:337) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:301) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:337) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:301) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:89) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:215) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:744) > Caused by: java.lang.NullPointerException > at > 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:144) > at > org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:107) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:146) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:126) > at > org.apache.spark.network.shuffle.ExternalShuffleClient.fetchBlocks(ExternalShuffleClient.java:116) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:155) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchUpToMaxBytes(ShuffleBlockFetcherIterator.scala:319) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:299) > ... 32 more > {code} > The code is at >
[jira] [Resolved] (SPARK-25244) [Python] Setting `spark.sql.session.timeZone` only partially respected
[ https://issues.apache.org/jira/browse/SPARK-25244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-25244. -- Resolution: Incomplete > [Python] Setting `spark.sql.session.timeZone` only partially respected > -- > > Key: SPARK-25244 > URL: https://issues.apache.org/jira/browse/SPARK-25244 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.1 >Reporter: Anton Daitche >Priority: Major > Labels: bulk-closed > > The setting `spark.sql.session.timeZone` is respected by PySpark when > converting from and to Pandas, as described > [here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics]. > However, when timestamps are converted directly to Python's `datetime` > objects, it's ignored and the system's timezone is used. > This can be checked with the following code snippet: > {code:java} > import pyspark.sql > spark = (pyspark > .sql > .SparkSession > .builder > .master('local[1]') > .config("spark.sql.session.timeZone", "UTC") > .getOrCreate() > ) > df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"]) > df = df.withColumn("ts", df["ts"].astype("timestamp")) > print(df.toPandas().iloc[0,0]) > print(df.collect()[0][0]) > {code} > Which for me prints (the exact result depends on the timezone of your system, > mine is Europe/Berlin) > {code:java} > 2018-06-01 01:00:00 > 2018-06-01 03:00:00 > {code} > Hence, the method `toPandas` respected the timezone setting (UTC), but the > method `collect` ignored it and converted the timestamp to my system's > timezone. > The cause of this behaviour is that the methods `toInternal` and > `fromInternal` of PySpark's `TimestampType` class don't take into account the > setting `spark.sql.session.timeZone` and use the system timezone. > If the maintainers agree that this should be fixed, I would try to come up > with a patch. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23983) Disable X-Frame-Options from Spark UI response headers if explicitly configured
[ https://issues.apache.org/jira/browse/SPARK-23983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23983. -- Resolution: Incomplete > Disable X-Frame-Options from Spark UI response headers if explicitly > configured > --- > > Key: SPARK-23983 > URL: https://issues.apache.org/jira/browse/SPARK-23983 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Taylor Cressy >Priority: Minor > Labels: UI, bulk-closed > > We should introduce a configuration for the spark UI to omit X-Frame-Options > from the response headers if explicitly set. > The X-Frame-Options header was introduced in *org.apache.spark.ui.JettyUtils* > to prevent frame-related click-jacking vulnerabilities. This was addressed > in: SPARK-10589 > > {code:java} > val allowFramingFrom = conf.getOption("spark.ui.allowFramingFrom") > val xFrameOptionsValue = >allowFramingFrom.map(uri => s"ALLOW-FROM $uri").getOrElse("SAMEORIGIN") > ... > // In doGet > response.setHeader("X-Frame-Options", xFrameOptionsValue) > {code} > > The problem with this, is that we only allow the same origin or a singular > host to present the UI with iframes. I propose we add a configuration that > turns this off. > > Use Case: Currently building a "portal UI" for all things related to a > cluster. Embedding the spark UI in the portal is necessary because the > cluster is in the cloud and can only be accessed via an SSH tunnel - as > intended. (The reverse proxy configuration {{*_spark.ui.reverseProxy_* could > be used to simplify connecting to all the workers}}, but this doesn't solve > handling multiple, unrelated, UIs through a single tunnel. > > Moreover, the host that our "portal UI" would reside on is not assigned a > hostname and has an ephemeral IP address, so the *ALLOW-FROM* directive isn't > useful in this case. > > Lastly, the current design does not allow for different hosts to be > configured, i.e. *_spark.ui.allowFramingFrom_* _*hostname1,hostname2*_ is not > a valid config. > > An alternative option would be to explore Content-Security-Policy: > [https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Security-Policy#frame-ancestors] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
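A minimal sketch of the kind of opt-out being proposed, gating the header behind a new flag; the config key spark.ui.xFrameOptionsEnabled is an assumption for illustration, not an existing Spark setting, while spark.ui.allowFramingFrom and the header logic mirror what the ticket quotes:

{code:scala}
import javax.servlet.http.HttpServletResponse
import org.apache.spark.SparkConf

// Sketch of the proposed behavior: only emit X-Frame-Options when framing
// protection has not been explicitly disabled by a (hypothetical) flag.
object FramingHeaderSketch {
  def setFramingHeader(conf: SparkConf, response: HttpServletResponse): Unit = {
    val framingProtectionEnabled =
      conf.getBoolean("spark.ui.xFrameOptionsEnabled", defaultValue = true)
    if (framingProtectionEnabled) {
      val allowFramingFrom = conf.getOption("spark.ui.allowFramingFrom")
      val xFrameOptionsValue =
        allowFramingFrom.map(uri => s"ALLOW-FROM $uri").getOrElse("SAMEORIGIN")
      response.setHeader("X-Frame-Options", xFrameOptionsValue)
    }
    // else: omit the header entirely, letting any host embed the UI in an iframe.
  }
}
{code}

As the ticket notes, a Content-Security-Policy frame-ancestors header would be the more flexible long-term answer, since it accepts multiple hosts.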
[jira] [Resolved] (SPARK-12449) Pushing down arbitrary logical plans to data sources
[ https://issues.apache.org/jira/browse/SPARK-12449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-12449. -- Resolution: Incomplete > Pushing down arbitrary logical plans to data sources > > > Key: SPARK-12449 > URL: https://issues.apache.org/jira/browse/SPARK-12449 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Stephan Kessler >Priority: Major > Labels: bulk-closed > Attachments: pushingDownLogicalPlans.pdf > > > With the help of the DataSource API we can pull data from external sources > for processing. Implementing interfaces such as {{PrunedFilteredScan}} allows > to push down filters and projects pruning unnecessary fields and rows > directly in the data source. > However, data sources such as SQL Engines are capable of doing even more > preprocessing, e.g., evaluating aggregates. This is beneficial because it > would reduce the amount of data transferred from the source to Spark. The > existing interfaces do not allow such kind of processing in the source. > We would propose to add a new interface {{CatalystSource}} that allows to > defer the processing of arbitrary logical plans to the data source. We have > already shown the details at the Spark Summit 2015 Europe > [https://spark-summit.org/eu-2015/events/the-pushdown-of-everything/] > I will add a design document explaining details. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
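A rough Scala rendering of the interface the proposal describes: the source advertises which logical plans it can evaluate itself, and Spark hands matching subtrees over instead of pulling raw rows. The trait and method signatures below are assumptions inferred from the summary, not a committed Spark API.

{code:scala}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Sketch of the proposed CatalystSource contract.
trait CatalystSource {
  /** Whether this source can execute the given logical plan (or a supported subset of it). */
  def supportsLogicalPlan(plan: LogicalPlan): Boolean

  /** Execute the pushed-down plan inside the source and return its result rows. */
  def logicalPlanToRDD(plan: LogicalPlan): RDD[Row]
}
{code}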
[jira] [Resolved] (SPARK-21972) Allow users to control input data persistence in ML Estimators via a handlePersistence ml.Param
[ https://issues.apache.org/jira/browse/SPARK-21972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-21972. -- Resolution: Incomplete > Allow users to control input data persistence in ML Estimators via a > handlePersistence ml.Param > --- > > Key: SPARK-21972 > URL: https://issues.apache.org/jira/browse/SPARK-21972 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.2.0 >Reporter: Siddharth Murching >Priority: Major > Labels: bulk-closed > > Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, > etc) call {{cache()}} on uncached input datasets to improve performance. > Unfortunately, these algorithms a) check input persistence inaccurately > ([SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) > check the persistence level of the input dataset but not any of its parents. > These issues can result in unwanted double-caching of input data & degraded > performance (see > [SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]). > This ticket proposes adding a boolean {{handlePersistence}} param > (org.apache.spark.ml.param) so that users can specify whether an ML algorithm > should try to cache un-cached input data. {{handlePersistence}} will be > {{true}} by default, corresponding to existing behavior (always persisting > uncached input), but users can achieve finer-grained control over input > persistence by setting {{handlePersistence}} to {{false}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
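A minimal sketch of the proposed param expressed as a shared trait, in the style of Spark ML's existing shared params; the trait name and wiring are assumptions, only the Param machinery is existing API:

{code:scala}
import org.apache.spark.ml.param.{BooleanParam, Params}

// Sketch: estimators would mix this in and consult $(handlePersistence)
// before deciding whether to cache un-cached input data themselves.
trait HasHandlePersistence extends Params {

  final val handlePersistence: BooleanParam = new BooleanParam(this,
    "handlePersistence", "whether the algorithm should cache un-cached input data")
  setDefault(handlePersistence -> true)

  final def getHandlePersistence: Boolean = $(handlePersistence)

  final def setHandlePersistence(value: Boolean): this.type =
    set(handlePersistence, value)
}
{code}

With the default of true, existing behavior is preserved; a user who has already persisted the input (or an ancestor of it) could call setHandlePersistence(false) to avoid double-caching.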
[jira] [Resolved] (SPARK-20598) Iterative checkpoints do not get removed from HDFS
[ https://issues.apache.org/jira/browse/SPARK-20598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-20598. -- Resolution: Incomplete > Iterative checkpoints do not get removed from HDFS > -- > > Key: SPARK-20598 > URL: https://issues.apache.org/jira/browse/SPARK-20598 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core, YARN >Affects Versions: 2.1.0 >Reporter: Guillem Palou >Priority: Major > Labels: bulk-closed > > I am running a pyspark application that makes use of dataframe.checkpoint() > because Spark needs exponential time to compute the plan and eventually I had > to stop it. Using {{checkpoint}} allowed the application to proceed with the > computation, but I noticed that the HDFS cluster was filling up with RDD > files. Spark is running on YARN client mode. > I managed to reproduce the problem in a toy example as below: > {code} > df = spark.createDataFrame([T.Row(a=1, b=2)]).checkpoint() > for i in range(4): > # either line of the following 2 will produce the error > df = df.select('*', F.concat(*df.columns)).cache().checkpoint() > df = df.join(df, on='a').cache().checkpoint() > # the following two lines do not seem to have an effect > gc.collect() > sc._jvm.System.gc() > {code} > After running the code and {{sc.top()}}, I can still see the rdd's > checkpointed in HDFS: > {quote} > guillem@ip-10-9-94-0:~$ hdfs dfs -du -h $CHECKPOINT_PATH > 5.2 K $CHECKPOINT_PATH/53e54099-3f50-4aeb-aee2-d817bfe57d77/rdd-12 > 5.2 K $CHECKPOINT_PATH/53e54099-3f50-4aeb-aee2-d817bfe57d77/rdd-18 > 5.2 K $CHECKPOINT_PATH/53e54099-3f50-4aeb-aee2-d817bfe57d77/rdd-24 > 5.2 K $CHECKPOINT_PATH/53e54099-3f50-4aeb-aee2-d817bfe57d77/rdd-30 > 5.2 K $CHECKPOINT_PATH/53e54099-3f50-4aeb-aee2-d817bfe57d77/rdd-6 > {quote} > The config flag {{spark.cleaner.referenceTracking.cleanCheckpoints}} is set > to {{true}}. I would expect Spark to clean up all RDDs that can't be > accessed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24260) Support for multi-statement SQL in SparkSession.sql API
[ https://issues.apache.org/jira/browse/SPARK-24260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24260. -- Resolution: Incomplete > Support for multi-statement SQL in SparkSession.sql API > --- > > Key: SPARK-24260 > URL: https://issues.apache.org/jira/browse/SPARK-24260 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ravindra Nath Kakarla >Priority: Minor > Labels: bulk-closed > > sparkSession.sql API only supports a single SQL statement to be executed for > a call. A multi-statement SQL cannot be executed in a single call. For > example, > {code:java} > SparkSession sparkSession = > SparkSession.builder().appName("MultiStatementSQL") > .master("local").config("", "").getOrCreate() > sparkSession.sql("DROP TABLE IF EXISTS count_employees; CACHE TABLE > employees; CREATE TEMPORARY VIEW count_employees AS SELECT count(*) as cnt > FROM employees; SELECT * FROM count_employees") > {code} > Above code fails with the error, > {code:java} > org.apache.spark.sql.catalyst.parser.ParseException: mismatched input ';' > expecting {code} > Solution to this problem is to use the .sql API multiple times in a specific > order. > {code:java} > sparkSession.sql("DROP TABLE IF EXISTS count_employees") > sparkSession.sql("CACHE TABLE employees") > sparkSession.sql("CREATE TEMPORARY VIEW count_employees AS SELECT count(*) as > cnt FROM employees;") > sparkSession.sql("SELECT * FROM count_employees") > {code} > If these SQL statements come from a string / file, users have to implement > their own parsers to execute this. Like, > {code:java} > val sqlFromFile = """DROP TABLE IF EXISTS count_employees; > |CACHE TABLE employees; > |CREATE TEMPORARY VIEW count_employees AS SELECT count(*) as cnt FROM > employees; SELECT * FROM count_employees""".stripMargin{code} > {code:java} > sqlFromFile.split(";") > .forEach(line => sparkSession.sql(line)) > {code} > This naive parser can fail for many edge cases (like ";" inside a string). > Even if users use the same grammar used by Spark and implement their own > parsing, it can go out of sync with the way Spark parses the statements. > Can support for multiple SQL statements be built into SparkSession.sql API > itself? > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
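Until or unless the API supports this natively, a slightly safer client-side splitter than a bare split(";") might look like the sketch below. It only handles single-quoted string literals; comments, escaped quotes, and procedural blocks are deliberately out of scope, which is exactly why the ticket argues the parsing belongs inside Spark. The helper names are made up for illustration.

{code:scala}
import org.apache.spark.sql.{DataFrame, SparkSession}

object MultiStatementSql {

  // Split on ';' only when outside a single-quoted string literal.
  def splitStatements(script: String): Seq[String] = {
    val statements = scala.collection.mutable.Buffer(new StringBuilder)
    var inQuote = false
    script.foreach {
      case '\'' =>
        inQuote = !inQuote
        statements.last.append('\'')
      case ';' if !inQuote =>
        statements += new StringBuilder
      case c =>
        statements.last.append(c)
    }
    statements.map(_.toString.trim).filter(_.nonEmpty).toSeq
  }

  // Run each statement in order and return the result of the last one.
  def sqlAll(spark: SparkSession, script: String): DataFrame =
    splitStatements(script).map(spark.sql).last
}
{code}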
[jira] [Resolved] (SPARK-21707) Improvement a special case for non-deterministic filters in optimizer
[ https://issues.apache.org/jira/browse/SPARK-21707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-21707. -- Resolution: Incomplete > Improvement a special case for non-deterministic filters in optimizer > - > > Key: SPARK-21707 > URL: https://issues.apache.org/jira/browse/SPARK-21707 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: caoxuewen >Priority: Major > Labels: bulk-closed > > Currently the optimizer does a lot of special handling for non-deterministic projects and > filters, but it is not good enough. This patch adds a new special case > for non-deterministic filters, so that we only read the fields the user needs > when the filter is non-deterministic. > For example, when the filter condition is non-deterministic, e.g. it contains a > non-deterministic function (the rand function), the optimizer generates the following HiveTableScans: > ``` > HiveTableScans plan:Aggregate [k#2L], [k#2L, k#2L, sum(cast(id#1 as bigint)) > AS sum(id)#395L] > +- Project [d004#205 AS id#1, CEIL(c010#214) AS k#2L] >+- Filter ((isnotnull(d004#205) && (rand(-4530215890880734772) <= 0.5)) && > NOT (cast(cast(d004#205 as decimal(10,0)) as decimal(11,1)) = 0.0)) > +- MetastoreRelation XXX_database, XXX_table > HiveTableScans plan:Project [d004#205 AS id#1, CEIL(c010#214) AS k#2L] > +- Filter ((isnotnull(d004#205) && (rand(-4530215890880734772) <= 0.5)) && > NOT (cast(cast(d004#205 as decimal(10,0)) as decimal(11,1)) = 0.0)) >+- MetastoreRelation XXX_database, XXX_table > HiveTableScans plan:Filter ((isnotnull(d004#205) && > (rand(-4530215890880734772) <= 0.5)) && NOT (cast(cast(d004#205 as > decimal(10,0)) as decimal(11,1)) = 0.0)) > +- MetastoreRelation XXX_database, XXX_table > HiveTableScans plan:MetastoreRelation XXX_database, XXX_table > ``` > so HiveTableScan will read all the fields from the table, but we only need > ‘d004’ and 'c010'. This affects task performance. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24731) java.io.IOException: s3n://bucketname: 400 : Bad Request
[ https://issues.apache.org/jira/browse/SPARK-24731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24731. -- Resolution: Incomplete > java.io.IOException: s3n://bucketname: 400 : Bad Request > - > > Key: SPARK-24731 > URL: https://issues.apache.org/jira/browse/SPARK-24731 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.3.1 >Reporter: sivakphani >Priority: Major > Labels: bulk-closed > > I wrote code for connecting aws s3 bucket for read json file through pyspark. > when i submit in locally it getting this error > File "PYSPARK_examples/Pyspark11.py", line 105, in > df=sqlContext.read.json(path) > File > "/usr/local/spark-2.3.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/readwriter.py", > line 261, in json > File > "/usr/local/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", > line 1257, in __call__ > File > "/usr/local/spark-2.3.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/utils.py", > line 63, in deco > File > "/usr/local/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", > line 328, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o29.json. > : java.io.IOException: s3n://bucketname: 400 : Bad Request > at > org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.processException(Jets3tNativeFileSystemStore.java:453) > at > org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.processException(Jets3tNativeFileSystemStore.java:427) > at > org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.handleException(Jets3tNativeFileSystemStore.java:411) > at > org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:181) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > at org.apache.hadoop.fs.s3native.$Proxy12.retrieveMetadata(Unknown > Source) > at > org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:476) > at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426) > at > org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:714) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:389) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:389) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at scala.collection.immutable.List.foreach(List.scala:381) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.immutable.List.flatMap(List.scala:344) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:388) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227) > at 
org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:397) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:282) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:238) > at java.lang.Thread.run(Thread.java:748) > Caused by: org.jets3t.service.impl.rest.HttpException: 400 Bad Request > at > org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:425) > at >
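A note on the failure above: the 400 response usually comes from the legacy s3n (Jets3t) connector, which cannot use the V4 request signing that newer AWS regions require. Below is a hedged workaround sketch, not part of the ticket: it assumes the bucket lives in a V4-only region, that hadoop-aws and the AWS SDK are on the classpath, and uses placeholder bucket, region, and credential sources.

{code:scala}
// Workaround sketch (an assumption, not from the ticket): switch from s3n to s3a and
// set credentials plus the regional endpoint explicitly before reading the JSON file.
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
hadoopConf.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
// V4-only regions (e.g. eu-central-1) need the regional endpoint to avoid 400 errors.
hadoopConf.set("fs.s3a.endpoint", "s3.eu-central-1.amazonaws.com")
val df = spark.read.json("s3a://bucketname/path/to/file.json")   // placeholder path
{code}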
[jira] [Resolved] (SPARK-21405) Add LBFGS solver for GeneralizedLinearRegression
[ https://issues.apache.org/jira/browse/SPARK-21405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-21405. -- Resolution: Incomplete > Add LBFGS solver for GeneralizedLinearRegression > > > Key: SPARK-21405 > URL: https://issues.apache.org/jira/browse/SPARK-21405 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Seth Hendrickson >Priority: Major > Labels: bulk-closed > > GeneralizedLinearRegression in Spark ML currently only allows 4096 features > because it uses IRLS, and hence WLS, as an optimizer which relies on > collecting the covariance matrix to the driver. GLMs can also be fit by > simple gradient based methods like LBFGS. > The new API from > [SPARK-19762|https://issues.apache.org/jira/browse/SPARK-19762] makes this > easy to add. I've already prototyped it, and it works pretty well. This > change would allow an arbitrary number of features (up to what can fit on a > single node) as in Linear/Logistic regression. > For reference, other GLM packages also support this - e.g. statsmodels, H2O. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
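For context, here is a sketch of how the proposed solver might be selected through the existing GeneralizedLinearRegression API. The "lbfgs" value is hypothetical (released versions only accept "irls"), and `training` is an assumed DataFrame with label and features columns.

{code:scala}
import org.apache.spark.ml.regression.GeneralizedLinearRegression

// Hypothetical sketch: "lbfgs" is the solver value this ticket asks for; today only
// "irls" is accepted, which is what imposes the 4096-feature limit via WLS.
val glr = new GeneralizedLinearRegression()
  .setFamily("poisson")
  .setLink("log")
  .setSolver("lbfgs")   // hypothetical value, not available in released Spark
  .setMaxIter(100)

// val model = glr.fit(training)   // `training` is an assumed (label, features) DataFrame
{code}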
[jira] [Resolved] (SPARK-22868) 64KB JVM bytecode limit problem with aggregation
[ https://issues.apache.org/jira/browse/SPARK-22868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-22868. -- Resolution: Incomplete > 64KB JVM bytecode limit problem with aggregation > > > Key: SPARK-22868 > URL: https://issues.apache.org/jira/browse/SPARK-22868 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.1, 2.3.0 >Reporter: Kazuaki Ishizaki >Priority: Major > Labels: bulk-closed > > The following programs can throw an exception due to the 64KB JVM bytecode > limit > {code} > val df = spark.sparkContext.parallelize( > Seq((1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9, 10.0, 11.1, 12.2, > 13.3, 14.4, 15.5, 16.6, 17.7, 18.8, 19.9, 20.0, 21.1, 22.2)), > 1).toDF() > df.agg( > kurtosis('_1), kurtosis('_2), kurtosis('_3), kurtosis('_4), > kurtosis('_5), > kurtosis('_6), kurtosis('_7), kurtosis('_8), kurtosis('_9), > kurtosis('_10), > kurtosis('_11), kurtosis('_12), kurtosis('_13), kurtosis('_14), > kurtosis('_15) > ).collect > df.groupBy('_22) > .agg( > kurtosis('_1), kurtosis('_2), kurtosis('_3), kurtosis('_4), > kurtosis('_5), > kurtosis('_6), kurtosis('_7), kurtosis('_8), kurtosis('_9), > kurtosis('_10), > kurtosis('_11), kurtosis('_12), kurtosis('_13), kurtosis('_14), > kurtosis('_15) > ).collect > df.groupBy( > round('_1, 0), round('_2, 0), round('_3, 0), round('_4, 0), > round('_5, 0), > round('_6, 0), round('_7, 0), round('_8, 0), round('_9, 0), > round('_10, 0)) > .agg( > kurtosis('_1), kurtosis('_2), kurtosis('_3), kurtosis('_4), > kurtosis('_5), > kurtosis('_6), kurtosis('_7) > ).collect > {code} > */ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
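Until the codegen issue itself is addressed, one way to keep such jobs running is to split the wide aggregation into smaller batches. This is a hedged mitigation sketch using the same `df` and kurtosis calls as the repro above, not something proposed in the ticket.

{code:scala}
import org.apache.spark.sql.functions.{col, kurtosis}

// Mitigation sketch (an assumption, not from the ticket): aggregate the columns in
// batches of five so each generated method stays well under the 64KB JVM limit,
// then combine the per-batch result rows on the driver.
val targetCols = (1 to 15).map(i => s"_$i")
val perBatchRows = targetCols.grouped(5).map { batch =>
  df.agg(kurtosis(col(batch.head)), batch.tail.map(c => kurtosis(col(c))): _*).head()
}.toSeq
{code}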
[jira] [Resolved] (SPARK-24447) Pyspark RowMatrix.columnSimilarities() loses spark context
[ https://issues.apache.org/jira/browse/SPARK-24447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24447. -- Resolution: Incomplete > Pyspark RowMatrix.columnSimilarities() loses spark context > -- > > Key: SPARK-24447 > URL: https://issues.apache.org/jira/browse/SPARK-24447 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 2.3.0 >Reporter: Perry Chu >Priority: Minor > Labels: bulk-closed > > The RDD behind the CoordinateMatrix returned by > RowMatrix.columnSimilarities() appears to be losing track of the spark > context if spark is stopped and restarted in pyspark. > I'm pretty new to spark - not sure if the problem is on the python side or > the scala side - would appreciate someone more experienced taking a look. > This snippet should reproduce the error: > {code:java} > import pyspark > from pyspark.mllib.linalg.distributed import RowMatrix > spark.stop() > spark = pyspark.sql.SparkSession.builder.getOrCreate() > rows = spark.sparkContext.parallelize([[0,1,2],[1,1,1]]) > matrix = RowMatrix(rows) > sims = matrix.columnSimilarities() > ## This works, prints "3 3" as expected (3 columns = 3x3 matrix) > print(sims.numRows(),sims.numCols()) > ## This throws an error (stack trace below) > print(sims.entries.first()) > ## Later I tried this > print(rows.context) # > print(sims.entries.context) # PySparkShell>, then throws an error{code} > Error stack trace > {code:java} > --- > AttributeError Traceback (most recent call last) > in () > > 1 sims.entries.first() > /usr/lib/spark/python/pyspark/rdd.py in first(self) > 1374 ValueError: RDD is empty > 1375 """ > -> 1376 rs = self.take(1) > 1377 if rs: > 1378 return rs[0] > /usr/lib/spark/python/pyspark/rdd.py in take(self, num) > 1356 > 1357 p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts)) > -> 1358 res = self.context.runJob(self, takeUpToNumLeft, p) > 1359 > 1360 items += res > /usr/lib/spark/python/pyspark/context.py in runJob(self, rdd, partitionFunc, > partitions, allowLocal) > 999 # SparkContext#runJob. > 1000 mappedRDD = rdd.mapPartitions(partitionFunc) > -> 1001 port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, > partitions) > 1002 return list(_load_from_socket(port, mappedRDD._jrdd_deserializer)) > 1003 > AttributeError: 'NoneType' object has no attribute 'sc' > {code} > PySpark columnSimilarities documentation > [http://spark.apache.org/docs/latest/api/python/_modules/pyspark/mllib/linalg/distributed.html#RowMatrix.columnSimilarities] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23612) Specify formats for individual DateType and TimestampType columns in schemas
[ https://issues.apache.org/jira/browse/SPARK-23612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23612. -- Resolution: Incomplete > Specify formats for individual DateType and TimestampType columns in schemas > > > Key: SPARK-23612 > URL: https://issues.apache.org/jira/browse/SPARK-23612 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Patrick Young >Priority: Minor > Labels: DataType, bulk-closed, date, spree, sql > > [https://github.com/apache/spark/blob/407f67249639709c40c46917700ed6dd736daa7d/python/pyspark/sql/types.py#L162-L200] > It would be very helpful if it were possible to specify the format for > individual columns in a schema when reading csv files, rather than one format: > {code:java|title=Bar.python|borderStyle=solid} > # Currently can only do something like: > spark.read.option("dateFormat", "MMdd").csv(...) > # Would like to be able to do something like: > schema = StructType([ > StructField("date1", DateType(format="MM/dd/"), True), > StructField("date2", DateType(format="MMdd"), True) > ] > read.schema(schema).csv(...) > {code} > Thanks for any help, input! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
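Until per-column formats exist in the schema API, the usual workaround is to read the date columns as strings and convert each one with its own format. The Scala sketch below mirrors the Python example above; the full format strings are assumed (the ticket's snippet appears to have lost the year part to Jira rendering) and the input path is a placeholder.

{code:scala}
import org.apache.spark.sql.functions.{col, to_date}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Workaround sketch, not from the ticket: declare both date columns as strings,
// then parse each with its own format via to_date.
val schema = StructType(Seq(
  StructField("date1", StringType, nullable = true),
  StructField("date2", StringType, nullable = true)))

val parsed = spark.read.schema(schema).csv("/path/to/input.csv")   // placeholder path
  .withColumn("date1", to_date(col("date1"), "MM/dd/yyyy"))        // assumed format
  .withColumn("date2", to_date(col("date2"), "yyyyMMdd"))          // assumed format
{code}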
[jira] [Resolved] (SPARK-23442) Reading from partitioned and bucketed table uses only bucketSpec.numBuckets partitions in all cases
[ https://issues.apache.org/jira/browse/SPARK-23442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23442. -- Resolution: Incomplete > Reading from partitioned and bucketed table uses only bucketSpec.numBuckets > partitions in all cases > --- > > Key: SPARK-23442 > URL: https://issues.apache.org/jira/browse/SPARK-23442 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.2.1 >Reporter: Pranav Rao >Priority: Major > Labels: bulk-closed > > Through the DataFrameWriter[T] interface I have created a external HIVE table > with 5000 (horizontal) partitions and 50 buckets in each partition. Overall > the dataset is 600GB and the provider is Parquet. > Now this works great when joining with a similarly bucketed dataset - it's > able to avoid a shuffle. > But any action on this Dataframe(from _spark.table("tablename")_), works with > only 50 RDD partitions. This is happening because of > [createBucketedReadRDD|https://github.com/apachttps:/github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.she/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.sc]. > So the 600GB dataset is only read through 50 tasks, which makes this > partitioning + bucketing scheme not useful. > I cannot expose the base directory of the parquet folder for reading the > dataset, because the partition locations don't follow a (basePath + partSpec) > format. > Meanwhile, are there workarounds to use higher parallelism while reading such > a table? > Let me know if I can help in any way. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
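On the workaround question above: the only option that comes to mind, and it is a trade-off rather than a fix, is to repartition right after the scan so downstream work is not limited to numBuckets tasks. This reintroduces a shuffle and therefore gives up the join benefit the bucketing provides. A hedged sketch with placeholder names and counts:

{code:scala}
import org.apache.spark.sql.functions.col

// Workaround sketch (an assumption, not from the ticket): widen parallelism after the
// bucket-limited scan at the cost of a shuffle.
val t = spark.table("tablename")
val widened = t.repartition(2000)                    // 2000 is an arbitrary example value
widened.groupBy(col("someColumn")).count().collect() // "someColumn" is a placeholder
{code}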
[jira] [Resolved] (SPARK-21730) Consider officially dropping PyPy pre-2.5 support
[ https://issues.apache.org/jira/browse/SPARK-21730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-21730. -- Resolution: Incomplete > Consider officially dropping PyPy pre-2.5 support > - > > Key: SPARK-21730 > URL: https://issues.apache.org/jira/browse/SPARK-21730 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Holden Karau >Priority: Major > Labels: bulk-closed > > Jenkins currently tests with PyPy 2.5+, should we consider dropping 2.3 > support from the documentation? > cc [~davies] [~shaneknapp] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23485) Kubernetes should support node blacklist
[ https://issues.apache.org/jira/browse/SPARK-23485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23485. -- Resolution: Incomplete > Kubernetes should support node blacklist > > > Key: SPARK-23485 > URL: https://issues.apache.org/jira/browse/SPARK-23485 > Project: Spark > Issue Type: New Feature > Components: Kubernetes, Scheduler >Affects Versions: 2.3.0 >Reporter: Imran Rashid >Priority: Major > Labels: bulk-closed > > Spark's BlacklistTracker maintains a list of "bad nodes" which it will not > use for running tasks (eg., because of bad hardware). When running in yarn, > this blacklist is used to avoid ever allocating resources on blacklisted > nodes: > https://github.com/apache/spark/blob/e836c27ce011ca9aef822bef6320b4a7059ec343/resource-managers/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala#L128 > I'm just beginning to poke around the kubernetes code, so apologies if this > is incorrect -- but I didn't see any references to > {{scheduler.nodeBlacklist()}} in {{KubernetesClusterSchedulerBackend}} so it > seems this is missing. Thought of this while looking at SPARK-19755, a > similar issue on mesos. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21443) Very long planning duration for queries with lots of operations
[ https://issues.apache.org/jira/browse/SPARK-21443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-21443. -- Resolution: Incomplete > Very long planning duration for queries with lots of operations > --- > > Key: SPARK-21443 > URL: https://issues.apache.org/jira/browse/SPARK-21443 > Project: Spark > Issue Type: Bug > Components: SQL, Structured Streaming >Affects Versions: 2.2.0 >Reporter: Eyal Zituny >Priority: Minor > Labels: bulk-closed > > Creating a streaming query with large amount of operations and fields (100+) > results in a very long query planning phase. in the example bellow, the plan > phase has taken 35 seconds while the actual batch execution took only 1.3 > second. > after some investigation, i have found out that the root causes of this are > 2 optimizer rules which seems to take most of the planning time: > InferFiltersFromConstraints and PruneFilters > I would suggest the following: > # fix the inefficient optimizer rules > # add warn level logging if a rule has taken more then xx ms > # allow custom removing of optimizer rules (opposite to > spark.experimental.extraOptimizations) > # reuse query plans (optional) where possible > reproducing this issue can be done with the bellow script which simulates the > scenario: > {code:java} > import org.apache.spark.sql.SparkSession > import org.apache.spark.sql.execution.streaming.MemoryStream > import > org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, > QueryStartedEvent, QueryTerminatedEvent} > import org.apache.spark.sql.streaming.{ProcessingTime, StreamingQueryListener} > case class Product(pid: Long, name: String, price: Long, ts: Long = > System.currentTimeMillis()) > case class Events (eventId: Long, eventName: String, productId: Long) { > def this(id: Long) = this(id, s"event$id", id%100) > } > object SparkTestFlow { > def main(args: Array[String]): Unit = { > val spark = SparkSession > .builder > .appName("TestFlow") > .master("local[8]") > .getOrCreate() > spark.sqlContext.streams.addListener(new StreamingQueryListener > { > override def onQueryTerminated(event: > QueryTerminatedEvent): Unit = {} > override def onQueryProgress(event: > QueryProgressEvent): Unit = { > if (event.progress.numInputRows>0) { > println(event.progress.toString()) > } > } > override def onQueryStarted(event: QueryStartedEvent): > Unit = {} > }) > > import spark.implicits._ > implicit val sclContext = spark.sqlContext > import org.apache.spark.sql.functions.expr > val seq = (1L to 100L).map(i => Product(i, s"name$i", 10L*i)) > val lookupTable = spark.createDataFrame(seq) > val inputData = MemoryStream[Events] > inputData.addData((1L to 100L).map(i => new Events(i))) > val events = inputData.toDF() > .withColumn("w1", expr("0")) > .withColumn("x1", expr("0")) > .withColumn("y1", expr("0")) > .withColumn("z1", expr("0")) > val numberOfSelects = 40 // set to 100+ and the planning takes > forever > val dfWithSelectsExpr = (2 to > numberOfSelects).foldLeft(events)((df,i) =>{ > val arr = df.columns.++(Array(s"w${i-1} + rand() as > w$i", s"x${i-1} + rand() as x$i", s"y${i-1} + 2 as y$i", s"z${i-1} +1 as > z$i")) > df.selectExpr(arr:_*) > }) > val withJoinAndFilter = dfWithSelectsExpr > .join(lookupTable, expr("productId = pid")) > .filter("productId < 50") > val query = withJoinAndFilter.writeStream > .outputMode("append") > .format("console") > .trigger(ProcessingTime(2000)) > .start() > query.processAllAvailable() > spark.stop() > } > } > {code} > the query progress output 
will show: > {code:java} > "durationMs" : { > "addBatch" : 1310, > "getBatch" : 6, > "getOffset" : 0, > "*queryPlanning*" : 36924, > "triggerExecution" : 38297, > "walCommit" : 33 > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
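On suggestion (3) above: later releases (Spark 2.4+) added a configuration for excluding optimizer rules, which would let the two expensive rules be disabled per job. It does not exist in the 2.2.0 version this ticket targets, so the snippet below is illustrative only.

{code:scala}
// Only applicable on Spark 2.4 or later; shown as a sketch of suggestion (3).
spark.conf.set("spark.sql.optimizer.excludedRules",
  "org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints," +
  "org.apache.spark.sql.catalyst.optimizer.PruneFilters")
{code}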
[jira] [Resolved] (SPARK-24946) PySpark - Allow np.Arrays and pd.Series in df.approxQuantile
[ https://issues.apache.org/jira/browse/SPARK-24946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24946. -- Resolution: Incomplete > PySpark - Allow np.Arrays and pd.Series in df.approxQuantile > > > Key: SPARK-24946 > URL: https://issues.apache.org/jira/browse/SPARK-24946 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.1 >Reporter: Paul Westenthanner >Priority: Minor > Labels: DataFrame, beginner, bulk-closed, pyspark > > As a Python user, it would be convenient to pass a numpy array or pandas Series > to `{{approxQuantile}}(_col_, _probabilities_, _relativeError_)` as the > probabilities parameter. > > Especially for creating cumulative plots (say in 1% steps) it is handy to use > `approxQuantile(col, np.arange(0, 1.0, 0.01), relativeError)`. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
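On the JVM side the method already accepts a plain array of probabilities, so the cumulative-plot use case looks like the Scala sketch below (`df` and the column name are placeholders); in PySpark today the numpy array has to be converted to a regular Python list before being passed.

{code:scala}
// Sketch of the cumulative-plot use case: a 1% probability grid passed to
// approxQuantile. `df` and "value" are placeholders.
val probabilities = (0 until 100).map(_ / 100.0).toArray
val quantiles = df.stat.approxQuantile("value", probabilities, 0.01)
{code}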
[jira] [Resolved] (SPARK-23833) Incorrect primitive type check for input arguments of udf
[ https://issues.apache.org/jira/browse/SPARK-23833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23833. -- Resolution: Incomplete > Incorrect primitive type check for input arguments of udf > - > > Key: SPARK-23833 > URL: https://issues.apache.org/jira/browse/SPARK-23833 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 2.2.0, 2.3.0 >Reporter: Valentin Nikotin >Priority: Major > Labels: bulk-closed > > There is claimed behavior for scala UDFs with primitive type arguments: > {quote}Note that if you use primitive parameters, you are not able to check > if it is null or not, and the UDF will return null for you if the primitive > input is null. > {quote} > This is initial issue - SPARK-11725 > Correspondent pr - > [PR|https://github.com/apache/spark/pull/9770/commits/a8a30674ce531c9cd10107200a3f72f9539cd8f6] > The problem is that {{ScalaReflection.getParameterTypes}} doesn't work > correctly due to type erasure. > The correct check "if type is primitive" should be based on typeTag something > like this: > {code:java} > typeTag[T].tpe.typeSymbol.asClass.isPrimitive > {code} > > The problem appears if we have high order functions: > {code:java} > val f = (x: Long) => x > def identity[T, U](f: T => U): T => U = (t: T) => f(t) > val udf0 = udf(f) > val udf1 = udf(identity(f)) > val getNull = udf(() => null.asInstanceOf[java.lang.Long]) > spark.range(5).toDF(). > withColumn("udf0", udf0(getNull())). > withColumn("udf1", udf1(getNull())). > show() > spark.range(5).toDF(). > withColumn("udf0", udf0(getNull())). > withColumn("udf1", udf1(getNull())). > explain() > {code} > Test execution on Spark 2.2 spark-shell: > {code:java} > scala> val f = (x: Long) => x > f: Long => Long = > scala> def identity[T, U](f: T => U): T => U = (t: T) => f(t) > identity: [T, U](f: T => U)T => U > scala> val udf0 = udf(f) > udf0: org.apache.spark.sql.expressions.UserDefinedFunction = > UserDefinedFunction(,LongType,Some(List(LongType))) > scala> val udf1 = udf(identity(f)) > udf1: org.apache.spark.sql.expressions.UserDefinedFunction = > UserDefinedFunction(,LongType,Some(List(LongType))) > scala> val getNull = udf(() => null.asInstanceOf[java.lang.Long]) > getNull: org.apache.spark.sql.expressions.UserDefinedFunction = > UserDefinedFunction(,LongType,Some(List())) > scala> spark.range(5).toDF(). > | withColumn("udf0", udf0(getNull())). > | withColumn("udf1", udf1(getNull())). > | show() > +---+++ > > | id|udf0|udf1| > +---+++ > | 0|null| 0| > | 1|null| 0| > | 2|null| 0| > | 3|null| 0| > | 4|null| 0| > +---+++ > scala> spark.range(5).toDF(). > | withColumn("udf0", udf0(getNull())). > | withColumn("udf1", udf1(getNull())). 
> | explain() > == Physical Plan == > *Project [id#19L, if (isnull(UDF())) null else UDF(UDF()) AS udf0#24L, > UDF(UDF()) AS udf1#28L] > +- *Range (0, 5, step=1, splits=6) > {code} > > The typeTag information about input parameters is available in udf function > but only used to get schema, it should be added to ScalaUDF too so that we > can used it later: > {code:java} > def udf[RT: TypeTag, A1: TypeTag, A2: TypeTag](f: Function2[A1, A2, RT]): > UserDefinedFunction = { > val inputTypes = Try(ScalaReflection.schemaFor(typeTag[A1]).dataType :: > ScalaReflection.schemaFor(typeTag[A2]).dataType :: Nil).toOption > UserDefinedFunction(f, ScalaReflection.schemaFor(typeTag[RT]).dataType, > inputTypes) > } > {code} > > Here is current vs desired version: > {code:java} > scala> import org.apache.spark.sql.catalyst.ScalaReflection > import org.apache.spark.sql.catalyst.ScalaReflection > scala> ScalaReflection.getParameterTypes(identity(f)) > res2: Seq[Class[_]] = WrappedArray(class java.lang.Object) > scala> ScalaReflection.getParameterTypes(identity(f)).map(_.isPrimitive) > res7: Seq[Boolean] = ArrayBuffer(false) > {code} > versus > {code:java} > scala> import scala.reflect.runtime.universe.{typeTag, TypeTag} > import scala.reflect.runtime.universe.{typeTag, TypeTag} > scala> def myGetParameterTypes[T : TypeTag, U](func: T => U) = { > | typeTag[T].tpe.typeSymbol.asClass > | } > myGetParameterTypes: [T, U](func: T => U)(implicit evidence$1: > reflect.runtime.universe.TypeTag[T])reflect.runtime.universe.ClassSymbol > scala> myGetParameterTypes(f) > res3: reflect.runtime.universe.ClassSymbol = class Long > scala> myGetParameterTypes(f).isPrimitive > res4: Boolean = true > {code} > Although for this case there is workaround with using {{@specialized(Long)}} > {code:scala} > scala> def identity2[@specialized(Long) T, U](f: T => U): T => U = (t: T) => >
[jira] [Resolved] (SPARK-25215) Make PipelineModel public
[ https://issues.apache.org/jira/browse/SPARK-25215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-25215. -- Resolution: Incomplete > Make PipelineModel public > - > > Key: SPARK-25215 > URL: https://issues.apache.org/jira/browse/SPARK-25215 > Project: Spark > Issue Type: Wish > Components: ML >Affects Versions: 2.3.1 >Reporter: Nicholas Resnick >Priority: Minor > Labels: ML, bulk-closed > > Can PipelineModel be made public? I can't think of a reason for having it be > private to the ml package. > My specific use-case is that I'm creating a feature for > serializing/deserializing PipelineModels in a production environment, and am > trying to write specs for this feature, but I can't easily create > PipelineModels for the specs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
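While the constructor stays package-private, the common way to obtain a PipelineModel for tests is to fit a trivial Pipeline. A hedged sketch follows; the stage choice, column names, and save path are assumptions, not from the ticket.

{code:scala}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.Binarizer

// Workaround sketch: build a PipelineModel indirectly by fitting a Pipeline whose
// only stage needs no training statistics, then exercise save/load round-trips.
val binarizer = new Binarizer()
  .setInputCol("value").setOutputCol("flag").setThreshold(0.5)
val df = spark.range(5).selectExpr("cast(id as double) as value")
val pipelineModel = new Pipeline().setStages(Array(binarizer)).fit(df)
pipelineModel.write.overwrite().save("/tmp/pipeline-model-spec")   // placeholder path
{code}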
[jira] [Resolved] (SPARK-23322) Launcher handles can miss application updates if application finishes too quickly
[ https://issues.apache.org/jira/browse/SPARK-23322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23322. -- Resolution: Incomplete > Launcher handles can miss application updates if application finishes too > quickly > - > > Key: SPARK-23322 > URL: https://issues.apache.org/jira/browse/SPARK-23322 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Masiero Vanzin >Priority: Minor > Labels: bulk-closed > > This is the underlying issue in SPARK-23020, which was worked around in our > tests, but still exist in the code. > If a child application finishes too quickly, the launcher code may clean up > the handle's state before the connection from the child has been > acknowledged. This means than the application handle will have a final state > LOST instead of whatever final state the application sent. > This doesn't seem to affect child processes as much as the new in-process > launch mode, and it requires the child application to finish very quickly, > which should be rare for the kind of use case the launcher library covers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24560) Fix some getTimeAsMs as getTimeAsSeconds
[ https://issues.apache.org/jira/browse/SPARK-24560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24560. -- Resolution: Incomplete > Fix some getTimeAsMs as getTimeAsSeconds > > > Key: SPARK-24560 > URL: https://issues.apache.org/jira/browse/SPARK-24560 > Project: Spark > Issue Type: Improvement > Components: Mesos, Spark Core >Affects Versions: 2.3.1 >Reporter: xueyu >Priority: Major > Labels: bulk-closed > > There are some places using "getTimeAsMs" rather than "getTimeAsSeconds". > This will return a wrong value when the user specifies a value without a time > unit. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
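To make the failure mode concrete: the two helpers assume different default units when the configured value has no suffix, so reading a seconds-style setting with getTimeAsMs silently shrinks it by a factor of 1000. A small illustration, with a hypothetical config key:

{code:scala}
import org.apache.spark.SparkConf

// "spark.example.timeout" is a hypothetical key set without a time-unit suffix.
val conf = new SparkConf().set("spark.example.timeout", "120")
conf.getTimeAsSeconds("spark.example.timeout")   // 120 seconds
conf.getTimeAsMs("spark.example.timeout")        // 120 milliseconds, i.e. 0.12 seconds
{code}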
[jira] [Resolved] (SPARK-22202) Release tgz content differences for python and R
[ https://issues.apache.org/jira/browse/SPARK-22202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-22202. -- Resolution: Incomplete > Release tgz content differences for python and R > > > Key: SPARK-22202 > URL: https://issues.apache.org/jira/browse/SPARK-22202 > Project: Spark > Issue Type: Bug > Components: PySpark, SparkR >Affects Versions: 2.1.2, 2.2.0, 2.3.0 >Reporter: Felix Cheung >Priority: Minor > Labels: bulk-closed > > As a follow up to SPARK-22167, currently we are running different > profiles/steps in make-release.sh for hadoop2.7 vs hadoop2.6 (and others), we > should consider if these differences are significant and whether they should > be addressed. > A couple of things: > - R.../doc directory is not in any release jar except hadoop 2.6 > - python/dist, python.egg-info are not in any release jar except hadoop 2.7 > - R DESCRIPTION has a few additions > I've checked to confirm these are the same in 2.1.1 release so this isn't a > regression. > {code} > spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/doc: > sparkr-vignettes.Rmd > sparkr-vignettes.R > sparkr-vignettes.html > index.html > Only in spark-2.1.2-bin-hadoop2.7/python: dist > Only in spark-2.1.2-bin-hadoop2.7/python/pyspark: python > Only in spark-2.1.2-bin-hadoop2.7/python: pyspark.egg-info > diff -r spark-2.1.2-bin-hadoop2.7/R/lib/SparkR/DESCRIPTION > spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/DESCRIPTION > 25a26,27 > > NeedsCompilation: no > > Packaged: 2017-10-03 00:42:30 UTC; holden > 31c33 > < Built: R 3.4.1; ; 2017-10-02 23:18:21 UTC; unix > --- > > Built: R 3.4.1; ; 2017-10-03 00:45:27 UTC; unix > Only in spark-2.1.2-bin-hadoop2.6/R/lib/SparkR: doc > diff -r spark-2.1.2-bin-hadoop2.7/R/lib/SparkR/html/00Index.html > spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/html/00Index.html > 16a17 > > User guides, package vignettes and other > > documentation. > Only in spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/Meta: vignette.rds > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20153) Support Multiple aws credentials in order to access multiple Hive on S3 table in spark application
[ https://issues.apache.org/jira/browse/SPARK-20153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-20153. -- Resolution: Incomplete > Support Multiple aws credentials in order to access multiple Hive on S3 table > in spark application > --- > > Key: SPARK-20153 > URL: https://issues.apache.org/jira/browse/SPARK-20153 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.1, 2.1.0 >Reporter: Franck Tago >Priority: Minor > Labels: bulk-closed > > I need to access multiple hive tables in my spark application where each hive > table is > 1- an external table with data sitting on S3 > 2- each table is owned by a different AWS user, so I need to provide different > AWS credentials. > I am familiar with setting the aws credentials in the hadoop configuration > object, but that does not really help me because I can only set one pair of > (fs.s3a.awsAccessKeyId, fs.s3a.awsSecretAccessKey) > From my research, there is no easy or elegant way to do this in Spark. > Why is that? > How do I address this use case? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
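One answer that does not require Spark changes, assuming a recent enough Hadoop on the classpath (the S3A per-bucket configuration arrived in Hadoop 2.8), is to scope the credentials to each bucket. A hedged sketch with placeholder bucket names and key values:

{code:scala}
// Sketch (an assumption, not from the ticket): per-bucket S3A credentials let two
// Hive-on-S3 tables owned by different AWS accounts be read from one application.
val hc = spark.sparkContext.hadoopConfiguration
hc.set("fs.s3a.bucket.bucket-a.access.key", "ACCESS_KEY_FOR_A")
hc.set("fs.s3a.bucket.bucket-a.secret.key", "SECRET_KEY_FOR_A")
hc.set("fs.s3a.bucket.bucket-b.access.key", "ACCESS_KEY_FOR_B")
hc.set("fs.s3a.bucket.bucket-b.secret.key", "SECRET_KEY_FOR_B")
{code}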
[jira] [Resolved] (SPARK-24456) Spark submit - server environment variables are overwritten by client environment variables
[ https://issues.apache.org/jira/browse/SPARK-24456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24456. -- Resolution: Incomplete > Spark submit - server environment variables are overwritten by client > environment variables > > > Key: SPARK-24456 > URL: https://issues.apache.org/jira/browse/SPARK-24456 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.3.0 >Reporter: Alon Shoham >Priority: Minor > Labels: bulk-closed > > When submitting a spark application in --deploy-mode cluster + spark > standalone cluster, environment variables from the client machine overwrite > server environment variables. > > We use *SPARK_DIST_CLASSPATH* environment variable to add extra required > dependencies to the application. We observed that client machine > SPARK_DIST_CLASSPATH overwrite remote server machine value, resulting in > application submission failure. > > We have inspected the code and found: > 1. In org.apache.spark.deploy.Client line 86: > {code:java} > val command = new Command(mainClass, > Seq("{{WORKER_URL}}", "{{USER_JAR}}", driverArgs.mainClass) ++ > driverArgs.driverOptions, > sys.env, classPathEntries, libraryPathEntries, javaOpts){code} > 2. In org.apache.spark.launcher.WorkerCommandBuilder line 35: > {code:java} > childEnv.putAll(command.environment.asJava) > childEnv.put(CommandBuilderUtils.ENV_SPARK_HOME, sparkHome){code} > Seen in line 35 is that the environment is overwritten in the server machine > but in line 36 the SPARK_HOME is restored to the server value. > We think the bug can be fixed by adding a line that restores > SPARK_DIST_CLASSPATH to its server value, similar to SPARK_HOME > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17877) Can not checkpoint connectedComponents resulting graph
[ https://issues.apache.org/jira/browse/SPARK-17877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-17877. -- Resolution: Incomplete > Can not checkpoint connectedComponents resulting graph > -- > > Key: SPARK-17877 > URL: https://issues.apache.org/jira/browse/SPARK-17877 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.5.2, 1.6.2, 2.0.1, 2.3.0 >Reporter: Alexander Pivovarov >Priority: Minor > Labels: bulk-closed > > The following code demonstrates the issue > {code} > import org.apache.spark.graphx._ > val users = sc.parallelize(List(3L -> "lucas", 7L -> "john", 5L -> "matt", 2L > -> "kelly")) > val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, > "advisor"), Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi"))) > sc.setCheckpointDir("/tmp/check") > val g = Graph(users, rel) > g.checkpoint // /tmp/check/b1f46ba5-357a-4d6d-8f4d-411b64b27c2f appears > val gg = g.connectedComponents > gg.checkpoint > gg.vertices.collect > gg.edges.collect > gg.isCheckpointed // res5: Boolean = false, /tmp/check still contains only > 1 folder b1f46ba5-357a-4d6d-8f4d-411b64b27c2f > {code} > I think the last line should return true instead of false -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23298) distinct.count on Dataset/DataFrame yields non-deterministic results
[ https://issues.apache.org/jira/browse/SPARK-23298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23298. -- Resolution: Incomplete > distinct.count on Dataset/DataFrame yields non-deterministic results > > > Key: SPARK-23298 > URL: https://issues.apache.org/jira/browse/SPARK-23298 > Project: Spark > Issue Type: Bug > Components: Shuffle, SQL, YARN >Affects Versions: 2.1.0, 2.2.0 > Environment: Spark 2.2.0 or 2.1.0 > Java 1.8.0_144 > Yarn version: > {code:java} > Hadoop 2.6.0-cdh5.12.1 > Subversion http://github.com/cloudera/hadoop -r > 520d8b072e666e9f21d645ca6a5219fc37535a52 > Compiled by jenkins on 2017-08-24T16:43Z > Compiled with protoc 2.5.0 > From source with checksum de51bf9693ab9426379a1cd28142cea0 > This command was run using > /usr/lib/hadoop/hadoop-common-2.6.0-cdh5.12.1.jar{code} > > >Reporter: Mateusz Jukiewicz >Priority: Major > Labels: Correctness, CorrectnessBug, bulk-closed, correctness > > This is what happens (EDIT - managed to get a reproducible example): > {code:java} > /* Exemplary spark-shell starting command > /opt/spark/bin/spark-shell \ > --num-executors 269 \ > --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \ > --conf spark.kryoserializer.buffer.max=512m > // The spark.sql.shuffle.partitions is 2154 here, if that matters > */ > val df = spark.range(1000).withColumn("col1", (rand() * > 1000).cast("long")).withColumn("col2", (rand() * > 1000).cast("long")).drop("id") > df.repartition(5240).write.parquet("/test.parquet") > // Then, ideally in a new session > val df = spark.read.parquet("/test.parquet") > df.distinct.count > // res1: Long = 1001256 > > df.distinct.count > // res2: Long = 55 {code} > -The _text_dataset.out_ file is a dataset with one string per line. The > string has alphanumeric characters as well as colons and spaces. The line > length does not exceed 1200. I don't think that's important though, as the > issue appeared on various other datasets, I just tried to narrow it down to > the simplest possible case.- (the case is now fully reproducible with the > above code) > The observations regarding the issue are as follows: > * I managed to reproduce it on both spark 2.2 and spark 2.1. > * The issue occurs in YARN cluster mode (I haven't tested YARN client mode). > * The issue is not reproducible on a single machine (e.g. laptop) in spark > local mode. > * It seems that once the correct count is computed, it is not possible to > reproduce the issue in the same spark session. In other words, I was able to > get 2-3 incorrect distinct.count results consecutively, but once it got > right, it always returned the correct value. I had to re-run spark-shell to > observe the problem again. > * The issue appears on both Dataset and DataFrame (i.e. using read.text or > read.textFile). > * The issue is not reproducible on RDD (i.e. dataset.rdd.distinct.count). > * Not a single container has failed in those multiple invalid executions. > * YARN doesn't show any warnings or errors in those invalid executions. > * The execution plan determined for both valid and invalid executions was > always the same (it's shown in the _SQL_ tab of the UI). > * The number returned in the invalid executions was always greater than the > correct number (24 014 227). > * This occurs even though the input is already completely deduplicated (i.e. > _distinct.count_ shouldn't change anything). > * The input isn't replicated (i.e. there's only one copy of each file block > on the HDFS). 
> * The problem is probably not related to reading from HDFS. Spark was always > able to correctly read all input records (which was shown in the UI), and > that number got malformed after the exchange phase: > ** correct execution: > Input Size / Records: 3.9 GB / 24014227 _(first stage)_ > Shuffle Write: 3.3 GB / 24014227 _(first stage)_ > Shuffle Read: 3.3 GB / 24014227 _(second stage)_ > ** incorrect execution: > Input Size / Records: 3.9 GB / 24014227 _(first stage)_ > Shuffle Write: 3.3 GB / 24014227 _(first stage)_ > Shuffle Read: 3.3 GB / 24020150 _(second stage)_ > * The problem might be related with the internal way of Encoders hashing. > The reason might be: > ** in a simple `distinct.count` invocation, there are in total three > hash-related stages (called `HashAggregate`), > ** excerpt from scaladoc for `distinct` method says: > {code:java} >* @note Equality checking is performed directly on the encoded > representation of the data >* and thus is not affected by a custom `equals` function defined on > `T`.{code} > * One of my suspicions was the number of partitions we're using (2154).
[jira] [Resolved] (SPARK-25059) Exception while executing an action on DataFrame that read Json
[ https://issues.apache.org/jira/browse/SPARK-25059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-25059. -- Resolution: Incomplete > Exception while executing an action on DataFrame that read Json > --- > > Key: SPARK-25059 > URL: https://issues.apache.org/jira/browse/SPARK-25059 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.2.0 > Environment: AWS EMR 5.8.0 > Spark 2.2.0 > >Reporter: Kunal Goswami >Priority: Major > Labels: Spark-SQL, bulk-closed > > When I try to read ~9600 Json files using > {noformat} > val test = spark.read.option("header", true).option("inferSchema", > true).json(paths: _*) {noformat} > > Any action on the above created data frame results in: > {noformat} > Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method > "apply2_1$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V" > of class "org.apache.spark.sql.catalyst.expressions.Generat[73/1850] > pecificUnsafeProjection" grows beyond 64 KB > at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:949) > at org.codehaus.janino.CodeContext.write(CodeContext.java:839) > at org.codehaus.janino.UnitCompiler.writeOpcode(UnitCompiler.java:11081) > at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4546) > at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774) > at > org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328) > at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762) > at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180) > at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151) > at > org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112) > at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370) > at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) > at > org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:1436) > at org.codehaus.janino.UnitCompiler.access$1600(UnitCompiler.java:206) > at org.codehaus.janino.UnitCompiler$6.visitBlock(UnitCompiler.java:1376) > at org.codehaus.janino.UnitCompiler$6.visitBlock(UnitCompiler.java:1370) > at org.codehaus.janino.Java$Block.accept(Java.java:2471) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2220) > at org.codehaus.janino.UnitCompiler.access$1800(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1378) > at > 
org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1370) > at org.codehaus.janino.Java$IfStatement.accept(Java.java:2621) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) > at > org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:1436) > at org.codehaus.janino.UnitCompiler.access$1600(UnitCompiler.java:206) > at org.codehaus.janino.UnitCompiler$6.visitBlock(UnitCompiler.java:1376) > at org.codehaus.janino.UnitCompiler$6.visitBlock(UnitCompiler.java:1370) > at org.codehaus.janino.Java$Block.accept(Java.java:2471) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2220) > at org.codehaus.janino.UnitCompiler.access$1800(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1378) > at > org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1370) > at org.codehaus.janino.Java$IfStatement.accept(Java.java:2621) > at
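A mitigation that often helps with this class of failure, offered as an assumption rather than anything from the ticket, is to disable whole-stage code generation so the work over the very wide inferred schema is not compiled into one oversized method; it trades some per-row performance and may not cover every generated projection.

{code:scala}
// Mitigation sketch, not from the ticket: turn off whole-stage codegen before reading;
// `paths` is the same sequence of JSON paths as in the report.
spark.conf.set("spark.sql.codegen.wholeStage", "false")
val test = spark.read.option("header", true).option("inferSchema", true).json(paths: _*)
test.count()
{code}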
[jira] [Resolved] (SPARK-22105) Dataframe has poor performance when computing on many columns with codegen
[ https://issues.apache.org/jira/browse/SPARK-22105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-22105. -- Resolution: Incomplete > Dataframe has poor performance when computing on many columns with codegen > -- > > Key: SPARK-22105 > URL: https://issues.apache.org/jira/browse/SPARK-22105 > Project: Spark > Issue Type: Improvement > Components: ML, SQL >Affects Versions: 2.3.0 >Reporter: Weichen Xu >Priority: Minor > Labels: bulk-closed > > Suppose we have a dataframe with many columns (e.g. 100 columns), each column > of DoubleType, > and we need to compute avg on each column. We will find that dataframe avg > is much slower than RDD.aggregate. > I observed this issue in this PR: (One pass imputer) > https://github.com/apache/spark/pull/18902 > I also wrote minimal testing code to reproduce the issue, using a `sum` computation: > https://github.com/apache/spark/compare/master...WeichenXu123:aggr_test2?expand=1 > When we compute `sum` on 100 `DoubleType` columns, dataframe avg will be > about 3x slower than `RDD.aggregate`, but if we only compute one column, > dataframe avg will be much faster than `RDD.aggregate`. > The cause of this issue should be a defect in dataframe codegen. Codegen > will inline everything and generate a large code block. When the column number > is large (e.g. 100 columns), the generated code will be too large, which causes the > JVM to fail to JIT-compile it and fall back to bytecode interpretation. > This PR should address the issue: > https://github.com/apache/spark/pull/19082 > But we need more performance tests against some ML code after the above PR is > merged, to check whether this issue is actually fixed. > This JIRA is used to track this performance issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
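For readers without access to the linked branches, the shape of the slow case is simply a wide aggregate over many DoubleType columns. A minimal sketch (column count and row count are arbitrary):

{code:scala}
import org.apache.spark.sql.functions.avg

// Minimal sketch of the slow case: avg over 100 DoubleType columns. With codegen
// inlining everything, the generated aggregate code grows with the column count.
val wide = spark.range(1000000).selectExpr((1 to 100).map(i => s"rand() as c$i"): _*)
val avgs = wide.agg(avg("c1"), (2 to 100).map(i => avg(s"c$i")): _*).head()
{code}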
[jira] [Resolved] (SPARK-22823) Race Condition when reading Broadcast shuffle input. Failed to get broadcast piece
[ https://issues.apache.org/jira/browse/SPARK-22823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-22823. -- Resolution: Incomplete > Race Condition when reading Broadcast shuffle input. Failed to get broadcast > piece > -- > > Key: SPARK-22823 > URL: https://issues.apache.org/jira/browse/SPARK-22823 > Project: Spark > Issue Type: Bug > Components: Block Manager, Spark Core >Affects Versions: 2.0.1, 2.2.1, 2.3.0 >Reporter: Dmitrii Bundin >Priority: Major > Labels: bulk-closed > > It seems we have a race condition when trying to read shuffle input which is > a broadcast, not direct. To read broadcast MapStatuses at > {code:java} > org.apache.spark.shuffle.BlockStoreShuffleReader::read() > {code} > we submit a message of the type GetMapOutputStatuses(shuffleId) to be > executed in MapOutputTrackerMaster's pool which in turn ends up in creating a > new broadcast at > {code:java} > org.apache.spark.MapOutputTracker::serializeMapStatuses > {code} > if the received statuses bytes more than minBroadcastSize. > So registering the newly created broadcast in the driver's > BlockManagerMasterEndpoint may appear later than some executor asks for the > broadcast piece location from the driver. > In out project we get the following exception on the regular basis: > {code:java} > java.io.IOException: org.apache.spark.SparkException: Failed to get > broadcast_176_piece0 of broadcast_176 > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1280) > at > org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:174) > at > org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:65) > at > org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:65) > at > org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:89) > at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70) > at > org.apache.spark.MapOutputTracker$$anonfun$deserializeMapStatuses$1.apply(MapOutputTracker.scala:661) > at > org.apache.spark.MapOutputTracker$$anonfun$deserializeMapStatuses$1.apply(MapOutputTracker.scala:661) > at org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54) > at > org.apache.spark.MapOutputTracker$.logInfo(MapOutputTracker.scala:598) > at > org.apache.spark.MapOutputTracker$.deserializeMapStatuses(MapOutputTracker.scala:660) > at > org.apache.spark.MapOutputTracker.getStatuses(MapOutputTracker.scala:203) > at > org.apache.spark.MapOutputTracker.getMapSizesByExecutorId(MapOutputTracker.scala:142) > at > org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:49) > at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:109) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.spark.SparkException: Failed to get > broadcast_176_piece0 of broadcast_176 > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:146) > at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:125) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:125) > at scala.collection.immutable.List.foreach(List.scala:381) > at > org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:125) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:186) > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1273) > {code} > This exception is appeared when we try to read a broadcast piece. To do this > we need to fetch the broadcast piece location from the driver > {code:java} > org.apache.spark.storage.BlockManagerMaster::getLocations(blockId: BlockId) > {code} > . The
[jira] [Resolved] (SPARK-22911) Migrate structured streaming sources to new DataSourceV2 APIs
[ https://issues.apache.org/jira/browse/SPARK-22911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-22911. -- Resolution: Incomplete > Migrate structured streaming sources to new DataSourceV2 APIs > - > > Key: SPARK-22911 > URL: https://issues.apache.org/jira/browse/SPARK-22911 > Project: Spark > Issue Type: Umbrella > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Jose Torres >Priority: Major > Labels: bulk-closed > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24608) report number of iteration/progress for ML training
[ https://issues.apache.org/jira/browse/SPARK-24608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24608. -- Resolution: Incomplete > report number of iteration/progress for ML training > --- > > Key: SPARK-24608 > URL: https://issues.apache.org/jira/browse/SPARK-24608 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.3.1 >Reporter: R >Priority: Major > Labels: bulk-closed > > Debugging big ML models requires careful control of resources (memory, > storage, CPU, progress, etc). Current ML training reports no progress. It > would be ideal to be more verbose during training -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24640) size(null) returns null
[ https://issues.apache.org/jira/browse/SPARK-24640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24640. -- Resolution: Incomplete > size(null) returns null > > > Key: SPARK-24640 > URL: https://issues.apache.org/jira/browse/SPARK-24640 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Xiao Li >Priority: Major > Labels: api, bulk-closed > > Size(null) should return null instead of -1 in 3.0 release. This is a > behavior change. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
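For reference, the behavior being changed can be seen directly in SQL. On Spark 2.4 the legacy flag below controls the result; this illustrates the current state, not the 3.0 change itself.

{code:scala}
// With the 2.4 default (legacy behavior on), size(NULL) returns -1.
spark.sql("SELECT size(NULL)").show()
// Turning the legacy flag off gives the null result this ticket proposes as the 3.0 default.
spark.conf.set("spark.sql.legacy.sizeOfNull", "false")
spark.sql("SELECT size(NULL)").show()
{code}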
[jira] [Resolved] (SPARK-24095) Spark Streaming performance drastically drops when saving dataframes with withColumn
[ https://issues.apache.org/jira/browse/SPARK-24095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24095. -- Resolution: Incomplete > Spark Streaming performance drastically drops when when saving dataframes > with withColumn > - > > Key: SPARK-24095 > URL: https://issues.apache.org/jira/browse/SPARK-24095 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: brian wang >Priority: Major > Labels: bulk-closed > > We have a Spark Streaming application which is streaming data from Kafka and > ingesting the data in HDFS after a series of transformations. We are using > Spark SQL to do the transformations and storing the data into HDFS at two > stages. The ingestion to Spark which we do at the second stage is drastically > reducing the performance of the application. > There are close to 40 Million transactions per hour in the incoming data. WE > have observed a performance bottleneck in the write to hdfs. > Can you please help us optimize the application performance? > This is a critical issue since it is holding our deployment to production > cluster and we are running behind the schedule in production deployment. > > Answer: First Stage Save > test_Transformed_DOW.cache().withColumn("test_class_map", udf(test_class_map, > StringType())(array(test_class))).write.mode("append").option("header","true").csv("/hive/warehouse/test") > Second Stage Save > test_Data_Final=spark.sql("select test1,test2,test3.. when int(seats)>=2 > then 1 when int(seats) < 2 then 0 end as seats from > test_Data_Unpivoted").write.format("parquet").mode("append").saveAsTable("test_Data_Output") > It is the first save stage which is slowing our spark application's > performance if we enable it. If we disable it, the application seems to catch > up with the incoming data flow. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23563) make the size of the cache in CodeGenerator configurable
[ https://issues.apache.org/jira/browse/SPARK-23563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23563. -- Resolution: Incomplete > make the size of the cache in CodeGenerator configurable > -- > > Key: SPARK-23563 > URL: https://issues.apache.org/jira/browse/SPARK-23563 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: kejiqing >Priority: Minor > Labels: bulk-closed > > The cache in class > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator has a > hard-coded maximumSize of 100; the current code is: > > {code:java} > // scala > private val cache = CacheBuilder.newBuilder() > .maximumSize(100) > .build( > new CacheLoader[CodeAndComment, (GeneratedClass, Int)]() { > override def load(code: CodeAndComment): (GeneratedClass, Int) = { > val startTime = System.nanoTime() > val result = doCompile(code) > val endTime = System.nanoTime() > def timeMs: Double = (endTime - startTime).toDouble / 100 > CodegenMetrics.METRIC_SOURCE_CODE_SIZE.update(code.body.length) > CodegenMetrics.METRIC_COMPILATION_TIME.update(timeMs.toLong) > logInfo(s"Code generated in $timeMs ms") > result > } > }) > {code} > In some specific situations, for example a long-running application whose Spark tasks > are unchanged, making the cache's maximumSize configurable would be a better idea. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24342) Large Task prior scheduling to Reduce overall execution time
[ https://issues.apache.org/jira/browse/SPARK-24342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24342. -- Resolution: Incomplete > Large Task prior scheduling to Reduce overall execution time > > > Key: SPARK-24342 > URL: https://issues.apache.org/jira/browse/SPARK-24342 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: gao >Priority: Major > Labels: bulk-closed > Attachments: tasktimespan.PNG > > > When executing a set of concurrent tasks, scheduling the relatively large > (long-running) tasks first and the small tasks afterwards can shorten the overall > execution time. > Therefore, Spark needs a new capability that schedules the large tasks of a > task set first and the small tasks after them. > The time span of a task can be estimated from its historical > execution time: a task that ran long in the past is likely to run long again in the > future. Record the last execution time of each task together with the task's key in a > log file, and load that log file at the next execution. Using the > RangePartitioner and its partitioning methods to prioritize large tasks > can then achieve this concurrent task optimization. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24457) Performance improvement while converting stringToTimestamp in DateTimeUtils
[ https://issues.apache.org/jira/browse/SPARK-24457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24457. -- Resolution: Incomplete > Performance improvement while converting stringToTimestamp in DateTimeUtils > --- > > Key: SPARK-24457 > URL: https://issues.apache.org/jira/browse/SPARK-24457 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Sharad Sonker >Priority: Minor > Labels: bulk-closed > > stringToTimestamp in DateTimeUtils creates a Calendar instance for each input > row even if the input timezone is the same. This can be improved by caching the > Calendar instance for each input timezone. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
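A minimal sketch of the caching idea, not the actual DateTimeUtils change: reuse one Calendar per timezone id instead of allocating a new instance per row. Calendar is mutable and not thread-safe, so the cache below is per-thread, and real code would also need to reset the reused instance's fields between rows.
{code:java}
import java.util.{Calendar, TimeZone}
import scala.collection.mutable

object CalendarCacheSketch {
  // One small cache per thread, keyed by timezone id.
  private val cache = new ThreadLocal[mutable.Map[String, Calendar]] {
    override def initialValue(): mutable.Map[String, Calendar] = mutable.Map.empty
  }

  def calendarFor(timeZoneId: String): Calendar =
    cache.get().getOrElseUpdate(
      timeZoneId,
      Calendar.getInstance(TimeZone.getTimeZone(timeZoneId)))

  def main(args: Array[String]): Unit = {
    val c1 = calendarFor("UTC")
    val c2 = calendarFor("UTC")
    println(c1 eq c2) // true: the second lookup reuses the cached instance
  }
}
{code}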
[jira] [Resolved] (SPARK-20074) Make buffer size in unsafe external sorter configurable
[ https://issues.apache.org/jira/browse/SPARK-20074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-20074. -- Resolution: Incomplete > Make buffer size in unsafe external sorter configurable > --- > > Key: SPARK-20074 > URL: https://issues.apache.org/jira/browse/SPARK-20074 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.1 >Reporter: Sital Kedia >Priority: Major > Labels: bulk-closed > > Currently, it is hardcoded to 32kb, see - > https://github.com/sitalkedia/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java#L123 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
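A minimal sketch of making such a constant configurable through SparkConf; the key name below is hypothetical, not an existing Spark setting.
{code:java}
import org.apache.spark.SparkConf

object ConfigurableBufferSizeSketch {
  // Read the buffer size from configuration, falling back to the current 32 KB default.
  def bufferSizeBytes(conf: SparkConf): Long =
    conf.getSizeAsBytes("spark.unsafe.sorter.file.buffer.size", "32k")

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().set("spark.unsafe.sorter.file.buffer.size", "64k")
    println(bufferSizeBytes(conf)) // 65536
  }
}
{code}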
[jira] [Resolved] (SPARK-25480) Dynamic partitioning + saveAsTable with multiple partition columns creates an empty directory
[ https://issues.apache.org/jira/browse/SPARK-25480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-25480. -- Resolution: Incomplete > Dynamic partitioning + saveAsTable with multiple partition columns creates > an empty directory > - > > Key: SPARK-25480 > URL: https://issues.apache.org/jira/browse/SPARK-25480 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Daniel Mateus Pires >Priority: Minor > Labels: bulk-closed > Attachments: dynamic_partitioning.json > > > We use .saveAsTable and dynamic partitioning as our only way to write data to > S3 from Spark. > When only 1 partition column is defined for a table, .saveAsTable behaves as > expected: > - with Overwrite mode it will create a table if it doesn't exist and write > the data > - with Append mode it will append to a given partition > - with Overwrite mode, if the table exists, it will overwrite the partition > If 2 partition columns are used, however, the directory is created on S3 with > the SUCCESS file, but no data is actually written. > Our solution is to check if the table doesn't exist, and in that case, set > the partitioning mode back to static before running saveAsTable: > {code} > spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic") > df.write.mode("overwrite").partitionBy("year", "month").option("path", > "s3://hbc-data-warehouse/integration/users_test").saveAsTable("users_test") > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
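A Scala sketch of the workaround the reporter describes: check whether the table already exists and, if it does not, switch partitionOverwriteMode back to static for that first write. The table name, path, and partition columns are placeholders.
{code:java}
import org.apache.spark.sql.{DataFrame, SparkSession}

object DynamicPartitionWorkaroundSketch {
  def writePartitioned(spark: SparkSession, df: DataFrame, table: String, path: String): Unit = {
    // Dynamic overwrite only helps once the table (and its partitions) already exist.
    val mode = if (spark.catalog.tableExists(table)) "dynamic" else "static"
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", mode)

    df.write
      .mode("overwrite")
      .partitionBy("year", "month") // placeholder partition columns
      .option("path", path)
      .saveAsTable(table)
  }
}
{code}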
[jira] [Resolved] (SPARK-23994) Add Host To Blacklist If Shuffle Cannot Complete
[ https://issues.apache.org/jira/browse/SPARK-23994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23994. -- Resolution: Incomplete > Add Host To Blacklist If Shuffle Cannot Complete > > > Key: SPARK-23994 > URL: https://issues.apache.org/jira/browse/SPARK-23994 > Project: Spark > Issue Type: Improvement > Components: Block Manager, Shuffle >Affects Versions: 2.3.0 >Reporter: David Mollitor >Priority: Major > Labels: bulk-closed > > If a node cannot be reached for shuffling data, add the node to the blacklist > and retry the current stage. > {code:java} > 2018-04-10 20:25:55,065 ERROR [Block Fetch Retry-3] > shuffle.RetryingBlockFetcher > (RetryingBlockFetcher.java:fetchAllOutstanding(142)) - Exception while > beginning fetch of 711 outstanding blocks (after 3 retries) > java.io.IOException: Failed to connect to host.local/10.11.12.13:7337 > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216) > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167) > at > org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:105) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.net.ConnectException: Connection refused: > host.local/10.11.12.13:7337 > at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) > at > sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) > at > io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224) > at > io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > ... 1 more > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24474) Cores are left idle when there are a lot of tasks to run
[ https://issues.apache.org/jira/browse/SPARK-24474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24474. -- Resolution: Incomplete > Cores are left idle when there are a lot of tasks to run > > > Key: SPARK-24474 > URL: https://issues.apache.org/jira/browse/SPARK-24474 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.2.0 >Reporter: Al M >Priority: Major > Labels: bulk-closed > > I've observed an issue happening consistently when: > * A job contains a join of two datasets > * One dataset is much larger than the other > * Both datasets require some processing before they are joined > What I have observed is: > * 2 stages are initially active to run processing on the two datasets > ** These stages are run in parallel > ** One stage has significantly more tasks than the other (e.g. one has 30k > tasks and the other has 2k tasks) > ** Spark allocates a similar (though not exactly equal) number of cores to > each stage > * First stage completes (for the smaller dataset) > ** Now there is only one stage running > ** It still has many tasks left (usually > 20k tasks) > ** Around half the cores are idle (e.g. Total Cores = 200, active tasks = > 103) > ** This continues until the second stage completes > * Second stage completes, and third begins (the stage that actually joins > the data) > ** This stage works fine, no cores are idle (e.g. Total Cores = 200, active > tasks = 200) > Other interesting things about this: > * It seems that when we have multiple stages active, and one of them > finishes, it does not actually release any cores to existing stages > * Once all active stages are done, we release all cores to new stages > * I can't reproduce this locally on my machine, only on a cluster with YARN > enabled > * It happens when dynamic allocation is enabled, and when it is disabled > * The stage that hangs (referred to as "Second stage" above) has a lower > 'Stage Id' than the first one that completes > * This happens with spark.shuffle.service.enabled set to true and false -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25125) Spark SQL percentile_approx takes longer than Hive version for large datasets
[ https://issues.apache.org/jira/browse/SPARK-25125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-25125. -- Resolution: Incomplete > Spark SQL percentile_approx takes longer than Hive version for large datasets > - > > Key: SPARK-25125 > URL: https://issues.apache.org/jira/browse/SPARK-25125 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Mir Ali >Priority: Major > Labels: bulk-closed > > The percentile_approx function in Spark SQL takes much longer than the > previous Hive implementation for large data sets (7B rows grouped into 200k > buckets, percentile is computed per bucket). Tested with Spark 2.3.1 vs Spark > 2.1.0. > The code below finishes in around 24 minutes on Spark 2.1.0; on Spark 2.3.1 > it does not finish at all in more than 2 hours. We also tried this with > different accuracy values (5000, 1000, 500); the timing does get better with > smaller datasets on the new version, but the speed difference is > insignificant. > > Infrastructure used: > AWS EMR -> Spark 2.1.0 > vs > AWS EMR -> Spark 2.3.1 > > spark-shell --conf spark.driver.memory=12g --conf spark.executor.memory=10g > --conf spark.sql.shuffle.partitions=2000 --conf > spark.default.parallelism=2000 --num-executors=75 --executor-cores=2 > {code:java} > import org.apache.spark.sql.functions._ > import org.apache.spark.sql.types._ > val df=spark.range(7000000000L).withColumn("some_grouping_id", > round(rand()*200000L).cast(LongType)) > df.createOrReplaceTempView("tab") > val percentile_query = """ select some_grouping_id, percentile_approx(id, > array(0,0.25,0.5,0.75,1)) from tab group by some_grouping_id """ > spark.sql(percentile_query).collect() > {code} > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org