[jira] [Created] (SPARK-12904) Strength reduction for integer/decimal comparisons
Reynold Xin created SPARK-12904: --- Summary: Strength reduction for integer/decimal comparisons Key: SPARK-12904 URL: https://issues.apache.org/jira/browse/SPARK-12904 Project: Spark Issue Type: Bug Components: Optimizer, SQL Reporter: Reynold Xin We can do the following strength reduction for comparisons between an integral column and a decimal literal: 1. int_col > decimal_literal => int_col > floor(decimal_literal) 2. int_col >= decimal_literal => int_col > ceil(decimal_literal) 3. int_col < decimal_literal => int_col < floor(decimal_literal) 4. int_col <= decimal_literal => int_col < ceil(decimal_literal) This is more useful as soon as we start parsing floating point numeric literals as decimals rather than doubles (SPARK-12848). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
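The intended rewrite can be sketched and brute-force checked in plain Python. This is an illustrative standalone sketch, not Spark's optimizer code: `reduce_cmp` is a hypothetical helper, and it keeps the comparison operator while adjusting only the literal (floor for `>`/`<=`, ceil for `>=`/`<`), a pairing chosen so the rewrite stays exact even when the decimal literal is integral; the loop below verifies this exhaustively over a small range.

```python
from decimal import Decimal
from math import ceil, floor

def reduce_cmp(op, d):
    """Rewrite `int_col <op> d` (d a Decimal) to an integer comparison.

    The operator is kept and only the literal is adjusted:
      >  d -> >  floor(d)     >= d -> >= ceil(d)
      <  d -> <  ceil(d)      <= d -> <= floor(d)
    """
    if op == '>':
        return '>', floor(d)
    if op == '>=':
        return '>=', ceil(d)
    if op == '<':
        return '<', ceil(d)
    if op == '<=':
        return '<=', floor(d)
    raise ValueError(f"unsupported operator: {op}")

OPS = {'>': lambda a, b: a > b, '>=': lambda a, b: a >= b,
       '<': lambda a, b: a < b, '<=': lambda a, b: a <= b}

# Brute-force check that the rewrite preserves semantics, including at
# integral literals and negative values.
for d in (Decimal('2.5'), Decimal('-2.5'), Decimal('3'), Decimal('0.1')):
    for op, f in OPS.items():
        new_op, lit = reduce_cmp(op, d)
        for x in range(-10, 11):
            assert f(x, d) == OPS[new_op](x, lit), (op, d, x)
```

The payoff is that a decimal comparison against an integer column becomes a pure integer comparison, which is cheaper and friendlier to further optimizations.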
[jira] [Commented] (SPARK-12904) Strength reduction for integer/decimal comparisons
[ https://issues.apache.org/jira/browse/SPARK-12904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15106436#comment-15106436 ] Reynold Xin commented on SPARK-12904: - [~viirya] maybe you can add this when you have a chance. Thanks! > Strength reduction for integer/decimal comparisons > -- > > Key: SPARK-12904 > URL: https://issues.apache.org/jira/browse/SPARK-12904 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Reporter: Reynold Xin > > We can do the following strength reduction for comparisons between an > integral column and a decimal literal: > 1. int_col > decimal_literal => int_col > floor(decimal_literal) > 2. int_col >= decimal_literal => int_col > ceil(decimal_literal) > 3. int_col < decimal_literal => int_col < floor(decimal_literal) > 4. int_col <= decimal_literal => int_col < ceil(decimal_literal) > This is more useful as soon as we start parsing floating point numeric > literals as decimals rather than doubles (SPARK-12848). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12905) PCAModel return eigenvalues for PySpark
[ https://issues.apache.org/jira/browse/SPARK-12905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-12905: Priority: Minor (was: Trivial) > PCAModel return eigenvalues for PySpark > --- > > Key: SPARK-12905 > URL: https://issues.apache.org/jira/browse/SPARK-12905 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Yanbo Liang >Priority: Minor > > PCAModel return eigenvalues for PySpark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12905) PCAModel return eigenvalues for PySpark
[ https://issues.apache.org/jira/browse/SPARK-12905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-12905: Priority: Trivial (was: Minor) > PCAModel return eigenvalues for PySpark > --- > > Key: SPARK-12905 > URL: https://issues.apache.org/jira/browse/SPARK-12905 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Yanbo Liang >Priority: Trivial > > PCAModel return eigenvalues for PySpark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12905) PCAModel return eigenvalues for PySpark
Yanbo Liang created SPARK-12905: --- Summary: PCAModel return eigenvalues for PySpark Key: SPARK-12905 URL: https://issues.apache.org/jira/browse/SPARK-12905 Project: Spark Issue Type: Improvement Components: ML, PySpark Reporter: Yanbo Liang Priority: Minor PCAModel return eigenvalues for PySpark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12905) PCAModel return eigenvalues for PySpark
[ https://issues.apache.org/jira/browse/SPARK-12905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12905: Assignee: (was: Apache Spark) > PCAModel return eigenvalues for PySpark > --- > > Key: SPARK-12905 > URL: https://issues.apache.org/jira/browse/SPARK-12905 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Yanbo Liang >Priority: Minor > > PCAModel return eigenvalues for PySpark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12905) PCAModel return eigenvalues for PySpark
[ https://issues.apache.org/jira/browse/SPARK-12905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15106444#comment-15106444 ] Apache Spark commented on SPARK-12905: -- User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/10830 > PCAModel return eigenvalues for PySpark > --- > > Key: SPARK-12905 > URL: https://issues.apache.org/jira/browse/SPARK-12905 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Yanbo Liang >Priority: Minor > > PCAModel return eigenvalues for PySpark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12905) PCAModel return eigenvalues for PySpark
[ https://issues.apache.org/jira/browse/SPARK-12905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12905: Assignee: Apache Spark > PCAModel return eigenvalues for PySpark > --- > > Key: SPARK-12905 > URL: https://issues.apache.org/jira/browse/SPARK-12905 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Yanbo Liang >Assignee: Apache Spark >Priority: Minor > > PCAModel return eigenvalues for PySpark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12897) spark ui with tachyon show all Stream Blocks
[ https://issues.apache.org/jira/browse/SPARK-12897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15106458#comment-15106458 ] Sean Owen commented on SPARK-12897: --- It's not clear what you're describing. Can you say more? I am also not sure what the future of Tachyon is in Spark itself; it may live outside the project. > spark ui with tachyon show all Stream Blocks > > > Key: SPARK-12897 > URL: https://issues.apache.org/jira/browse/SPARK-12897 > Project: Spark > Issue Type: Bug > Components: Block Manager, Web UI >Affects Versions: 1.6.0 > Environment: "spark.externalBlockStore.url", "tachyon://l-xxx:19998" > "spark.externalBlockStore.blockManager", > "org.apache.spark.storage.TachyonBlockManager" > StorageLevel.OFF_HEAP >Reporter: astralidea > > When I use Tachyon and click Storage in the Spark application UI, > with the job already running for 24 hours, > the count of Stream Blocks keeps growing, and the page loads and shows > all blocks even though some blocks are never used, so it loads more and more slowly. > I think Tachyon should use the traditional way: if a block is never used, its > size is 0.0B, so it does not need to be shown in the storage page, just as in the traditional way. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12876) Race condition when driver rapidly shutdown after started.
[ https://issues.apache.org/jira/browse/SPARK-12876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-12876. --- Resolution: Duplicate > Race condition when driver rapidly shutdown after started. > -- > > Key: SPARK-12876 > URL: https://issues.apache.org/jira/browse/SPARK-12876 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.0 >Reporter: jeffonia Tung >Priority: Minor > > It's a little like the issue SPARK-4300. Well, this time, it happens on > the driver occasionally. > [INFO 2016-01-18 17:12:35 (Logging.scala:59)] Asked to launch driver > driver-20160118171237-0009 > [INFO 2016-01-18 17:12:35 (Logging.scala:59)] Copying user jar > file:/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/mylib/spark-ly-streaming-v2-201601141018.jar > to /data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/work/driver-20160118171237-0009/spark-ly-streaming-v2-201601141018.jar > [INFO 2016-01-18 17:12:35 (Logging.scala:59)] Copying > /data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/mylib/spark-ly-streaming-v2-201601141018.jar > to /data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/work/driver-20160118171237-0009/spark-ly-streaming-v2-201601141018.jar > [INFO 2016-01-18 17:12:35 (Logging.scala:59)] Launch Command: > "/data/dbcenter/jdk1.7.0_79/bin/java" "-cp" > ."org.apache.spark.deploy.worker.DriverWrapper".. > [INFO 2016-01-18 17:12:39 (Logging.scala:59)] Asked to launch executor > app-20160118171240-0256/15 for DirectKafkaStreamingV2 > [INFO 2016-01-18 17:12:39 (Logging.scala:59)] Launch command: > "/data/dbcenter/jdk1.7.0_79/bin/java" "-cp" > ."org.apache.spark.executor.CoarseGrainedExecutorBackend"..
> [INFO 2016-01-18 17:12:49 (Logging.scala:59)] Asked to kill driver > driver-20160118164724-0008 > [INFO 2016-01-18 17:12:49 (Logging.scala:59)] Redirection to > /data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/work/driver-20160118164724-0008/stdout > closed: Stream closed > [INFO 2016-01-18 17:12:49 (Logging.scala:59)] Asked to kill executor > app-20160118164728-0250/15 > [INFO 2016-01-18 17:12:49 (Logging.scala:59)] Runner thread for executor > app-20160118164728-0250/15 interrupted > [INFO 2016-01-18 17:12:49 (Logging.scala:59)] Killing process! > [ERROR 2016-01-18 17:12:49 (Logging.scala:96)] Error writing stream to file > /data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/work/app-20160118164728-0250/15/stdout > java.io.IOException: Stream closed > at > java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:162) > at java.io.BufferedInputStream.read1(BufferedInputStream.java:272) > at java.io.BufferedInputStream.read(BufferedInputStream.java:334) > at java.io.FilterInputStream.read(FilterInputStream.java:107) > at > org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:70) > at > org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39) > at > org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39) > at > org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39) > at > org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1772) > at > org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38) > [INFO 2016-01-18 17:12:49 (Logging.scala:59)] Executor > app-20160118164728-0250/15 finished with state KILLED exitStatus 143 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12848) Parse number as decimal rather than doubles
[ https://issues.apache.org/jira/browse/SPARK-12848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-12848: Summary: Parse number as decimal rather than doubles (was: Parse number as decimal) > Parse number as decimal rather than doubles > --- > > Key: SPARK-12848 > URL: https://issues.apache.org/jira/browse/SPARK-12848 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Davies Liu > > Right now, the Hive parser parses 1.23 as a double; when it's used with decimal > columns, the decimal will be turned into a double, losing precision. > We should follow what most databases do: parse 1.23 as a decimal; it will be > converted into a double when used with a double. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12848) Parse numbers as decimals rather than doubles
[ https://issues.apache.org/jira/browse/SPARK-12848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-12848: Summary: Parse numbers as decimals rather than doubles (was: Parse number as decimal rather than doubles) > Parse numbers as decimals rather than doubles > - > > Key: SPARK-12848 > URL: https://issues.apache.org/jira/browse/SPARK-12848 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Davies Liu > > Right now, the Hive parser parses 1.23 as a double; when it's used with decimal > columns, the decimal will be turned into a double, losing precision. > We should follow what most databases do: parse 1.23 as a decimal; it will be > converted into a double when used with a double. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
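The precision loss described above can be demonstrated in plain Python (illustrative only, not Spark code): the literal "1.23" has no exact binary floating-point representation, so parsing it as a double yields a slightly different number, while a decimal keeps it exact.

```python
from decimal import Decimal

# Parsing "1.23" as a double drifts away from the written value;
# parsing it as a decimal keeps it exact.
as_double = float("1.23")
as_decimal = Decimal("1.23")

print(f"{as_double:.20f}")   # shows the drift away from 1.23
print(as_decimal)            # exactly 1.23

# The double is genuinely a different number than the decimal literal.
assert Decimal(as_double) != as_decimal
```

This is why comparing a double-parsed literal against a decimal column can silently lose precision, and why parsing numeric literals as decimals (converting to double only when actually used with doubles) is safer.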
[jira] [Updated] (SPARK-12904) Strength reduction for integer/decimal comparisons
[ https://issues.apache.org/jira/browse/SPARK-12904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-12904: Description: We can do the following strength reduction for comparisons between an integral column and a decimal literal: 1. int_col > decimal_literal => int_col > floor(decimal_literal) 2. int_col >= decimal_literal => int_col > ceil(decimal_literal) 3. int_col < decimal_literal => int_col < floor(decimal_literal) 4. int_col <= decimal_literal => int_col < ceil(decimal_literal) This is more useful as soon as we start parsing floating point numeric literals as decimals rather than doubles (SPARK-12848). was: We can do the following strength reduction for comparisons between an integral column and a decimal literal: 1. int_col > decimal_literal => int_col > floor(decimal_literal) 2. int_col >= decimal_literal => int_col > ceil(decimal_literal) 3. int_col < decimal_literal => int_col < floor(decimal_literal) 4. int_col <= decimal_literal => int_col < ceil(decimal_literal) This is useful more as soon as we start parsing floating point numeric literals as decimals rather than doubles (SPARK-12848). > Strength reduction for integer/decimal comparisons > -- > > Key: SPARK-12904 > URL: https://issues.apache.org/jira/browse/SPARK-12904 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Reporter: Reynold Xin > > We can do the following strength reduction for comparisons between an > integral column and a decimal literal: > 1. int_col > decimal_literal => int_col > floor(decimal_literal) > 2. int_col >= decimal_literal => int_col > ceil(decimal_literal) > 3. int_col < decimal_literal => int_col < floor(decimal_literal) > 4. int_col <= decimal_literal => int_col < ceil(decimal_literal) > This is more useful as soon as we start parsing floating point numeric > literals as decimals rather than doubles (SPARK-12848). 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12892) Support plugging in Spark scheduler
[ https://issues.apache.org/jira/browse/SPARK-12892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15106471#comment-15106471 ] Sean Owen commented on SPARK-12892: --- Do you intend some of the same stuff described in https://issues.apache.org/jira/browse/SPARK-3561 ? This was rejected a while ago. > Support plugging in Spark scheduler > > > Key: SPARK-12892 > URL: https://issues.apache.org/jira/browse/SPARK-12892 > Project: Spark > Issue Type: Improvement >Reporter: Timothy Chen > > Currently the only supported cluster schedulers are standalone, Mesos, Yarn > and Simr. However, if users would like to build a new one, it must be merged back > into main, which might not be desirable for Spark and is hard to iterate on. > Instead, we should make a plugin architecture possible so that when users > want to integrate a new scheduler, it can be plugged in via configuration and > runtime loading instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12906) LongSQLMetricValue cause memory leak on Spark 1.5.1
[ https://issues.apache.org/jira/browse/SPARK-12906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15106548#comment-15106548 ] Sasi commented on SPARK-12906: -- Hi, I'm running an endless script that queries my table using Spark SQL. I did a heap dump using the jmap tool before and after the test. I also ran a GC on the environment, and I still saw that the size of LongSQLMetricValue didn't shrink; it only increased. I suspect the count method, because I'm running dataFrame.distinct().count(). Sasi > LongSQLMetricValue cause memory leak on Spark 1.5.1 > --- > > Key: SPARK-12906 > URL: https://issues.apache.org/jira/browse/SPARK-12906 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.1 >Reporter: Sasi > Attachments: screenshot-1.png > > > Hi, > I upgraded my Spark from 1.5.0 to 1.5.1 after seeing that the > scala.util.parsing.combinator.Parser$$anon$3 causes a memory leak. > Now, after doing another heap dump, I noticed after 2 hours that > LongSQLMetricValue causes a memory leak. > I didn't see any bug or documentation about it. > Thanks, > Sasi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12903) Add covar_samp and covar_pop for SparkR
[ https://issues.apache.org/jira/browse/SPARK-12903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12903: Assignee: Apache Spark > Add covar_samp and covar_pop for SparkR > --- > > Key: SPARK-12903 > URL: https://issues.apache.org/jira/browse/SPARK-12903 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Yanbo Liang >Assignee: Apache Spark > > Add covar_samp and covar_pop for SparkR -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12903) Add covar_samp and covar_pop for SparkR
[ https://issues.apache.org/jira/browse/SPARK-12903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15106422#comment-15106422 ] Apache Spark commented on SPARK-12903: -- User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/10829 > Add covar_samp and covar_pop for SparkR > --- > > Key: SPARK-12903 > URL: https://issues.apache.org/jira/browse/SPARK-12903 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Yanbo Liang > > Add covar_samp and covar_pop for SparkR -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12903) Add covar_samp and covar_pop for SparkR
[ https://issues.apache.org/jira/browse/SPARK-12903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12903: Assignee: (was: Apache Spark) > Add covar_samp and covar_pop for SparkR > --- > > Key: SPARK-12903 > URL: https://issues.apache.org/jira/browse/SPARK-12903 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Yanbo Liang > > Add covar_samp and covar_pop for SparkR -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12903) Add covar_samp and covar_pop for SparkR
Yanbo Liang created SPARK-12903: --- Summary: Add covar_samp and covar_pop for SparkR Key: SPARK-12903 URL: https://issues.apache.org/jira/browse/SPARK-12903 Project: Spark Issue Type: Improvement Components: SparkR Reporter: Yanbo Liang Add covar_samp and covar_pop for SparkR -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12904) Strength reduction for integer/decimal comparisons
[ https://issues.apache.org/jira/browse/SPARK-12904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15106491#comment-15106491 ] Liang-Chi Hsieh commented on SPARK-12904: - Yeah, I would like to do it. Thanks! > Strength reduction for integer/decimal comparisons > -- > > Key: SPARK-12904 > URL: https://issues.apache.org/jira/browse/SPARK-12904 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Reporter: Reynold Xin > > We can do the following strength reduction for comparisons between an > integral column and a decimal literal: > 1. int_col > decimal_literal => int_col > floor(decimal_literal) > 2. int_col >= decimal_literal => int_col > ceil(decimal_literal) > 3. int_col < decimal_literal => int_col < floor(decimal_literal) > 4. int_col <= decimal_literal => int_col < ceil(decimal_literal) > This is more useful as soon as we start parsing floating point numeric > literals as decimals rather than doubles (SPARK-12848). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7683) Confusing behavior of fold function of RDD in pyspark
[ https://issues.apache.org/jira/browse/SPARK-7683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-7683: Assignee: Sean Owen > Confusing behavior of fold function of RDD in pyspark > - > > Key: SPARK-7683 > URL: https://issues.apache.org/jira/browse/SPARK-7683 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 1.3.1 >Reporter: Ai He >Assignee: Sean Owen >Priority: Minor > Labels: releasenotes > Fix For: 2.0.0 > > > This will make the “fold” function consistent with the "fold" in rdd.scala > and other "aggregate" functions where “acc” goes first. Otherwise, users have > to write a lambda function like “lambda x, y: op(y, x)” if they want to use > “zeroValue” to change the result type. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7683) Confusing behavior of fold function of RDD in pyspark
[ https://issues.apache.org/jira/browse/SPARK-7683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-7683. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10771 [https://github.com/apache/spark/pull/10771] > Confusing behavior of fold function of RDD in pyspark > - > > Key: SPARK-7683 > URL: https://issues.apache.org/jira/browse/SPARK-7683 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 1.3.1 >Reporter: Ai He >Priority: Minor > Labels: releasenotes > Fix For: 2.0.0 > > > This will make the “fold” function consistent with the "fold" in rdd.scala > and other "aggregate" functions where “acc” goes first. Otherwise, users have > to write a lambda function like “lambda x, y: op(y, x)” if they want to use > “zeroValue” to change the result type. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
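The argument-order point in the description above can be illustrated with a plain-Python analogue of fold. This is illustrative only; `fold_acc_first` and `fold_elem_first` are hypothetical helpers, not PySpark APIs.

```python
from functools import reduce

# Accumulator-first fold: the zero value may have a different type than
# the elements, and the operator reads naturally as op(acc, element).
def fold_acc_first(items, zero, op):
    return reduce(op, items, zero)

# zeroValue changes the result type: fold ints into a string.
assert fold_acc_first([1, 2, 3], "", lambda acc, x: acc + str(x)) == "123"

# Element-first fold: the user must swap the arguments by hand,
# i.e. write the `lambda x, y: op(y, x)` workaround from the issue.
def fold_elem_first(items, zero, op):
    return reduce(lambda acc, x: op(x, acc), items, zero)

assert fold_elem_first([1, 2, 3], "", lambda x, acc: acc + str(x)) == "123"
```

With the accumulator first, the Python API matches `fold` in rdd.scala and the `aggregate` functions, so changing the result type via `zeroValue` needs no argument-swapping lambda.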
[jira] [Created] (SPARK-12906) LongSQLMetricValue cause memory leak on Spark 1.5.1
Sasi created SPARK-12906: Summary: LongSQLMetricValue cause memory leak on Spark 1.5.1 Key: SPARK-12906 URL: https://issues.apache.org/jira/browse/SPARK-12906 Project: Spark Issue Type: Bug Affects Versions: 1.5.1 Reporter: Sasi Hi, I upgraded my Spark from 1.5.0 to 1.5.1 after seeing that the scala.util.parsing.combinator.Parser$$anon$3 causes a memory leak. Now, after doing another heap dump, I noticed after 2 hours that LongSQLMetricValue causes a memory leak. I didn't see any bug or documentation about it. Thanks, Sasi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12906) LongSQLMetricValue cause memory leak on Spark 1.5.1
[ https://issues.apache.org/jira/browse/SPARK-12906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sasi updated SPARK-12906: - Attachment: screenshot-1.png > LongSQLMetricValue cause memory leak on Spark 1.5.1 > --- > > Key: SPARK-12906 > URL: https://issues.apache.org/jira/browse/SPARK-12906 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.1 >Reporter: Sasi > Attachments: screenshot-1.png > > > Hi, > I upgraded my Spark from 1.5.0 to 1.5.1 after seeing that the > scala.util.parsing.combinator.Parser$$anon$3 causes a memory leak. > Now, after doing another heap dump, I noticed after 2 hours that > LongSQLMetricValue causes a memory leak. > I didn't see any bug or documentation about it. > Thanks, > Sasi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12906) LongSQLMetricValue cause memory leak on Spark 1.5.1
[ https://issues.apache.org/jira/browse/SPARK-12906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15106498#comment-15106498 ] Sean Owen commented on SPARK-12906: --- Please provide more detail. What leads you to think there's a leak, and where do you suspect the leak is? > LongSQLMetricValue cause memory leak on Spark 1.5.1 > --- > > Key: SPARK-12906 > URL: https://issues.apache.org/jira/browse/SPARK-12906 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.1 >Reporter: Sasi > Attachments: screenshot-1.png > > > Hi, > I upgraded my Spark from 1.5.0 to 1.5.1 after seeing that the > scala.util.parsing.combinator.Parser$$anon$3 causes a memory leak. > Now, after doing another heap dump, I noticed after 2 hours that > LongSQLMetricValue causes a memory leak. > I didn't see any bug or documentation about it. > Thanks, > Sasi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9976) create function do not work
[ https://issues.apache.org/jira/browse/SPARK-9976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15106517#comment-15106517 ] ocean commented on SPARK-9976: -- Regarding the second problem: I just found that the function merely cannot be described; it can still be used. Just a minor problem. > create function do not work > --- > > Key: SPARK-9976 > URL: https://issues.apache.org/jira/browse/SPARK-9976 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0, 1.4.1, 1.5.0 > Environment: spark 1.4.1 yarn 2.2.0 >Reporter: cen yuhai > > I use beeline to connect to ThriftServer, but add jar does not work, so I use > create function, see the link below. > http://www.cloudera.com/content/cloudera/en/documentation/core/v5-3-x/topics/cm_mc_hive_udf.html > I do as below: > {code} > create function gdecodeorder as 'com.hive.udf.GOrderDecode' USING JAR > 'hdfs://mycluster/user/spark/lib/gorderdecode.jar'; > {code} > It returns Ok, and I connect to the metastore, I see records in table FUNCS.
> {code} > select gdecodeorder(t1) from tableX limit 1; > {code} > It returns error 'Couldn't find function default.gdecodeorder' > This is the Exception > {code} > 15/08/14 14:53:51 ERROR UserGroupInformation: PriviledgedActionException > as:xiaoju (auth:SIMPLE) cause:org.apache.hive.service.cli.HiveSQLException: > java.lang.RuntimeException: Couldn't find function default.gdecodeorder > 15/08/14 15:04:47 ERROR RetryingHMSHandler: > MetaException(message:NoSuchObjectException(message:Function > default.t_gdecodeorder does not exist)) > at > org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newMetaException(HiveMetaStore.java:4613) > at > org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_function(HiveMetaStore.java:4740) > at sun.reflect.GeneratedMethodAccessor57.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:105) > at com.sun.proxy.$Proxy21.get_function(Unknown Source) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getFunction(HiveMetaStoreClient.java:1721) > at sun.reflect.GeneratedMethodAccessor56.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89) > at com.sun.proxy.$Proxy22.getFunction(Unknown Source) > at org.apache.hadoop.hive.ql.metadata.Hive.getFunction(Hive.java:2662) > at > org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfoFromMetastore(FunctionRegistry.java:546) > at > org.apache.hadoop.hive.ql.exec.FunctionRegistry.getQualifiedFunctionInfo(FunctionRegistry.java:579) > at > org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfo(FunctionRegistry.java:645) > at > 
org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfo(FunctionRegistry.java:652) > at > org.apache.spark.sql.hive.HiveFunctionRegistry.lookupFunction(hiveUdfs.scala:54) > at > org.apache.spark.sql.hive.HiveContext$$anon$3.org$apache$spark$sql$catalyst$analysis$OverrideFunctionRegistry$$super$lookupFunction(HiveContext.scala:376) > at > org.apache.spark.sql.catalyst.analysis.OverrideFunctionRegistry$$anonfun$lookupFunction$2.apply(FunctionRegistry.scala:44) > at > org.apache.spark.sql.catalyst.analysis.OverrideFunctionRegistry$$anonfun$lookupFunction$2.apply(FunctionRegistry.scala:44) > at scala.Option.getOrElse(Option.scala:120) > at > org.apache.spark.sql.catalyst.analysis.OverrideFunctionRegistry$class.lookupFunction(FunctionRegistry.scala:44) > at > org.apache.spark.sql.hive.HiveContext$$anon$3.lookupFunction(HiveContext.scala:376) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:465) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:463) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:221) > at >
[jira] [Commented] (SPARK-12906) LongSQLMetricValue cause memory leak on Spark 1.5.1
[ https://issues.apache.org/jira/browse/SPARK-12906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15106596#comment-15106596 ] Sasi commented on SPARK-12906: -- added dumps > LongSQLMetricValue cause memory leak on Spark 1.5.1 > --- > > Key: SPARK-12906 > URL: https://issues.apache.org/jira/browse/SPARK-12906 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.1 >Reporter: Sasi > Attachments: screenshot-1.png > > > Hi, > I upgraded my Spark from 1.5.0 to 1.5.1 after seeing that the > scala.util.parsing.combinator.Parser$$anon$3 causes a memory leak. > Now, after doing another heap dump, I noticed after 2 hours that > LongSQLMetricValue causes a memory leak. > I didn't see any bug or documentation about it. > Thanks, > Sasi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12906) LongSQLMetricValue causes a memory leak on Spark 1.5.1
[ https://issues.apache.org/jira/browse/SPARK-12906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sasi updated SPARK-12906: - Attachment: dump1.PNG After GC. > LongSQLMetricValue causes a memory leak on Spark 1.5.1 > --- > > Key: SPARK-12906 > URL: https://issues.apache.org/jira/browse/SPARK-12906 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.1 >Reporter: Sasi > Attachments: dump1.PNG, screenshot-1.png > > > Hi, > I upgraded my Spark from 1.5.0 to 1.5.1 after seeing that > scala.util.parsing.combinator.Parser$$anon$3 causes a memory leak. > Now, after taking another heap dump, I noticed, after 2 hours, that > LongSQLMetricValue causes a memory leak. > I didn't find any bug report or documentation about it. > Thanks, > Sasi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-12906) LongSQLMetricValue causes a memory leak on Spark 1.5.1
[ https://issues.apache.org/jira/browse/SPARK-12906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sasi updated SPARK-12906: - Comment: was deleted (was: added dumps ) > LongSQLMetricValue causes a memory leak on Spark 1.5.1 > --- > > Key: SPARK-12906 > URL: https://issues.apache.org/jira/browse/SPARK-12906 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.1 >Reporter: Sasi > Attachments: dump1.PNG, screenshot-1.png > > > Hi, > I upgraded my Spark from 1.5.0 to 1.5.1 after seeing that > scala.util.parsing.combinator.Parser$$anon$3 causes a memory leak. > Now, after taking another heap dump, I noticed, after 2 hours, that > LongSQLMetricValue causes a memory leak. > I didn't find any bug report or documentation about it. > Thanks, > Sasi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-12906) LongSQLMetricValue causes a memory leak on Spark 1.5.1
[ https://issues.apache.org/jira/browse/SPARK-12906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15106599#comment-15106599 ] Sasi edited comment on SPARK-12906 at 1/19/16 11:14 AM: added dump After GC. Code: while (true) { subscribersDataFrame.distinct().count() } was (Author: sasi2103): added dump After GC. > LongSQLMetricValue causes a memory leak on Spark 1.5.1 > --- > > Key: SPARK-12906 > URL: https://issues.apache.org/jira/browse/SPARK-12906 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.1 >Reporter: Sasi > Attachments: dump1.PNG, screenshot-1.png > > > Hi, > I upgraded my Spark from 1.5.0 to 1.5.1 after seeing that > scala.util.parsing.combinator.Parser$$anon$3 causes a memory leak. > Now, after taking another heap dump, I noticed, after 2 hours, that > LongSQLMetricValue causes a memory leak. > I didn't find any bug report or documentation about it. > Thanks, > Sasi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
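The repro loop above points at a retention pattern: each query execution produces metric objects, and heap usage grows without bound if a listener keeps every one of them. A minimal sketch of that pattern in plain Python (the listener names and shapes here are hypothetical, not Spark's actual SQLListener API):

```python
from collections import deque

class LeakyMetricsListener:
    """Retains every query's metrics forever -- the leak pattern."""
    def __init__(self):
        self.metrics = []              # grows without bound

    def on_query_end(self, metric_values):
        self.metrics.append(metric_values)

class BoundedMetricsListener:
    """Keeps only the most recent N entries, as a UI history would."""
    def __init__(self, retain=100):
        self.metrics = deque(maxlen=retain)

    def on_query_end(self, metric_values):
        self.metrics.append(metric_values)

leaky, bounded = LeakyMetricsListener(), BoundedMetricsListener(retain=100)
for i in range(10_000):                # stand-in for `while (true) { ...count() }`
    leaky.on_query_end({"numOutputRows": i})
    bounded.on_query_end({"numOutputRows": i})

print(len(leaky.metrics), len(bounded.metrics))
```

Under the repro's endless loop, the unbounded list is what a heap dump would show accumulating.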
[jira] [Comment Edited] (SPARK-12906) LongSQLMetricValue causes a memory leak on Spark 1.5.1
[ https://issues.apache.org/jira/browse/SPARK-12906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15106599#comment-15106599 ] Sasi edited comment on SPARK-12906 at 1/19/16 11:12 AM: added dump After GC. was (Author: sasi2103): After GC. > LongSQLMetricValue causes a memory leak on Spark 1.5.1 > --- > > Key: SPARK-12906 > URL: https://issues.apache.org/jira/browse/SPARK-12906 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.1 >Reporter: Sasi > Attachments: dump1.PNG, screenshot-1.png > > > Hi, > I upgraded my Spark from 1.5.0 to 1.5.1 after seeing that > scala.util.parsing.combinator.Parser$$anon$3 causes a memory leak. > Now, after taking another heap dump, I noticed, after 2 hours, that > LongSQLMetricValue causes a memory leak. > I didn't find any bug report or documentation about it. > Thanks, > Sasi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12560) SqlTestUtils.stripSparkFilter needs to copy utf8strings
[ https://issues.apache.org/jira/browse/SPARK-12560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-12560. Resolution: Fixed Assignee: Imran Rashid Resolved by https://github.com/apache/spark/pull/10510 > SqlTestUtils.stripSparkFilter needs to copy utf8strings > --- > > Key: SPARK-12560 > URL: https://issues.apache.org/jira/browse/SPARK-12560 > Project: Spark > Issue Type: Test > Components: SQL >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Minor > > {{SqlTestUtils.stripSparkFilter}} needs to make copies of the UTF8Strings, > e.g., with {{FromUnsafeProjection}}, to avoid returning duplicates of the same > row (see SPARK-9459). > Right now, this isn't causing any problems, since the parquet string > predicate pushdown is turned off (see SPARK-11153). However, I ran into this > while trying to get the predicate pushdown to work with a different version > of parquet. Without this fix, there were errors like: > {noformat} > [info] !== Correct Answer - 4 == == Spark Answer - 4 == > [info] ![1][2] > [info][2][2] > [info] ![3][4] > [info][4][4] (QueryTest.scala:127) > {noformat} > I figure it's worth making this change now, since I ran into it. PR coming > shortly -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
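The underlying pitfall, rows backed by a shared mutable buffer must be copied before being collected, can be sketched outside Spark. The following is a plain-Python analogy, not Spark's UnsafeRow/UTF8String machinery; the generator stands in for a scan that reuses one buffer per row:

```python
def rows_reusing_buffer(values):
    """Yield the same mutable buffer each time, like a row backed by
    shared memory (an analogy, not Spark's actual internals)."""
    buf = bytearray(8)
    for v in values:
        buf[:] = v.encode().ljust(8)   # overwrite in place
        yield buf

# Collecting the references gives N copies of the LAST row...
collected_wrong = [b for b in rows_reusing_buffer(["1", "3"])]
# ...while copying each row first (the stripSparkFilter fix) keeps them distinct.
collected_right = [bytes(b) for b in rows_reusing_buffer(["1", "3"])]

print(collected_wrong[0] == collected_wrong[1])   # duplicates of one row
print(collected_right[0] == collected_right[1])   # two distinct rows
```

This is exactly the "[2][2] ... [4][4]" symptom in the quoted test failure: every collected row aliases the final buffer contents.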
[jira] [Assigned] (SPARK-12910) Support for specifying version of R to use while creating sparkR libraries
[ https://issues.apache.org/jira/browse/SPARK-12910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12910: Assignee: (was: Apache Spark) > Support for specifying version of R to use while creating sparkR libraries > -- > > Key: SPARK-12910 > URL: https://issues.apache.org/jira/browse/SPARK-12910 > Project: Spark > Issue Type: Improvement > Components: SparkR > Environment: Linux >Reporter: Shubhanshu Mishra >Priority: Minor > Labels: installation, sparkR > > When we use `$SPARK_HOME/R/install-dev.sh` it uses the default system R. > However, a user might have locally installed their own version of R. There > should be a way to specify which R version to use. > I have fixed this in my code using the following patch: > ``` > $ git diff HEAD > diff --git a/R/README.md b/R/README.md > index 005f56d..99182e5 100644 > --- a/R/README.md > +++ b/R/README.md > @@ -1,6 +1,15 @@ > # R on Spark > > SparkR is an R package that provides a light-weight frontend to use Spark > from R. > +### Installing sparkR > + > +Libraries of sparkR need to be created in `$SPARK_HOME/R/lib`. This can be > done by running the script `$SPARK_HOME/R/install-dev.sh`. > +By default the above script uses the system wide installation of R. However, > this can be changed to any user installed location of R by giving the full > path of the `$R_HOME` as the first argument to the install-dev.sh script. > +Example: > +``` > +# where /home/username/R is where R is installed and /home/username/R/bin > contains the files R and RScript > +./install-dev.sh /home/username/R > +``` > > ### SparkR development > > diff --git a/R/install-dev.sh b/R/install-dev.sh > index 4972bb9..a8efa86 100755 > --- a/R/install-dev.sh > +++ b/R/install-dev.sh > @@ -35,12 +35,19 @@ LIB_DIR="$FWDIR/lib" > mkdir -p $LIB_DIR > > pushd $FWDIR > /dev/null > +if [ ! 
-z "$1" ] > + then > +R_HOME="$1/bin" > + else > +R_HOME="$(dirname $(which R))" > +fi > +echo "USING R_HOME = $R_HOME" > > # Generate Rd files if devtools is installed > -Rscript -e ' if("devtools" %in% rownames(installed.packages())) { > library(devtools); devtools::document(pkg="./pkg", roclets=c("rd")) }' > +"$R_HOME/"Rscript -e ' if("devtools" %in% rownames(installed.packages())) { > library(devtools); devtools::document(pkg="./pkg", roclets=c("rd")) }' > > # Install SparkR to $LIB_DIR > -R CMD INSTALL --library=$LIB_DIR $FWDIR/pkg/ > +"$R_HOME/"R CMD INSTALL --library=$LIB_DIR $FWDIR/pkg/ > > # Zip the SparkR package so that it can be distributed to worker nodes on > YARN > cd $LIB_DIR > ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
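The if/else in the quoted patch is a common shell idiom: take the tool directory from the first argument if one is given, otherwise fall back to the R found on PATH. A condensed sketch of the same choice (the function name is illustrative, not part of the patch):

```shell
#!/bin/sh
# pick_r_bin_dir: echo "$1/bin" when a custom R home is supplied,
# otherwise the directory of the R binary found on PATH.
pick_r_bin_dir() {
    if [ -n "$1" ]; then
        echo "$1/bin"
    else
        dirname "$(command -v R)"
    fi
}

pick_r_bin_dir /home/username/R   # prints /home/username/R/bin
```

Note the patch names this variable `R_HOME` even though it actually holds the `bin` directory; a name like `R_BIN_DIR` would be less confusing.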
[jira] [Updated] (SPARK-12560) SqlTestUtils.stripSparkFilter needs to copy utf8strings
[ https://issues.apache.org/jira/browse/SPARK-12560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-12560: --- Fix Version/s: 2.0.0 > SqlTestUtils.stripSparkFilter needs to copy utf8strings > --- > > Key: SPARK-12560 > URL: https://issues.apache.org/jira/browse/SPARK-12560 > Project: Spark > Issue Type: Test > Components: SQL >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Minor > Fix For: 2.0.0 > > > {{SqlTestUtils.stripSparkFilter}} needs to make copies of the UTF8Strings, > e.g., with {{FromUnsafeProjection}}, to avoid returning duplicates of the same > row (see SPARK-9459). > Right now, this isn't causing any problems, since the parquet string > predicate pushdown is turned off (see SPARK-11153). However, I ran into this > while trying to get the predicate pushdown to work with a different version > of parquet. Without this fix, there were errors like: > {noformat} > [info] !== Correct Answer - 4 == == Spark Answer - 4 == > [info] ![1][2] > [info][2][2] > [info] ![3][4] > [info][4][4] (QueryTest.scala:127) > {noformat} > I figure it's worth making this change now, since I ran into it. PR coming > shortly -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12910) Support for specifying version of R to use while creating sparkR libraries
[ https://issues.apache.org/jira/browse/SPARK-12910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12910: Assignee: Apache Spark > Support for specifying version of R to use while creating sparkR libraries > -- > > Key: SPARK-12910 > URL: https://issues.apache.org/jira/browse/SPARK-12910 > Project: Spark > Issue Type: Improvement > Components: SparkR > Environment: Linux >Reporter: Shubhanshu Mishra >Assignee: Apache Spark >Priority: Minor > Labels: installation, sparkR > > When we use `$SPARK_HOME/R/install-dev.sh` it uses the default system R. > However, a user might have locally installed their own version of R. There > should be a way to specify which R version to use. > I have fixed this in my code using the following patch: > ``` > $ git diff HEAD > diff --git a/R/README.md b/R/README.md > index 005f56d..99182e5 100644 > --- a/R/README.md > +++ b/R/README.md > @@ -1,6 +1,15 @@ > # R on Spark > > SparkR is an R package that provides a light-weight frontend to use Spark > from R. > +### Installing sparkR > + > +Libraries of sparkR need to be created in `$SPARK_HOME/R/lib`. This can be > done by running the script `$SPARK_HOME/R/install-dev.sh`. > +By default the above script uses the system wide installation of R. However, > this can be changed to any user installed location of R by giving the full > path of the `$R_HOME` as the first argument to the install-dev.sh script. > +Example: > +``` > +# where /home/username/R is where R is installed and /home/username/R/bin > contains the files R and RScript > +./install-dev.sh /home/username/R > +``` > > ### SparkR development > > diff --git a/R/install-dev.sh b/R/install-dev.sh > index 4972bb9..a8efa86 100755 > --- a/R/install-dev.sh > +++ b/R/install-dev.sh > @@ -35,12 +35,19 @@ LIB_DIR="$FWDIR/lib" > mkdir -p $LIB_DIR > > pushd $FWDIR > /dev/null > +if [ ! 
-z "$1" ] > + then > +R_HOME="$1/bin" > + else > +R_HOME="$(dirname $(which R))" > +fi > +echo "USING R_HOME = $R_HOME" > > # Generate Rd files if devtools is installed > -Rscript -e ' if("devtools" %in% rownames(installed.packages())) { > library(devtools); devtools::document(pkg="./pkg", roclets=c("rd")) }' > +"$R_HOME/"Rscript -e ' if("devtools" %in% rownames(installed.packages())) { > library(devtools); devtools::document(pkg="./pkg", roclets=c("rd")) }' > > # Install SparkR to $LIB_DIR > -R CMD INSTALL --library=$LIB_DIR $FWDIR/pkg/ > +"$R_HOME/"R CMD INSTALL --library=$LIB_DIR $FWDIR/pkg/ > > # Zip the SparkR package so that it can be distributed to worker nodes on > YARN > cd $LIB_DIR > ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12910) Support for specifying version of R to use while creating sparkR libraries
[ https://issues.apache.org/jira/browse/SPARK-12910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107370#comment-15107370 ] Shubhanshu Mishra commented on SPARK-12910: --- I have created a pull request at https://github.com/apache/spark/pull/10836 > Support for specifying version of R to use while creating sparkR libraries > -- > > Key: SPARK-12910 > URL: https://issues.apache.org/jira/browse/SPARK-12910 > Project: Spark > Issue Type: Improvement > Components: SparkR > Environment: Linux >Reporter: Shubhanshu Mishra >Priority: Minor > Labels: installation, sparkR > > When we use `$SPARK_HOME/R/install-dev.sh` it uses the default system R. > However, a user might have locally installed their own version of R. There > should be a way to specify which R version to use. > I have fixed this in my code using the following patch: > ``` > $ git diff HEAD > diff --git a/R/README.md b/R/README.md > index 005f56d..99182e5 100644 > --- a/R/README.md > +++ b/R/README.md > @@ -1,6 +1,15 @@ > # R on Spark > > SparkR is an R package that provides a light-weight frontend to use Spark > from R. > +### Installing sparkR > + > +Libraries of sparkR need to be created in `$SPARK_HOME/R/lib`. This can be > done by running the script `$SPARK_HOME/R/install-dev.sh`. > +By default the above script uses the system wide installation of R. However, > this can be changed to any user installed location of R by giving the full > path of the `$R_HOME` as the first argument to the install-dev.sh script. > +Example: > +``` > +# where /home/username/R is where R is installed and /home/username/R/bin > contains the files R and RScript > +./install-dev.sh /home/username/R > +``` > > ### SparkR development > > diff --git a/R/install-dev.sh b/R/install-dev.sh > index 4972bb9..a8efa86 100755 > --- a/R/install-dev.sh > +++ b/R/install-dev.sh > @@ -35,12 +35,19 @@ LIB_DIR="$FWDIR/lib" > mkdir -p $LIB_DIR > > pushd $FWDIR > /dev/null > +if [ ! 
-z "$1" ] > + then > +R_HOME="$1/bin" > + else > +R_HOME="$(dirname $(which R))" > +fi > +echo "USING R_HOME = $R_HOME" > > # Generate Rd files if devtools is installed > -Rscript -e ' if("devtools" %in% rownames(installed.packages())) { > library(devtools); devtools::document(pkg="./pkg", roclets=c("rd")) }' > +"$R_HOME/"Rscript -e ' if("devtools" %in% rownames(installed.packages())) { > library(devtools); devtools::document(pkg="./pkg", roclets=c("rd")) }' > > # Install SparkR to $LIB_DIR > -R CMD INSTALL --library=$LIB_DIR $FWDIR/pkg/ > +"$R_HOME/"R CMD INSTALL --library=$LIB_DIR $FWDIR/pkg/ > > # Zip the SparkR package so that it can be distributed to worker nodes on > YARN > cd $LIB_DIR > ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12869) Optimize conversion from BlockMatrix to IndexedRowMatrix
[ https://issues.apache.org/jira/browse/SPARK-12869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107534#comment-15107534 ] Apache Spark commented on SPARK-12869: -- User 'Fokko' has created a pull request for this issue: https://github.com/apache/spark/pull/10839 > Optimize conversion from BlockMatrix to IndexedRowMatrix > > > Key: SPARK-12869 > URL: https://issues.apache.org/jira/browse/SPARK-12869 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Fokko Driesprong > Original Estimate: 48h > Remaining Estimate: 48h > > In the current implementation of the BlockMatrix, the conversion to the > IndexedRowMatrix is done by converting it to a CoordinateMatrix first. This > is somewhat ok when the matrix is very sparse, but for dense matrices this is > very inefficient. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12869) Optimize conversion from BlockMatrix to IndexedRowMatrix
[ https://issues.apache.org/jira/browse/SPARK-12869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12869: Assignee: (was: Apache Spark) > Optimize conversion from BlockMatrix to IndexedRowMatrix > > > Key: SPARK-12869 > URL: https://issues.apache.org/jira/browse/SPARK-12869 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Fokko Driesprong > Original Estimate: 48h > Remaining Estimate: 48h > > In the current implementation of the BlockMatrix, the conversion to the > IndexedRowMatrix is done by converting it to a CoordinateMatrix first. This > is somewhat ok when the matrix is very sparse, but for dense matrices this is > very inefficient. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12869) Optimize conversion from BlockMatrix to IndexedRowMatrix
[ https://issues.apache.org/jira/browse/SPARK-12869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107533#comment-15107533 ] Fokko Driesprong commented on SPARK-12869: -- Hi guys, I've implemented an improved version of the toIndexedRowMatrix function on the BlockMatrix. I needed this for a project, but would like to share it with the rest of the community. In the case of dense matrices, it can increase performance up to 19 times: https://github.com/Fokko/BlockMatrixToIndexedRowMatrix The pull-request on Github: https://github.com/apache/spark/pull/10839 > Optimize conversion from BlockMatrix to IndexedRowMatrix > > > Key: SPARK-12869 > URL: https://issues.apache.org/jira/browse/SPARK-12869 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Fokko Driesprong > Original Estimate: 48h > Remaining Estimate: 48h > > In the current implementation of the BlockMatrix, the conversion to the > IndexedRowMatrix is done by converting it to a CoordinateMatrix first. This > is somewhat ok when the matrix is very sparse, but for dense matrices this is > very inefficient. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
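The inefficiency the issue describes, routing a dense matrix through per-cell coordinate triplets instead of copying block rows straight into global rows, can be sketched with plain Python containers. This is a toy model of the two conversion paths, not the MLlib implementation (block shapes and names are assumptions):

```python
def via_coordinates(blocks, block_size):
    """Route through (row, col, value) triplets -- one object per cell,
    which is wasteful when the matrix is dense."""
    triplets = [
        (bi * block_size + i, bj * block_size + j, block[i][j])
        for (bi, bj), block in blocks.items()
        for i in range(block_size)
        for j in range(block_size)
    ]
    rows = {}
    for r, c, v in triplets:
        rows.setdefault(r, {})[c] = v
    return rows, len(triplets)

def direct(blocks, block_size):
    """Copy each block row straight into its global row -- no triplets."""
    rows = {}
    for (bi, bj), block in blocks.items():
        for i, block_row in enumerate(block):
            row = rows.setdefault(bi * block_size + i, {})
            for j, v in enumerate(block_row):
                row[bj * block_size + j] = v
    return rows

# A 2x4 matrix stored as two 2x2 blocks.
blocks = {(0, 0): [[1, 2], [3, 4]], (0, 1): [[5, 6], [7, 8]]}
rows_a, n_triplets = via_coordinates(blocks, 2)
rows_b = direct(blocks, 2)
print(rows_a == rows_b, n_triplets)   # same rows either way
```

Both paths produce identical rows, but the coordinate route materialises one intermediate object per cell, which is where the reported dense-matrix overhead comes from.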
[jira] [Assigned] (SPARK-12869) Optimize conversion from BlockMatrix to IndexedRowMatrix
[ https://issues.apache.org/jira/browse/SPARK-12869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12869: Assignee: Apache Spark > Optimize conversion from BlockMatrix to IndexedRowMatrix > > > Key: SPARK-12869 > URL: https://issues.apache.org/jira/browse/SPARK-12869 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Fokko Driesprong >Assignee: Apache Spark > Original Estimate: 48h > Remaining Estimate: 48h > > In the current implementation of the BlockMatrix, the conversion to the > IndexedRowMatrix is done by converting it to a CoordinateMatrix first. This > is somewhat ok when the matrix is very sparse, but for dense matrices this is > very inefficient. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12479) sparkR collect on GroupedData throws R error "missing value where TRUE/FALSE needed"
[ https://issues.apache.org/jira/browse/SPARK-12479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12479: Assignee: (was: Apache Spark) > sparkR collect on GroupedData throws R error "missing value where > TRUE/FALSE needed" > -- > > Key: SPARK-12479 > URL: https://issues.apache.org/jira/browse/SPARK-12479 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.5.1 >Reporter: Paulo Magalhaes > > sparkR collect on GroupedData throws "missing value where TRUE/FALSE needed" > Spark Version: 1.5.1 > R Version: 3.2.2 > I tracked down the root cause of this exception to a specific key for which > the hashCode could not be calculated. > The following code recreates the problem when run in SparkR: > hashCode <- getFromNamespace("hashCode","SparkR") > hashCode("bc53d3605e8a5b7de1e8e271c2317645") > Error in if (value > .Machine$integer.max) { : > missing value where TRUE/FALSE needed > I went one step further and realised that the problem happens because the > bitwise shift below returns NA. > bitwShiftL(-1073741824,1) > where bitwShiftL is an R function. > I believe the bitwShiftL function is working as it is supposed to. Therefore, > this PR fixes it in the SparkR package: > https://github.com/apache/spark/pull/10436 > . -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
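The NA appears because R's `bitwShiftL` works on 32-bit signed integers and signals a shift out of that range as NA rather than wrapping. What a hash function needs instead is two's-complement wraparound; a sketch of that behaviour in Python (an illustration of the arithmetic, not the code in the linked PR):

```python
def shift_left_32(x, n):
    """Left-shift with 32-bit two's-complement wraparound, instead of
    the NA that R's bitwShiftL produces on signed overflow."""
    v = (x << n) & 0xFFFFFFFF                     # keep the low 32 bits
    return v - 0x100000000 if v >= 0x80000000 else v

print(shift_left_32(-1073741824, 1))   # -2147483648, not NA
```

The failing input from the report, -1073741824 (0xC0000000), shifts into the sign bit; with wraparound it simply becomes the minimum 32-bit integer, so the subsequent `value > .Machine$integer.max` comparison gets a real number to test.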
[jira] [Assigned] (SPARK-12479) sparkR collect on GroupedData throws R error "missing value where TRUE/FALSE needed"
[ https://issues.apache.org/jira/browse/SPARK-12479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12479: Assignee: Apache Spark > sparkR collect on GroupedData throws R error "missing value where > TRUE/FALSE needed" > -- > > Key: SPARK-12479 > URL: https://issues.apache.org/jira/browse/SPARK-12479 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.5.1 >Reporter: Paulo Magalhaes >Assignee: Apache Spark > > sparkR collect on GroupedData throws "missing value where TRUE/FALSE needed" > Spark Version: 1.5.1 > R Version: 3.2.2 > I tracked down the root cause of this exception to a specific key for which > the hashCode could not be calculated. > The following code recreates the problem when run in SparkR: > hashCode <- getFromNamespace("hashCode","SparkR") > hashCode("bc53d3605e8a5b7de1e8e271c2317645") > Error in if (value > .Machine$integer.max) { : > missing value where TRUE/FALSE needed > I went one step further and realised that the problem happens because the > bitwise shift below returns NA. > bitwShiftL(-1073741824,1) > where bitwShiftL is an R function. > I believe the bitwShiftL function is working as it is supposed to. Therefore, > this PR fixes it in the SparkR package: > https://github.com/apache/spark/pull/10436 > . -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11295) Add packages to JUnit output for Python tests
[ https://issues.apache.org/jira/browse/SPARK-11295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-11295: -- Assignee: Gabor Liptak > Add packages to JUnit output for Python tests > - > > Key: SPARK-11295 > URL: https://issues.apache.org/jira/browse/SPARK-11295 > Project: Spark > Issue Type: Improvement > Components: PySpark, Tests >Reporter: Gabor Liptak >Assignee: Gabor Liptak >Priority: Minor > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11295) Add packages to JUnit output for Python tests
[ https://issues.apache.org/jira/browse/SPARK-11295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-11295: -- Target Version/s: 2.0.0 > Add packages to JUnit output for Python tests > - > > Key: SPARK-11295 > URL: https://issues.apache.org/jira/browse/SPARK-11295 > Project: Spark > Issue Type: Improvement > Components: PySpark, Tests >Reporter: Gabor Liptak >Assignee: Gabor Liptak >Priority: Minor > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11295) Add packages to JUnit output for Python tests
[ https://issues.apache.org/jira/browse/SPARK-11295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-11295: -- Component/s: PySpark > Add packages to JUnit output for Python tests > - > > Key: SPARK-11295 > URL: https://issues.apache.org/jira/browse/SPARK-11295 > Project: Spark > Issue Type: Improvement > Components: PySpark, Tests >Reporter: Gabor Liptak >Assignee: Gabor Liptak >Priority: Minor > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6166) Add config to limit number of concurrent outbound connections for shuffle fetch
[ https://issues.apache.org/jira/browse/SPARK-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107526#comment-15107526 ] Apache Spark commented on SPARK-6166: - User 'redsanket' has created a pull request for this issue: https://github.com/apache/spark/pull/10838 > Add config to limit number of concurrent outbound connections for shuffle > fetch > --- > > Key: SPARK-6166 > URL: https://issues.apache.org/jira/browse/SPARK-6166 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.4.0 >Reporter: Mridul Muralidharan >Assignee: Shixiong Zhu >Priority: Minor > > spark.reducer.maxMbInFlight puts a bound on the in-flight data in terms of > size. > But this is not always sufficient: when the number of hosts in the cluster > increases, this can lead to a very large number of inbound connections to one > or more nodes, causing workers to fail under the load. > I propose we also add a spark.reducer.maxReqsInFlight, which puts a bound on > the number of outstanding outbound connections. > This might still cause hotspots in the cluster, but in our tests this has > significantly reduced the occurrence of worker failures. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12845) During join Spark should pushdown predicates on joining column to both tables
[ https://issues.apache.org/jira/browse/SPARK-12845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107603#comment-15107603 ] Xiao Li commented on SPARK-12845: - Let me know if you hit any bug. Thanks! > During join Spark should pushdown predicates on joining column to both tables > - > > Key: SPARK-12845 > URL: https://issues.apache.org/jira/browse/SPARK-12845 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Maciej Bryński > > I have the following issue. > I'm joining two tables with a where condition: > {code} > select * from t1 join t2 on t1.id1 = t2.id2 where t1.id1 = 1234 > {code} > In this query the predicate is only pushed down to t1. > To get predicates on both tables I should run the following query: > {code} > select * from t1 join t2 on t1.id1 = t2.id2 where t1.id1 = 1234 and t2.id2 = > 1234 > {code} > Spark should present the same behaviour for both queries. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
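The requested behaviour is transitive constraint propagation: a constant filter on one side of an equi-join implies the same filter on the join column of the other side. A toy model of that inference in Python (a simplified illustration, not Catalyst's actual optimizer rule):

```python
def infer_transitive_filters(equi_joins, filters):
    """Given equality join conditions [(colA, colB), ...] and constant
    filters {(col, value), ...}, derive the filters implied across the
    join, iterating to a fixpoint."""
    derived = set(filters)
    changed = True
    while changed:
        changed = False
        for a, b in equi_joins:
            for col, val in list(derived):
                for src, dst in ((a, b), (b, a)):
                    if col == src and (dst, val) not in derived:
                        derived.add((dst, val))   # propagate across the join
                        changed = True
    return derived

joins = [("t1.id1", "t2.id2")]
inferred = infer_transitive_filters(joins, {("t1.id1", 1234)})
print(sorted(inferred))   # filter now appears on both join columns
```

With the derived filter on t2.id2 available, both of the reporter's queries would plan identically.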
[jira] [Commented] (SPARK-12910) Support for specifying version of R to use while creating sparkR libraries
[ https://issues.apache.org/jira/browse/SPARK-12910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107369#comment-15107369 ] Apache Spark commented on SPARK-12910: -- User 'napsternxg' has created a pull request for this issue: https://github.com/apache/spark/pull/10836 > Support for specifying version of R to use while creating sparkR libraries > -- > > Key: SPARK-12910 > URL: https://issues.apache.org/jira/browse/SPARK-12910 > Project: Spark > Issue Type: Improvement > Components: SparkR > Environment: Linux >Reporter: Shubhanshu Mishra >Priority: Minor > Labels: installation, sparkR > > When we use `$SPARK_HOME/R/install-dev.sh` it uses the default system R. > However, a user might have locally installed their own version of R. There > should be a way to specify which R version to use. > I have fixed this in my code using the following patch: > ``` > $ git diff HEAD > diff --git a/R/README.md b/R/README.md > index 005f56d..99182e5 100644 > --- a/R/README.md > +++ b/R/README.md > @@ -1,6 +1,15 @@ > # R on Spark > > SparkR is an R package that provides a light-weight frontend to use Spark > from R. > +### Installing sparkR > + > +Libraries of sparkR need to be created in `$SPARK_HOME/R/lib`. This can be > done by running the script `$SPARK_HOME/R/install-dev.sh`. > +By default the above script uses the system wide installation of R. However, > this can be changed to any user installed location of R by giving the full > path of the `$R_HOME` as the first argument to the install-dev.sh script. > +Example: > +``` > +# where /home/username/R is where R is installed and /home/username/R/bin > contains the files R and RScript > +./install-dev.sh /home/username/R > +``` > > ### SparkR development > > diff --git a/R/install-dev.sh b/R/install-dev.sh > index 4972bb9..a8efa86 100755 > --- a/R/install-dev.sh > +++ b/R/install-dev.sh > @@ -35,12 +35,19 @@ LIB_DIR="$FWDIR/lib" > mkdir -p $LIB_DIR > > pushd $FWDIR > /dev/null > +if [ ! 
-z "$1" ] > + then > +R_HOME="$1/bin" > + else > +R_HOME="$(dirname $(which R))" > +fi > +echo "USING R_HOME = $R_HOME" > > # Generate Rd files if devtools is installed > -Rscript -e ' if("devtools" %in% rownames(installed.packages())) { > library(devtools); devtools::document(pkg="./pkg", roclets=c("rd")) }' > +"$R_HOME/"Rscript -e ' if("devtools" %in% rownames(installed.packages())) { > library(devtools); devtools::document(pkg="./pkg", roclets=c("rd")) }' > > # Install SparkR to $LIB_DIR > -R CMD INSTALL --library=$LIB_DIR $FWDIR/pkg/ > +"$R_HOME/"R CMD INSTALL --library=$LIB_DIR $FWDIR/pkg/ > > # Zip the SparkR package so that it can be distributed to worker nodes on > YARN > cd $LIB_DIR > ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10777) order by fails when column is aliased and projection includes windowed aggregate
[ https://issues.apache.org/jira/browse/SPARK-10777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10777: Assignee: Apache Spark > order by fails when column is aliased and projection includes windowed > aggregate > > > Key: SPARK-10777 > URL: https://issues.apache.org/jira/browse/SPARK-10777 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: N Campbell >Assignee: Apache Spark > > This statement fails in SPARK (works fine in ORACLE, DB2 ) > select r as c1, min ( s ) over () as c2 from > ( select rnum r, sum ( cint ) s from certstring.tint group by rnum ) t > order by r > Error: org.apache.spark.sql.AnalysisException: cannot resolve 'r' given input > columns c1, c2; line 3 pos 9 > SQLState: null > ErrorCode: 0 > Forcing the aliased column name works around the defect > select r as c1, min ( s ) over () as c2 from > ( select rnum r, sum ( cint ) s from certstring.tint group by rnum ) t > order by c1 > These work fine > select r as c1, min ( s ) over () as c2 from > ( select rnum r, sum ( cint ) s from certstring.tint group by rnum ) t > order by c1 > select r as c1, s as c2 from > ( select rnum r, sum ( cint ) s from certstring.tint group by rnum ) t > order by r > create table if not exists TINT ( RNUM int , CINT int ) > ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n' > STORED AS ORC ; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
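The fix this report implies is a resolution fallback for the sort expression: resolve the ORDER BY column against the projection's output aliases first, then fall back to the underlying input columns. A toy model of that lookup order (an illustration only, not Catalyst's resolver):

```python
def resolve_sort_column(name, output_aliases, input_columns):
    """Resolve an ORDER BY column: output aliases first, then the
    columns of the child plan, else fail as the reported error does."""
    if name in output_aliases:
        return output_aliases[name]
    if name in input_columns:
        return input_columns[name]
    raise ValueError(f"cannot resolve '{name}' given input columns "
                     f"{sorted(output_aliases)}")

# `select r as c1, min(s) over () as c2 ... order by r`
aliases = {"c1": "t.r", "c2": "min(s) over ()"}
inputs = {"r": "t.r", "s": "t.s"}
print(resolve_sort_column("c1", aliases, inputs))  # resolves via the alias
print(resolve_sort_column("r", aliases, inputs))   # falls back to the input
```

Without the second lookup, `order by r` fails exactly as in the report, while `order by c1` succeeds.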
[jira] [Assigned] (SPARK-10777) order by fails when column is aliased and projection includes windowed aggregate
[ https://issues.apache.org/jira/browse/SPARK-10777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10777: Assignee: (was: Apache Spark) > order by fails when column is aliased and projection includes windowed > aggregate > > > Key: SPARK-10777 > URL: https://issues.apache.org/jira/browse/SPARK-10777 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: N Campbell > > This statement fails in SPARK (works fine in ORACLE, DB2 ) > select r as c1, min ( s ) over () as c2 from > ( select rnum r, sum ( cint ) s from certstring.tint group by rnum ) t > order by r > Error: org.apache.spark.sql.AnalysisException: cannot resolve 'r' given input > columns c1, c2; line 3 pos 9 > SQLState: null > ErrorCode: 0 > Forcing the aliased column name works around the defect > select r as c1, min ( s ) over () as c2 from > ( select rnum r, sum ( cint ) s from certstring.tint group by rnum ) t > order by c1 > These work fine > select r as c1, min ( s ) over () as c2 from > ( select rnum r, sum ( cint ) s from certstring.tint group by rnum ) t > order by c1 > select r as c1, s as c2 from > ( select rnum r, sum ( cint ) s from certstring.tint group by rnum ) t > order by r > create table if not exists TINT ( RNUM int , CINT int ) > ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n' > STORED AS ORC ; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10777) order by fails when column is aliased and projection includes windowed aggregate
[ https://issues.apache.org/jira/browse/SPARK-10777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107434#comment-15107434 ] Apache Spark commented on SPARK-10777: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/10678 > order by fails when column is aliased and projection includes windowed > aggregate > > > Key: SPARK-10777 > URL: https://issues.apache.org/jira/browse/SPARK-10777 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: N Campbell > > This statement fails in SPARK (works fine in ORACLE, DB2 ) > select r as c1, min ( s ) over () as c2 from > ( select rnum r, sum ( cint ) s from certstring.tint group by rnum ) t > order by r > Error: org.apache.spark.sql.AnalysisException: cannot resolve 'r' given input > columns c1, c2; line 3 pos 9 > SQLState: null > ErrorCode: 0 > Forcing the aliased column name works around the defect > select r as c1, min ( s ) over () as c2 from > ( select rnum r, sum ( cint ) s from certstring.tint group by rnum ) t > order by c1 > These work fine > select r as c1, min ( s ) over () as c2 from > ( select rnum r, sum ( cint ) s from certstring.tint group by rnum ) t > order by c1 > select r as c1, s as c2 from > ( select rnum r, sum ( cint ) s from certstring.tint group by rnum ) t > order by r > create table if not exists TINT ( RNUM int , CINT int ) > ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n' > STORED AS ORC ; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
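The bug boils down to a resolution order for ORDER BY references: a name should resolve either to an alias produced by the SELECT list or to a column of the projection's input, but here only the output aliases were found. A minimal sketch of that two-step lookup in Python (illustrative names; this is not the analyzer code in the pull request above):

```python
def resolve_order_by(column, output_aliases, child_columns):
    """Resolve an ORDER BY reference for a projection.

    output_aliases: alias -> underlying expression, e.g. {"c1": "r"}
    child_columns:  columns visible in the projection's input, e.g. ["r", "s"]
    """
    # ORDER BY c1 -- an alias exposed by the SELECT list
    if column in output_aliases:
        return output_aliases[column]
    # ORDER BY r -- an input column of the projection; this is the branch
    # that effectively went missing for windowed projections in 1.5
    if column in child_columns:
        return column
    raise ValueError("cannot resolve '%s' given input columns %s"
                     % (column, sorted(output_aliases)))
```

With both branches in place, `ORDER BY r` and `ORDER BY c1` resolve to the same underlying column, matching the Oracle/DB2 behaviour the reporter expects.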
[jira] [Resolved] (SPARK-9716) BinaryClassificationEvaluator should accept Double prediction column
[ https://issues.apache.org/jira/browse/SPARK-9716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-9716. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10472 [https://github.com/apache/spark/pull/10472] > BinaryClassificationEvaluator should accept Double prediction column > > > Key: SPARK-9716 > URL: https://issues.apache.org/jira/browse/SPARK-9716 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Assignee: Benjamin Fradet >Priority: Minor > Fix For: 2.0.0 > > > BinaryClassificationEvaluator currently expects the rawPrediction column, of > type Vector. It should also accept a Double prediction column, with a > different set of supported metrics. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6166) Add config to limit number of concurrent outbound connections for shuffle fetch
[ https://issues.apache.org/jira/browse/SPARK-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-6166: - Assignee: (was: Shixiong Zhu) > Add config to limit number of concurrent outbound connections for shuffle > fetch > --- > > Key: SPARK-6166 > URL: https://issues.apache.org/jira/browse/SPARK-6166 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.4.0 >Reporter: Mridul Muralidharan >Priority: Minor > > spark.reducer.maxMbInFlight puts a bound on the in-flight data in terms of > size. > But this is not always sufficient: when the number of hosts in the cluster > increases, this can lead to a very large number of inbound connections to one > or more nodes, causing workers to fail under the load. > I propose we also add a spark.reducer.maxReqsInFlight, which puts a bound on the > number of outstanding outbound connections. > This might still cause hotspots in the cluster, but in our tests this has > significantly reduced the occurrence of worker failures. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
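The proposal pairs the existing byte bound (spark.reducer.maxMbInFlight) with a bound on outstanding requests. A minimal sketch of the combined admission check, assuming illustrative names rather than Spark's shuffle-client internals:

```python
import threading

class FetchThrottle:
    """Bound both in-flight bytes and in-flight requests (a sketch of the
    proposed spark.reducer.maxReqsInFlight semantics; names are illustrative,
    not Spark's internals)."""
    def __init__(self, max_bytes_in_flight, max_reqs_in_flight):
        self.max_bytes = max_bytes_in_flight
        self.bytes_in_flight = 0
        self.reqs = threading.Semaphore(max_reqs_in_flight)
        self.lock = threading.Lock()

    def try_acquire(self, request_bytes):
        """Return True if a fetch of request_bytes may be issued now."""
        if not self.reqs.acquire(blocking=False):
            return False                      # too many outstanding requests
        with self.lock:
            if self.bytes_in_flight + request_bytes > self.max_bytes:
                self.reqs.release()
                return False                  # too many bytes outstanding
            self.bytes_in_flight += request_bytes
            return True

    def release(self, request_bytes):
        """Called when a fetch completes, freeing both budgets."""
        with self.lock:
            self.bytes_in_flight -= request_bytes
        self.reqs.release()
```

A fetcher would call `try_acquire` before issuing each request and `release` in the completion callback; requests denied on either budget are retried later.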
[jira] [Created] (SPARK-12912) Add test suite for EliminateSubQueries
Reynold Xin created SPARK-12912: --- Summary: Add test suite for EliminateSubQueries Key: SPARK-12912 URL: https://issues.apache.org/jira/browse/SPARK-12912 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12912) Add test suite for EliminateSubQueries
[ https://issues.apache.org/jira/browse/SPARK-12912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107606#comment-15107606 ] Apache Spark commented on SPARK-12912: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/10837 > Add test suite for EliminateSubQueries > -- > > Key: SPARK-12912 > URL: https://issues.apache.org/jira/browse/SPARK-12912 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12912) Add test suite for EliminateSubQueries
[ https://issues.apache.org/jira/browse/SPARK-12912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12912: Assignee: Reynold Xin (was: Apache Spark) > Add test suite for EliminateSubQueries > -- > > Key: SPARK-12912 > URL: https://issues.apache.org/jira/browse/SPARK-12912 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12912) Add test suite for EliminateSubQueries
[ https://issues.apache.org/jira/browse/SPARK-12912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12912: Assignee: Apache Spark (was: Reynold Xin) > Add test suite for EliminateSubQueries > -- > > Key: SPARK-12912 > URL: https://issues.apache.org/jira/browse/SPARK-12912 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2750) Add Https support for Web UI
[ https://issues.apache.org/jira/browse/SPARK-2750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-2750. --- Resolution: Fixed Assignee: Fei Wang Fix Version/s: 2.0.0 > Add Https support for Web UI > > > Key: SPARK-2750 > URL: https://issues.apache.org/jira/browse/SPARK-2750 > Project: Spark > Issue Type: New Feature > Components: Web UI >Reporter: Tao Wang >Assignee: Fei Wang > Labels: https, ssl, webui > Fix For: 2.0.0 > > Attachments: exception on yarn when https enabled.txt > > Original Estimate: 96h > Remaining Estimate: 96h > > I am trying to add HTTPS support for the Web UI using Jetty SSL integration. Below > is the plan: > 1. The Web UI includes the Master UI, Worker UI, HistoryServer UI and Spark UI. Users > can switch between HTTPS and HTTP by configuring "spark.http.policy" as a JVM > property for each process; HTTP is the default. > 2. The web ports of the Master and Worker are decided in order of launch > arguments, JVM property, system environment and default port. > 3. Below are some other configuration items: > {code} > spark.ssl.server.keystore.location The file or URL of the SSL Key store > spark.ssl.server.keystore.password The password for the key store > spark.ssl.server.keystore.keypassword The password (if any) for the specific > key within the key store > spark.ssl.server.keystore.type The type of the key store (default "JKS") > spark.client.https.need-auth True if SSL needs client authentication > spark.ssl.server.truststore.location The file name or URL of the trust store > location > spark.ssl.server.truststore.password The password for the trust store > spark.ssl.server.truststore.type The type of the trust store (default "JKS") > {code} > Any feedback is welcome! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11295) Add packages to JUnit output for Python tests
[ https://issues.apache.org/jira/browse/SPARK-11295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-11295. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 9263 [https://github.com/apache/spark/pull/9263] > Add packages to JUnit output for Python tests > - > > Key: SPARK-11295 > URL: https://issues.apache.org/jira/browse/SPARK-11295 > Project: Spark > Issue Type: Improvement > Components: Tests >Reporter: Gabor Liptak >Priority: Minor > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12845) During join Spark should pushdown predicates on joining column to both tables
[ https://issues.apache.org/jira/browse/SPARK-12845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107600#comment-15107600 ] Xiao Li commented on SPARK-12845: - I think the following PR resolves your issue: https://github.com/apache/spark/pull/10490 Right? > During join Spark should pushdown predicates on joining column to both tables > - > > Key: SPARK-12845 > URL: https://issues.apache.org/jira/browse/SPARK-12845 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Maciej Bryński > > I have the following issue. > I'm joining two tables with a where condition > {code} > select * from t1 join t2 on t1.id1 = t2.id2 where t1.id = 1234 > {code} > In this query the predicate is only pushed down to t1. > To have predicates on both tables I should run the following query: > {code} > select * from t1 join t2 on t1.id1 = t2.id2 where t1.id = 1234 and t2.id2 = > 1234 > {code} > Spark should exhibit the same behaviour for both queries. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
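The requested behaviour amounts to propagating a filter across equi-join keys so both sides of the join can be pruned. A hedged sketch of that inference step (plain Python over column names; not Spark's optimizer rule, which operates on expression trees and constraints):

```python
def infer_join_predicates(join_keys, predicates):
    """Propagate single-column equality predicates across equi-join keys.

    join_keys:  list of (left_col, right_col) pairs, e.g. [("t1.id1", "t2.id2")]
    predicates: dict of col -> literal from the WHERE clause, e.g. {"t1.id1": 1234}

    Returns the predicate set augmented with the inferred conditions, so a
    filter can be pushed to *both* sides of the join.
    """
    inferred = dict(predicates)
    for left, right in join_keys:
        # t1.id1 = 1234 and t1.id1 = t2.id2 together imply t2.id2 = 1234
        if left in predicates:
            inferred.setdefault(right, predicates[left])
        if right in predicates:
            inferred.setdefault(left, predicates[right])
    return inferred
```

With the inferred predicate, the first query in the report behaves like the manually rewritten second one.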
[jira] [Created] (SPARK-12910) Support for specifying version of R to use while creating SparkR libraries
Shubhanshu Mishra created SPARK-12910: - Summary: Support for specifying version of R to use while creating SparkR libraries Key: SPARK-12910 URL: https://issues.apache.org/jira/browse/SPARK-12910 Project: Spark Issue Type: Improvement Components: SparkR Environment: Linux Reporter: Shubhanshu Mishra Priority: Minor When we use `$SPARK_HOME/R/install-dev.sh` it uses the default system R. However, a user might have locally installed their own version of R. There should be a way to specify which R version to use. I have fixed this in my code using the following patch:
```
$ git diff HEAD
diff --git a/R/README.md b/R/README.md
index 005f56d..99182e5 100644
--- a/R/README.md
+++ b/R/README.md
@@ -1,6 +1,15 @@
 # R on Spark

 SparkR is an R package that provides a light-weight frontend to use Spark from R.

+### Installing SparkR
+
+Libraries of SparkR need to be created in `$SPARK_HOME/R/lib`. This can be done by running the script `$SPARK_HOME/R/install-dev.sh`.
+By default the above script uses the system-wide installation of R. However, this can be changed to any user-installed location of R by giving the full path of the `$R_HOME` as the first argument to the install-dev.sh script.
+Example:
+```
+# where /home/username/R is where R is installed and /home/username/R/bin contains the files R and Rscript
+./install-dev.sh /home/username/R
+```
 ### SparkR development
diff --git a/R/install-dev.sh b/R/install-dev.sh
index 4972bb9..a8efa86 100755
--- a/R/install-dev.sh
+++ b/R/install-dev.sh
@@ -35,12 +35,19 @@ LIB_DIR="$FWDIR/lib"
 mkdir -p $LIB_DIR

 pushd $FWDIR > /dev/null

+if [ ! -z "$1" ]
+ then
+R_HOME="$1/bin"
+ else
+R_HOME="$(dirname $(which R))"
+fi
+echo "USING R_HOME = $R_HOME"
 # Generate Rd files if devtools is installed
-Rscript -e ' if("devtools" %in% rownames(installed.packages())) { library(devtools); devtools::document(pkg="./pkg", roclets=c("rd")) }'
+"$R_HOME/"Rscript -e ' if("devtools" %in% rownames(installed.packages())) { library(devtools); devtools::document(pkg="./pkg", roclets=c("rd")) }'

 # Install SparkR to $LIB_DIR
-R CMD INSTALL --library=$LIB_DIR $FWDIR/pkg/
+"$R_HOME/"R CMD INSTALL --library=$LIB_DIR $FWDIR/pkg/

 # Zip the SparkR package so that it can be distributed to worker nodes on YARN
 cd $LIB_DIR
```
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12816) Schema generation for type aliases does not work
[ https://issues.apache.org/jira/browse/SPARK-12816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-12816. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10749 [https://github.com/apache/spark/pull/10749] > Schema generation for type aliases does not work > > > Key: SPARK-12816 > URL: https://issues.apache.org/jira/browse/SPARK-12816 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Jakob Odersky > Fix For: 2.0.0 > > > Related to the second part of SPARK-12777. > Assume the following: > {code} > case class Container[A](a: A) > type IntContainer = Container[Int] > {code} > Generating a schema with > {code}org.apache.spark.sql.catalyst.ScalaReflection.schemaFor[IntContainer]{code} > fails miserably with {{NoSuchElementException: : head of empty list > (ScalaReflection.scala:504)}} (the same exception as described in the related > issues) > Since {{schemaFor}} is called whenever a schema is implicitly needed, > {{Datasets}} cannot be created from certain aliased types. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12911) Caching a dataframe causes array comparisons to fail (in filter / where) after 1.6
Jesse English created SPARK-12911: - Summary: Caching a dataframe causes array comparisons to fail (in filter / where) after 1.6 Key: SPARK-12911 URL: https://issues.apache.org/jira/browse/SPARK-12911 Project: Spark Issue Type: Bug Affects Versions: 1.6.0 Environment: OSX 10.11.1, Scala 2.11.7, Spark 1.6.0 Reporter: Jesse English When doing a *where* operation on a dataframe and testing for equality on an array type, after 1.6 no valid comparisons are made if the dataframe has been cached. If it has not been cached, the results are as expected. This appears to be related to the underlying unsafe array data types.
{code:title=test.scala|borderStyle=solid}
test("test array comparison") {
  val vectors: Vector[Row] = Vector(
    Row.fromTuple("id_1" -> Array(0L, 2L)),
    Row.fromTuple("id_2" -> Array(0L, 5L)),
    Row.fromTuple("id_3" -> Array(0L, 9L)),
    Row.fromTuple("id_4" -> Array(1L, 0L)),
    Row.fromTuple("id_5" -> Array(1L, 8L)),
    Row.fromTuple("id_6" -> Array(2L, 4L)),
    Row.fromTuple("id_7" -> Array(5L, 6L)),
    Row.fromTuple("id_8" -> Array(6L, 2L)),
    Row.fromTuple("id_9" -> Array(7L, 0L))
  )

  val data: RDD[Row] = sc.parallelize(vectors, 3)

  val schema = StructType(
    StructField("id", StringType, false) ::
    StructField("point", DataTypes.createArrayType(LongType, false), false) ::
    Nil
  )

  val sqlContext = new SQLContext(sc)
  val dataframe = sqlContext.createDataFrame(data, schema)
  val targetPoint: Array[Long] = Array(0L, 9L)

  // Caching is the trigger that causes the error (no caching causes no error)
  dataframe.cache()

  // This is the line where it fails:
  // java.util.NoSuchElementException: next on empty iterator
  // However we know that there is a valid match
  val targetRow = dataframe.where(dataframe("point") ===
    array(targetPoint.map(value => lit(value)): _*)).first()

  assert(targetRow != null)
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12790) Remove HistoryServer old multiple files format
[ https://issues.apache.org/jira/browse/SPARK-12790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-12790: -- Assignee: Felix Cheung > Remove HistoryServer old multiple files format > -- > > Key: SPARK-12790 > URL: https://issues.apache.org/jira/browse/SPARK-12790 > Project: Spark > Issue Type: Sub-task > Components: Deploy >Reporter: Andrew Or >Assignee: Felix Cheung > > HistoryServer has 2 formats. The old one makes a directory and puts multiple > files in there (APPLICATION_COMPLETE, EVENT_LOG1 etc.). The new one has just > 1 file called local_2593759238651.log or something. > It's been a nightmare to maintain both code paths. We should just remove the > old legacy format (which has been out of use for many versions now) when we > still have the chance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12790) Remove HistoryServer old multiple files format
[ https://issues.apache.org/jira/browse/SPARK-12790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107167#comment-15107167 ] Andrew Or commented on SPARK-12790: --- also updating all the tests that rely on the old format. If you look under core/src/test/resources there are a bunch of those > Remove HistoryServer old multiple files format > -- > > Key: SPARK-12790 > URL: https://issues.apache.org/jira/browse/SPARK-12790 > Project: Spark > Issue Type: Sub-task > Components: Deploy >Reporter: Andrew Or > > HistoryServer has 2 formats. The old one makes a directory and puts multiple > files in there (APPLICATION_COMPLETE, EVENT_LOG1 etc.). The new one has just > 1 file called local_2593759238651.log or something. > It's been a nightmare to maintain both code paths. We should just remove the > old legacy format (which has been out of use for many versions now) when we > still have the chance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12790) Remove HistoryServer old multiple files format
[ https://issues.apache.org/jira/browse/SPARK-12790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107169#comment-15107169 ] Andrew Or commented on SPARK-12790: --- I've assigned this to you. > Remove HistoryServer old multiple files format > -- > > Key: SPARK-12790 > URL: https://issues.apache.org/jira/browse/SPARK-12790 > Project: Spark > Issue Type: Sub-task > Components: Deploy >Reporter: Andrew Or >Assignee: Felix Cheung > > HistoryServer has 2 formats. The old one makes a directory and puts multiple > files in there (APPLICATION_COMPLETE, EVENT_LOG1 etc.). The new one has just > 1 file called local_2593759238651.log or something. > It's been a nightmare to maintain both code paths. We should just remove the > old legacy format (which has been out of use for many versions now) when we > still have the chance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12907) Use BitSet to represent null fields in ColumnVector
[ https://issues.apache.org/jira/browse/SPARK-12907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12907: Assignee: (was: Apache Spark) > Use BitSet to represent null fields in ColumnVector > --- > > Key: SPARK-12907 > URL: https://issues.apache.org/jira/browse/SPARK-12907 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Kazuaki Ishizaki >Priority: Minor > > Use bit vectors (BitSet) to represent null fields information in ColumnVector > to reduce memory footprint. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12907) Use BitSet to represent null fields in ColumnVector
[ https://issues.apache.org/jira/browse/SPARK-12907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12907: Assignee: Apache Spark > Use BitSet to represent null fields in ColumnVector > --- > > Key: SPARK-12907 > URL: https://issues.apache.org/jira/browse/SPARK-12907 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Kazuaki Ishizaki >Assignee: Apache Spark >Priority: Minor > > Use bit vectors (BitSet) to represent null fields information in ColumnVector > to reduce memory footprint. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12887) Do not expose var's in TaskMetrics
[ https://issues.apache.org/jira/browse/SPARK-12887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-12887. --- Resolution: Fixed Fix Version/s: 2.0.0 > Do not expose var's in TaskMetrics > -- > > Key: SPARK-12887 > URL: https://issues.apache.org/jira/browse/SPARK-12887 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Andrew Or >Assignee: Andrew Or > Fix For: 2.0.0 > > > TaskMetrics has a bunch of var's, some are fully public, some are > private[spark]. This is bad coding style that makes it easy to accidentally > overwrite previously set metrics. This has happened a few times in the past > and caused bugs that were difficult to debug. > Instead, we should have get-or-create semantics, which are more readily > understandable. This makes sense in the case of TaskMetrics because these are > just aggregated metrics that we want to collect throughout the task, so it > doesn't matter *who*'s incrementing them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
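The get-or-create semantics described above can be sketched as a registry that always hands back the same counter object for a given name, so no caller can clobber another's value by reassigning a var. This is an illustrative sketch, not Spark's TaskMetrics:

```python
class Metrics:
    """Get-or-create metric registration: callers share one counter per
    name, so nobody accidentally overwrites a previously set metric."""
    def __init__(self):
        self._counters = {}

    def counter(self, name):
        # Always return the existing counter if one was already registered;
        # a mutable var would instead let a second caller replace it.
        return self._counters.setdefault(name, {"value": 0})

    def inc(self, name, delta=1):
        self.counter(name)["value"] += delta
```

Because the metrics are aggregates collected throughout the task, it does not matter which code path increments them; the shared counter makes that explicit.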
[jira] [Commented] (SPARK-12907) Use BitSet to represent null fields in ColumnVector
[ https://issues.apache.org/jira/browse/SPARK-12907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107200#comment-15107200 ] Apache Spark commented on SPARK-12907: -- User 'kiszk' has created a pull request for this issue: https://github.com/apache/spark/pull/10833 > Use BitSet to represent null fields in ColumnVector > --- > > Key: SPARK-12907 > URL: https://issues.apache.org/jira/browse/SPARK-12907 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Kazuaki Ishizaki >Priority: Minor > > Use bit vectors (BitSet) to represent null fields information in ColumnVector > to reduce memory footprint. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
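The footprint idea is to keep one bit of null information per row instead of one byte (or boxed flag) per row. A sketch of such a bit set (illustrative; not the ColumnVector code in the pull request):

```python
class NullBitSet:
    """Track null flags with one bit per row instead of one byte per row,
    the memory-footprint idea behind the proposal."""
    def __init__(self, num_rows):
        # ceil(num_rows / 8) bytes of backing storage
        self.words = bytearray((num_rows + 7) // 8)

    def set_null(self, row):
        self.words[row >> 3] |= 1 << (row & 7)

    def is_null(self, row):
        return (self.words[row >> 3] >> (row & 7)) & 1 == 1

    def footprint_bytes(self):
        return len(self.words)
```

For a million-row vector this is roughly 125 KB of null metadata instead of roughly 1 MB with a byte per row, an 8x reduction.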
[jira] [Resolved] (SPARK-12804) ml.classification.LogisticRegression fails when FitIntercept with same-label dataset
[ https://issues.apache.org/jira/browse/SPARK-12804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai resolved SPARK-12804. - Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10743 [https://github.com/apache/spark/pull/10743] > ml.classification.LogisticRegression fails when FitIntercept with same-label > dataset > > > Key: SPARK-12804 > URL: https://issues.apache.org/jira/browse/SPARK-12804 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.0 >Reporter: Feynman Liang >Assignee: Feynman Liang > Fix For: 2.0.0 > > > When training LogisticRegression on a dataset where the label is all 0 or all > 1, an array out of bounds exception is thrown. The problematic code is > {code} > initialCoefficientsWithIntercept.toArray(numFeatures) > = math.log(histogram(1) / histogram(0)) > } > {code} > The correct behaviour is to short-circuit training entirely when only a > single label is present (can be detected from {{labelSummarizer}}) and return > a classifier which assigns all true/false with infinite weights. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
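The quoted initialization sets the intercept to the log-odds of the label histogram, log(histogram(1) / histogram(0)), which is undefined when either count is zero. A sketch of the short-circuit the issue calls for (hypothetical helper, not the merged Spark fix):

```python
import math

def intercept_init(histogram):
    """Initialize the intercept to the log-odds log(n1 / n0), and
    short-circuit when only one label is present, where the log-odds
    would otherwise be log(0) or a division by zero."""
    n0, n1 = histogram  # counts of label 0 and label 1
    if n0 == 0:
        return math.inf   # all labels are 1: always predict true
    if n1 == 0:
        return -math.inf  # all labels are 0: always predict false
    return math.log(n1 / n0)
```

In the single-label case no iterative training is needed at all; the classifier can return the constant prediction directly, matching the issue's suggested behaviour.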
[jira] [Assigned] (SPARK-12895) Implement TaskMetrics using accumulators
[ https://issues.apache.org/jira/browse/SPARK-12895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12895: Assignee: Apache Spark (was: Andrew Or) > Implement TaskMetrics using accumulators > > > Key: SPARK-12895 > URL: https://issues.apache.org/jira/browse/SPARK-12895 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Andrew Or >Assignee: Apache Spark > > We need to first do this before we can avoid sending TaskMetrics from the > executors to the driver. After we do this, we can send only accumulator > updates instead of both that AND TaskMetrics. > By the end of this issue TaskMetrics will be a wrapper of accumulators. It > will be only syntactic sugar for setting these accumulators. > But first, we need to express everything in TaskMetrics as accumulators. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12895) Implement TaskMetrics using accumulators
[ https://issues.apache.org/jira/browse/SPARK-12895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12895: Assignee: Andrew Or (was: Apache Spark) > Implement TaskMetrics using accumulators > > > Key: SPARK-12895 > URL: https://issues.apache.org/jira/browse/SPARK-12895 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Andrew Or >Assignee: Andrew Or > > We need to first do this before we can avoid sending TaskMetrics from the > executors to the driver. After we do this, we can send only accumulator > updates instead of both that AND TaskMetrics. > By the end of this issue TaskMetrics will be a wrapper of accumulators. It > will be only syntactic sugar for setting these accumulators. > But first, we need to express everything in TaskMetrics as accumulators. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11944) Python API for mllib.clustering.BisectingKMeans
[ https://issues.apache.org/jira/browse/SPARK-11944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-11944. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10150 [https://github.com/apache/spark/pull/10150] > Python API for mllib.clustering.BisectingKMeans > --- > > Key: SPARK-11944 > URL: https://issues.apache.org/jira/browse/SPARK-11944 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, MLlib, PySpark >Reporter: Yanbo Liang >Assignee: holdenk >Priority: Minor > Fix For: 2.0.0 > > > Add Python API for mllib.clustering.BisectingKMeans. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12485) Rename "dynamic allocation" to "elastic scaling"
[ https://issues.apache.org/jira/browse/SPARK-12485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107151#comment-15107151 ] Andrew Or commented on SPARK-12485: --- [~srowen] to answer your question no I don't feel super strongly about changing it. Naming is difficult in general and I think both "dynamic allocation" and "elastic scaling" do mean roughly the same thing. It's just that I slightly prefer the latter (or something shorter) after giving a few talks on this topic and chatting with a few people about it in real life. I'm also totally cool with closing this as a Won't Fix if you or [~markhamstra] prefer. > Rename "dynamic allocation" to "elastic scaling" > > > Key: SPARK-12485 > URL: https://issues.apache.org/jira/browse/SPARK-12485 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Andrew Or >Assignee: Andrew Or > > Fewer syllables, sounds more natural. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12870) better format bucket id in file name
[ https://issues.apache.org/jira/browse/SPARK-12870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-12870. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10799 [https://github.com/apache/spark/pull/10799] > better format bucket id in file name > > > Key: SPARK-12870 > URL: https://issues.apache.org/jira/browse/SPARK-12870 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API
[ https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107267#comment-15107267 ] Mark Grover commented on SPARK-12177: - Thanks Mario! bq. We should also have a python/pyspark/streaming/kafka-v09.py as well that matches to our external/kafka-v09 I agree, I will look into this. bq. Why do you have the Broker.scala class? Unless i am missing something, it should be knocked off Yeah, I noticed that too and I agree. This should be pretty simple to take out. I also [noticed|https://issues.apache.org/jira/browse/SPARK-12177?focusedCommentId=15089750=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15089750] that the v09 example was picking up some Kafka v08 jars, so I am working on fixing that too. bq. I think the package should be 'org.apache.spark.streaming.kafka' only in external/kafka-v09 and not 'org.apache.spark.streaming.kafka.v09'. This is because we produce a jar with a diff name (user picks which one and even if he/she mismatches, it errors correctly since the KafkaUtils method signatures are different) I totally understand what you mean. However, Kafka has its [own assembly in Spark|https://github.com/apache/spark/tree/master/external/kafka-assembly], and the way the code is structured right now, both the new API and the old API would go in the same assembly, so it's important to have a different package name. Also, for our end users transitioning from the old to the new API, I foresee them having two versions of their Spark-Kafka app: one that works with the old API and one that works with the new API. I think the transition would be easier if they could include both Kafka API versions on the Spark classpath and pick and choose which app to run, without mucking with Maven dependencies and re-compiling when they want to switch. Let me know if you disagree. 
> Update KafkaDStreams to new Kafka 0.9 Consumer API > -- > > Key: SPARK-12177 > URL: https://issues.apache.org/jira/browse/SPARK-12177 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.6.0 >Reporter: Nikita Tarasenko > Labels: consumer, kafka > > Kafka 0.9 has already been released and it introduces a new consumer API that is not > compatible with the old one, so I added the new consumer API. I made separate > classes in package org.apache.spark.streaming.kafka.v09 with the changed API. I > didn't remove the old classes, for backward compatibility: users will not need > to change their old Spark applications when they upgrade to a new Spark version. > Please review my changes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12907) Use BitSet to represent null fields in ColumnVector
Kazuaki Ishizaki created SPARK-12907: Summary: Use BitSet to represent null fields in ColumnVector Key: SPARK-12907 URL: https://issues.apache.org/jira/browse/SPARK-12907 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.0.0 Reporter: Kazuaki Ishizaki Priority: Minor Use bit vectors (BitSet) to represent null-field information in ColumnVector, to reduce the memory footprint. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
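The proposed change can be illustrated with a minimal sketch (hypothetical code, not Spark's actual ColumnVector implementation): a `java.util.BitSet` stores one bit per row for the null flags, instead of the one byte per row a `boolean[]` would take.

```java
import java.util.BitSet;

// Hypothetical sketch of a column that tracks nulls with a BitSet
// (~1 bit per row) rather than a boolean[] (~1 byte per row).
public class NullBitmapColumn {
    private final int[] values;
    private final BitSet nulls; // bit i set => row i is null

    public NullBitmapColumn(int capacity) {
        this.values = new int[capacity];
        this.nulls = new BitSet(capacity);
    }

    public void putInt(int row, int value) {
        values[row] = value;
        nulls.clear(row); // writing a value clears any previous null flag
    }

    public void putNull(int row) {
        nulls.set(row);
    }

    public boolean isNullAt(int row) {
        return nulls.get(row);
    }

    public int getInt(int row) {
        return values[row];
    }
}
```

For a million-row batch this is roughly 125 KB of null metadata instead of 1 MB, which is the footprint reduction the issue is after.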
[jira] [Commented] (SPARK-12906) LongSQLMetricValue cause memory leak on Spark 1.5.1
[ https://issues.apache.org/jira/browse/SPARK-12906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107203#comment-15107203 ] Josh Rosen commented on SPARK-12906: Ping [~zsxwing], since I know you've looked into similar leaks in the past. > LongSQLMetricValue cause memory leak on Spark 1.5.1 > --- > > Key: SPARK-12906 > URL: https://issues.apache.org/jira/browse/SPARK-12906 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.1 >Reporter: Sasi > Attachments: dump1.PNG, screenshot-1.png > > > Hi, > I upgraded my Spark from 1.5.0 to 1.5.1 after seeing that > scala.util.parsing.combinator.Parser$$anon$3 caused a memory leak. > Now, after taking another heap dump, I noticed, after 2 hours, that > LongSQLMetricValue causes a memory leak. > I didn't see any bug report or documentation about it. > Thanks, > Sasi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12895) Implement TaskMetrics using accumulators
[ https://issues.apache.org/jira/browse/SPARK-12895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107358#comment-15107358 ] Apache Spark commented on SPARK-12895: -- User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/10835 > Implement TaskMetrics using accumulators > > > Key: SPARK-12895 > URL: https://issues.apache.org/jira/browse/SPARK-12895 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Andrew Or >Assignee: Andrew Or > > We need to first do this before we can avoid sending TaskMetrics from the > executors to the driver. After we do this, we can send only accumulator > updates instead of both that AND TaskMetrics. > By the end of this issue TaskMetrics will be a wrapper of accumulators. It > will be only syntactic sugar for setting these accumulators. > But first, we need to express everything in TaskMetrics as accumulators. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12870) better format bucket id in file name
[ https://issues.apache.org/jira/browse/SPARK-12870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-12870: - Assignee: Wenchen Fan > better format bucket id in file name > > > Key: SPARK-12870 > URL: https://issues.apache.org/jira/browse/SPARK-12870 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-12485) Rename "dynamic allocation" to "elastic scaling"
[ https://issues.apache.org/jira/browse/SPARK-12485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin closed SPARK-12485. --- Resolution: Won't Fix I talked to Andrew more offline. Looks like this name isn't so bad that we have to change it. Let's just keep it for now. Thanks. > Rename "dynamic allocation" to "elastic scaling" > > > Key: SPARK-12485 > URL: https://issues.apache.org/jira/browse/SPARK-12485 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Andrew Or >Assignee: Andrew Or > > Fewer syllables, sounds more natural. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12908) Add tests to make sure that ml.classification.LogisticRegression returns meaningful result when labels are the same without intercept
DB Tsai created SPARK-12908: --- Summary: Add tests to make sure that ml.classification.LogisticRegression returns meaningful result when labels are the same without intercept Key: SPARK-12908 URL: https://issues.apache.org/jira/browse/SPARK-12908 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.6.0 Reporter: DB Tsai -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12650) No means to specify Xmx settings for SparkSubmit in yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107061#comment-15107061 ] John Vines commented on SPARK-12650: SPARK_SUBMIT_OPTS seems to work. -Xmx256m changed the heap settings for SparkSubmitJob, but left the driver alone and did not appear to cause the same conflict in the executors as mentioned above. I also did not see any logging about that setting (unlike SPARK_JAVA_OPTS, which I mentioned above). > No means to specify Xmx settings for SparkSubmit in yarn-cluster mode > - > > Key: SPARK-12650 > URL: https://issues.apache.org/jira/browse/SPARK-12650 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.5.2 > Environment: Hadoop 2.6.0 >Reporter: John Vines > > Background- > I have an app master designed to do some work and then launch a spark job. > Issue- > If I use yarn-cluster, then SparkSubmit does not set -Xmx at all, > leading to the JVM taking a default heap which is relatively large. This > causes a large amount of vmem to be taken, so that it is killed by YARN. This > can be worked around by disabling YARN's vmem check, but that is a hack. > If I run it in yarn-client mode, it's fine as long as my container has enough > space for the driver, which is manageable. But I feel that the utter lack of > Xmx settings for what I believe is a very small JVM is a problem. > I believe this was introduced with the fix for SPARK-3884 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12783) Dataset map serialization error
[ https://issues.apache.org/jira/browse/SPARK-12783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107178#comment-15107178 ] Wenchen Fan commented on SPARK-12783: - hi [~babloo80], can you move `MyMap` and `TestCaseClass` to top level(don't make them inner class) and try again? I can't reproduce your failure locally... > Dataset map serialization error > --- > > Key: SPARK-12783 > URL: https://issues.apache.org/jira/browse/SPARK-12783 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Muthu Jayakumar >Assignee: Wenchen Fan >Priority: Critical > > When Dataset API is used to map to another case class, an error is thrown. > {code} > case class MyMap(map: Map[String, String]) > case class TestCaseClass(a: String, b: String){ > def toMyMap: MyMap = { > MyMap(Map(a->b)) > } > def toStr: String = { > a > } > } > //Main method section below > import sqlContext.implicits._ > val df1 = sqlContext.createDataset(Seq(TestCaseClass("2015-05-01", "data1"), > TestCaseClass("2015-05-01", "data2"))).toDF() > df1.as[TestCaseClass].map(_.toStr).show() //works fine > df1.as[TestCaseClass].map(_.toMyMap).show() //fails > {code} > Error message: > {quote} > Caused by: java.io.NotSerializableException: > scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol$$anon$1 > Serialization stack: > - object not serializable (class: > scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol$$anon$1, value: > package lang) > - field (class: scala.reflect.internal.Types$ThisType, name: sym, type: > class scala.reflect.internal.Symbols$Symbol) > - object (class scala.reflect.internal.Types$UniqueThisType, > java.lang.type) > - field (class: scala.reflect.internal.Types$TypeRef, name: pre, type: > class scala.reflect.internal.Types$Type) > - object (class scala.reflect.internal.Types$ClassNoArgsTypeRef, String) > - field (class: scala.reflect.internal.Types$TypeRef, name: normalized, > type: class 
scala.reflect.internal.Types$Type) > - object (class scala.reflect.internal.Types$AliasNoArgsTypeRef, String) > - field (class: > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$6, name: keyType$1, > type: class scala.reflect.api.Types$TypeApi) > - object (class > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$6, ) > - field (class: org.apache.spark.sql.catalyst.expressions.MapObjects, > name: function, type: interface scala.Function1) > - object (class org.apache.spark.sql.catalyst.expressions.MapObjects, > mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),- > field (class: "scala.collection.immutable.Map", name: "map"),- root class: > "collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType)) > - field (class: org.apache.spark.sql.catalyst.expressions.Invoke, name: > targetObject, type: class > org.apache.spark.sql.catalyst.expressions.Expression) > - object (class org.apache.spark.sql.catalyst.expressions.Invoke, > invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),- > field (class: "scala.collection.immutable.Map", name: "map"),- root class: > "collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType),array,ObjectType(class > [Ljava.lang.Object;))) > - writeObject data (class: > scala.collection.immutable.List$SerializationProxy) > - object (class scala.collection.immutable.List$SerializationProxy, > scala.collection.immutable.List$SerializationProxy@4c7e3aab) > - writeReplace data (class: > scala.collection.immutable.List$SerializationProxy) > - object (class scala.collection.immutable.$colon$colon, > List(invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),- > field (class: "scala.collection.immutable.Map", name: "map"),- root class: > "collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType),array,ObjectType(class > [Ljava.lang.Object;)), > invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),- > field (class: 
"scala.collection.immutable.Map", name: "map"),- root class: > "collector.MyMap"),valueArray,ArrayType(StringType,true)),StringType),array,ObjectType(class > [Ljava.lang.Object; > - field (class: org.apache.spark.sql.catalyst.expressions.StaticInvoke, > name: arguments, type: interface scala.collection.Seq) > - object (class org.apache.spark.sql.catalyst.expressions.StaticInvoke, > staticinvoke(class > org.apache.spark.sql.catalyst.util.ArrayBasedMapData$,ObjectType(interface > scala.collection.Map),toScalaMap,invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),- > field (class: "scala.collection.immutable.Map", name: "map"),- root class: >
[jira] [Updated] (SPARK-12908) Add tests to make sure that ml.classification.LogisticRegression returns meaningful result when labels are the same without intercept
[ https://issues.apache.org/jira/browse/SPARK-12908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai updated SPARK-12908: Description: This will only add new tests, as a follow-up PR to https://github.com/apache/spark/pull/10743 > Add tests to make sure that ml.classification.LogisticRegression returns > meaningful result when labels are the same without intercept > - > > Key: SPARK-12908 > URL: https://issues.apache.org/jira/browse/SPARK-12908 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.6.0 >Reporter: DB Tsai > > This will only add new tests, as a follow-up PR to > https://github.com/apache/spark/pull/10743 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12883) 1.6 Dynamic allocation document for removing executors with cached data differs in different sections
[ https://issues.apache.org/jira/browse/SPARK-12883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107085#comment-15107085 ] Saisai Shao commented on SPARK-12883: - I get your point now, but I think these two descriptions are still both valid: the first paragraph describes the result of removing an executor with cached data, and the second paragraph explains how to work around the problem. Maybe it is just different people understanding it differently. > 1.6 Dynamic allocation document for removing executors with cached data > differs in different sections > - > > Key: SPARK-12883 > URL: https://issues.apache.org/jira/browse/SPARK-12883 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 1.6.0 >Reporter: Manoj Samel >Priority: Trivial > > Spark 1.6 dynamic allocation documentation still refers to 1.2. > See the text "There is currently not yet a solution for this in Spark 1.2. In > future releases, the cached data may be preserved through an off-heap storage > similar in spirit to how shuffle files are preserved through the external > shuffle service" > It appears 1.6 has a parameter to address cached executors, > spark.dynamicAllocation.cachedExecutorIdleTimeout, with a default value of > infinity. > Please update the 1.6 documentation to refer to the latest release and features -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12883) 1.6 Dynamic allocation document for removing executors with cached data differs in different sections
[ https://issues.apache.org/jira/browse/SPARK-12883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107057#comment-15107057 ] Manoj Samel commented on SPARK-12883: - Updated the JIRA subject to reflect the issue more accurately. > 1.6 Dynamic allocation document for removing executors with cached data > differs in different sections > - > > Key: SPARK-12883 > URL: https://issues.apache.org/jira/browse/SPARK-12883 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 1.6.0 >Reporter: Manoj Samel >Priority: Trivial > > Spark 1.6 dynamic allocation documentation still refers to 1.2. > See the text "There is currently not yet a solution for this in Spark 1.2. In > future releases, the cached data may be preserved through an off-heap storage > similar in spirit to how shuffle files are preserved through the external > shuffle service" > It appears 1.6 has a parameter to address cached executors, > spark.dynamicAllocation.cachedExecutorIdleTimeout, with a default value of > infinity. > Please update the 1.6 documentation to refer to the latest release and features -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12826) Spark Workers do not attempt reconnect or exit on connection failure.
[ https://issues.apache.org/jira/browse/SPARK-12826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Braithwaite updated SPARK-12826: - Priority: Critical (was: Major) > Spark Workers do not attempt reconnect or exit on connection failure. > - > > Key: SPARK-12826 > URL: https://issues.apache.org/jira/browse/SPARK-12826 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Alan Braithwaite >Priority: Critical > > Spark version 1.6.0 Hadoop 2.6.0 CDH 5.4.2 > We're running behind a tcp proxy (10.14.12.11:7077 is the tcp proxy listen > address in the example, upstreaming to the spark master listening on 9682 and > a different IP) > To reproduce, I started a spark worker, let it successfully connect to the > master through the proxy, then tcpkill'd the connection on the Worker. > Nothing is logged from the code handling reconnection attempts. > {code} > 16/01/14 18:23:30 INFO Worker: Connecting to master > spark-master.example.com:7077... > 16/01/14 18:23:30 DEBUG TransportClientFactory: Creating new connection to > spark-master.example.com/10.14.12.11:7077 > 16/01/14 18:23:30 DEBUG TransportClientFactory: Connection to > spark-master.example.com/10.14.12.11:7077 successful, running bootstraps... 
> 16/01/14 18:23:30 DEBUG TransportClientFactory: Successfully created > connection to spark-master.example.com/10.14.12.11:7077 after 1 ms (0 ms > spent in bootstraps) > 16/01/14 18:23:30 DEBUG Recycler: -Dio.netty.recycler.maxCapacity.default: > 262144 > 16/01/14 18:23:30 INFO Worker: Successfully registered with master > spark://0.0.0.0:9682 > 16/01/14 18:23:30 INFO Worker: Worker cleanup enabled; old application > directories will be deleted in: /var/lib/spark/work > 16/01/14 18:36:52 DEBUG SecurityManager: user=null aclsEnabled=false > viewAcls=spark > 16/01/14 18:36:52 DEBUG SecurityManager: user=null aclsEnabled=false > viewAcls=spark > 16/01/14 18:36:57 DEBUG SecurityManager: user=null aclsEnabled=false > viewAcls=spark > 16/01/14 18:36:57 DEBUG SecurityManager: user=null aclsEnabled=false > viewAcls=spark > 16/01/14 18:41:31 WARN TransportChannelHandler: Exception in connection from > spark-master.example.com/10.14.12.11:7077 > java.io.IOException: Connection reset by peer > at sun.nio.ch.FileDispatcherImpl.read0(Native Method) > at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) > at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) > at sun.nio.ch.IOUtil.read(IOUtil.java:192) > at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380) > at > io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313) > at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881) > at > io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) > at > 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > at java.lang.Thread.run(Thread.java:745) > -- nothing more is logged, going on 15 minutes -- > $ ag -C5 Disconn > core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala > 313registrationRetryTimer.foreach(_.cancel(true)) > 314registrationRetryTimer = None > 315 } > 316 > 317 private def registerWithMaster() { > 318// onDisconnected may be triggered multiple times, so don't attempt > registration > 319// if there are outstanding registration attempts scheduled. > 320registrationRetryTimer match { > 321 case None => > 322registered = false > 323registerMasterFutures = tryRegisterAllMasters() > -- > 549finishedExecutors.values.toList, drivers.values.toList, > 550finishedDrivers.values.toList, activeMasterUrl, cores, memory, > 551coresUsed, memoryUsed, activeMasterWebUiUrl)) > 552 } > 553 > 554 override def onDisconnected(remoteAddress: RpcAddress): Unit = { > 555if (master.exists(_.address == remoteAddress)) { > 556 logInfo(s"$remoteAddress Disassociated !") > 557 masterDisconnected() > 558} > 559 } > 560 > 561 private def masterDisconnected() { > 562logError("Connection to master failed! Waiting for master to > reconnect...") > 563connected = false > 564registerWithMaster() > 565 } >
[jira] [Commented] (SPARK-12546) Writing to partitioned parquet table can fail with OOM
[ https://issues.apache.org/jira/browse/SPARK-12546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107258#comment-15107258 ] Nong Li commented on SPARK-12546: - A better workaround might be to configure the max number of concurrent output files to 1. This can be done by setting "spark.sql.sources.maxConcurrentWrites=1" > Writing to partitioned parquet table can fail with OOM > -- > > Key: SPARK-12546 > URL: https://issues.apache.org/jira/browse/SPARK-12546 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Nong Li > > It is possible for jobs to fail with OOM when writing to a partitioned > parquet table. While this was probably always possible, it is more likely in > 1.6 due to the memory manager changes. The unified memory manager enables > Spark to use more of the process memory (in particular, for execution), which > gets us into this state more often. This issue can happen for libraries that > consume a lot of memory, such as parquet. Prior to 1.6, these libraries would > more likely use memory that Spark was not using (i.e. from the storage pool). > In 1.6, this storage memory can now be used for execution. > There are a couple of configs that can help with this issue. > - parquet.memory.pool.ratio: This is a parquet config for how much of the > heap the parquet writers should use. It defaults to 0.95. Consider a much > lower value (e.g. 0.1). > - spark.memory.fraction: This is a Spark config to control how much of the > memory should be allocated to Spark. Consider setting this to 0.6. > This should cause jobs to potentially spill more but require less memory. > More aggressive tuning will control this trade-off. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12867) Nullability of Intersect can be stricter
[ https://issues.apache.org/jira/browse/SPARK-12867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-12867. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10812 [https://github.com/apache/spark/pull/10812] > Nullability of Intersect can be stricter > > > Key: SPARK-12867 > URL: https://issues.apache.org/jira/browse/SPARK-12867 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Cheng Lian >Assignee: Xiao Li >Priority: Minor > Fix For: 2.0.0 > > > {{Intersect}} doesn't override {{SetOperation.output}}, which is defined as: > {code} > override def output: Seq[Attribute] = > left.output.zip(right.output).map { case (leftAttr, rightAttr) => > leftAttr.withNullability(leftAttr.nullable || rightAttr.nullable) > } > {code} > However, we can replace the {{||}} with {{&&}} for intersection. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
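The intuition behind the tightened rule can be shown with a toy nullability computation (a sketch, not Catalyst's actual classes): a row survives INTERSECT only if it appears in both children, so an output column can be null only when both sides can produce nulls; for UNION the value may come from either side, which is why `||` is correct there.

```java
// Toy sketch of set-operation output nullability (not Catalyst's real API).
public class SetOpNullability {
    // UNION: a value may come from either child, so the output is
    // nullable if either side is nullable.
    public static boolean unionNullable(boolean leftNullable, boolean rightNullable) {
        return leftNullable || rightNullable;
    }

    // INTERSECT: a value must appear in both children, so the output is
    // nullable only if both sides are nullable (the stricter bound this
    // issue proposes).
    public static boolean intersectNullable(boolean leftNullable, boolean rightNullable) {
        return leftNullable && rightNullable;
    }
}
```

A stricter (but still correct) nullability lets downstream optimizer rules, such as null-check elimination, fire more often.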
[jira] [Created] (SPARK-12909) Spark on Mesos accessing Secured HDFS w/Kerberos
Greg Senia created SPARK-12909: -- Summary: Spark on Mesos accessing Secured HDFS w/Kerberos Key: SPARK-12909 URL: https://issues.apache.org/jira/browse/SPARK-12909 Project: Spark Issue Type: New Feature Components: Mesos Reporter: Greg Senia Ability for Spark on Mesos to use a Kerberized HDFS FileSystem for data. Based on email chains and forum articles, it seems this is not possible today. If that is true, how hard would it be to get this implemented? I'm willing to try to help. https://community.hortonworks.com/questions/5415/spark-on-yarn-vs-mesos.html https://www.mail-archive.com/user@spark.apache.org/msg31326.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12797) Aggregation without grouping keys
[ https://issues.apache.org/jira/browse/SPARK-12797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12797: Assignee: Apache Spark > Aggregation without grouping keys > - > > Key: SPARK-12797 > URL: https://issues.apache.org/jira/browse/SPARK-12797 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12770) Implement rules for branch elimination for CaseWhen in SimplifyConditionals
[ https://issues.apache.org/jira/browse/SPARK-12770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-12770. - Resolution: Fixed Fix Version/s: 2.0.0 > Implement rules for branch elimination for CaseWhen in SimplifyConditionals > --- > > Key: SPARK-12770 > URL: https://issues.apache.org/jira/browse/SPARK-12770 > Project: Spark > Issue Type: Sub-task > Components: Optimizer, SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.0.0 > > > There are a few things we can do: > 1. If the first branch's condition is a true literal, remove the CaseWhen and > use the value from that branch. > 2. If a branch's condition is a false or null literal, remove that branch. > 3. If only the else branch is left, remove the CaseWhen and use the value > from the else branch. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
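The three rules can be sketched as a small standalone simplifier (hypothetical code, not Catalyst's actual SimplifyConditionals; `Cond` and `Branch` are illustrative names): drop branches whose condition is a false or null literal, short-circuit on a leading true literal, and fall back to the else value when no branches remain.

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of CaseWhen branch elimination. A condition is a TRUE, FALSE,
// or NULL_LIT literal, or UNKNOWN (a non-literal that must be kept).
public class SimplifyCaseWhen {
    public enum Cond { TRUE, FALSE, NULL_LIT, UNKNOWN }

    public static class Branch {
        public final Cond cond;
        public final String value;
        public Branch(Cond cond, String value) { this.cond = cond; this.value = value; }
    }

    // Returns the single value the expression reduces to, or null when
    // non-literal branches remain and no full reduction is possible.
    public static String simplify(List<Branch> branches, String elseValue) {
        List<Branch> kept = new ArrayList<>();
        for (Branch b : branches) {
            if (b.cond == Cond.FALSE || b.cond == Cond.NULL_LIT) {
                continue; // rule 2: this branch can never fire; drop it
            }
            if (b.cond == Cond.TRUE && kept.isEmpty()) {
                return b.value; // rule 1: first remaining branch is literally true
            }
            kept.add(b);
        }
        if (kept.isEmpty()) {
            return elseValue; // rule 3: only the else branch is left
        }
        return null; // cannot fully reduce
    }
}
```

Note that rule 1 also fires for a true-literal branch that becomes first after earlier false/null branches are dropped, since the dropped branches could never have matched.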
[jira] [Updated] (SPARK-12168) Need test for conflicted function in R
[ https://issues.apache.org/jira/browse/SPARK-12168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-12168: -- Assignee: Felix Cheung > Need test for conflicted function in R > -- > > Key: SPARK-12168 > URL: https://issues.apache.org/jira/browse/SPARK-12168 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.5.2 >Reporter: Felix Cheung >Assignee: Felix Cheung >Priority: Minor > Fix For: 2.0.0 > > > Currently it is hard to know if a function in base or stats packages are > masked when add new function in SparkR. > Having an automated test would make it easier to track such changes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12168) Need test for conflicted function in R
[ https://issues.apache.org/jira/browse/SPARK-12168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-12168. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10171 [https://github.com/apache/spark/pull/10171] > Need test for conflicted function in R > -- > > Key: SPARK-12168 > URL: https://issues.apache.org/jira/browse/SPARK-12168 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.5.2 >Reporter: Felix Cheung >Priority: Minor > Fix For: 2.0.0 > > > Currently it is hard to know if a function in base or stats packages are > masked when add new function in SparkR. > Having an automated test would make it easier to track such changes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12898) Consider having dummyCallSite for HiveTableScan
[ https://issues.apache.org/jira/browse/SPARK-12898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated SPARK-12898: - Attachment: callsiteProf.png > Consider having dummyCallSite for HiveTableScan > --- > > Key: SPARK-12898 > URL: https://issues.apache.org/jira/browse/SPARK-12898 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Rajesh Balamohan > Attachments: callsiteProf.png > > > Currently, it runs with getCallSite which is really expensive and shows up > when scanning through large table with partitions (e.g TPC-DS). It would be > good to consider having dummyCallSite in HiveTableScan. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12864) initialize executorIdCounter after ApplicationMaster killed for max number of executor failures reached
[ https://issues.apache.org/jira/browse/SPARK-12864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107859#comment-15107859 ] iward commented on SPARK-12864: --- The important point of the idea is to fix the conflicting executor id. As the task log shows, if the shuffle data for the current task is not found, a FetchFailedException is thrown. So I think the mechanism in the AM-restart case is to continue running rather than recompute; if the previously computed data is not found, a FetchFailedException is thrown. I have also run a test showing that it continues to run normally. > initialize executorIdCounter after ApplicationMaster killed for max number > of executor failures reached > > > Key: SPARK-12864 > URL: https://issues.apache.org/jira/browse/SPARK-12864 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.3.1, 1.4.1, 1.5.2 >Reporter: iward > > Currently, when the number of executor failures reaches > *maxNumExecutorFailures*, the *ApplicationMaster* is killed and another one is > re-registered. At that point, a new *YarnAllocator* instance is created, > but the value of the *executorIdCounter* property in *YarnAllocator* resets > to *0*, so the *Id* of new executors starts from 1 again. These ids collide > with executors created before, which causes > FetchFailedException. 
> For example, the following is the task log: > {noformat} > 2015-12-22 02:33:15 INFO 15/12/22 02:33:15 WARN > YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster has > disassociated: 172.22.92.14:45125 > 2015-12-22 02:33:26 INFO 15/12/22 02:33:26 INFO > YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as > AkkaRpcEndpointRef(Actor[akka.tcp://sparkYarnAM@172.22.168.72:54040/user/YarnAM#-1290854604]) > {noformat} > {noformat} > 2015-12-22 02:35:02 INFO 15/12/22 02:35:02 INFO YarnClientSchedulerBackend: > Registered executor: > AkkaRpcEndpointRef(Actor[akka.tcp://sparkexecu...@bjhc-hera-16217.hadoop.jd.local:46538/user/Executor#-790726793]) > with ID 1 > {noformat} > {noformat} > Lost task 3.0 in stage 102.0 (TID 1963, BJHC-HERA-16217.hadoop.jd.local): > FetchFailed(BlockManagerId(1, BJHC-HERA-17030.hadoop.jd.local, 7337 > ), shuffleId=5, mapId=2, reduceId=3, message= > 2015-12-22 02:43:20 INFO org.apache.spark.shuffle.FetchFailedException: > /data3/yarn1/local/usercache/dd_edw/appcache/application_1450438154359_206399/blockmgr-b1fd0363-6d53-4d09-8086-adc4a13f4dc4/0f/shuffl > e_5_2_0.index (No such file or directory) > 2015-12-22 02:43:20 INFO at > org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67) > 2015-12-22 02:43:20 INFO at > org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:84) > 2015-12-22 02:43:20 INFO at > org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:84) > 2015-12-22 02:43:20 INFO at > scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > 2015-12-22 02:43:20 INFO at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > 2015-12-22 02:43:20 INFO at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > 2015-12-22 02:43:20 INFO at > 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > 2015-12-22 02:43:20 INFO at > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > 2015-12-22 02:43:20 INFO at > org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:154) > 2015-12-22 02:43:20 INFO at > org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:149) > 2015-12-22 02:43:20 INFO at > org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:640) > 2015-12-22 02:43:20 INFO at > org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:640) > 2015-12-22 02:43:20 INFO at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > 2015-12-22 02:43:20 INFO at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) > {noformat} > As the task log shows, the executor id of *BJHC-HERA-16217.hadoop.jd.local* is the same as that of *BJHC-HERA-17030.hadoop.jd.local*. This is confusing and causes FetchFailedException.
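The proposed fix amounts to seeding the new allocator's counter from the highest executor id already in use instead of starting at zero. A minimal sketch of the idea (plain Python; `YarnAllocatorModel` is a hypothetical name, not Spark's Scala `YarnAllocator`):

```python
class YarnAllocatorModel:
    """Toy model of an allocator that hands out executor ids."""

    def __init__(self, existing_executor_ids=()):
        # Buggy behavior after an AM restart: the counter always starts
        # at 0, so new ids collide with executors that survived the
        # restart. Fixed behavior, modeled here: initialize the counter
        # from the max id already registered with the cluster manager.
        self.executor_id_counter = max(existing_executor_ids, default=0)

    def next_executor_id(self):
        self.executor_id_counter += 1
        return self.executor_id_counter

# After an AM restart with executors 1..3 still alive:
allocator = YarnAllocatorModel(existing_executor_ids=[1, 2, 3])
new_id = allocator.next_executor_id()  # 4 -- no collision with 1..3
```

With the counter seeded this way, the BlockManagerId registered for a new executor can no longer shadow the shuffle output locations of a pre-restart executor with the same id.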
[jira] [Commented] (SPARK-12669) Organize options for default values
[ https://issues.apache.org/jira/browse/SPARK-12669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107887#comment-15107887 ] Mohit Jaggi commented on SPARK-12669: - hmm... wouldn't it be good to have a typesafe API in addition to this one? It could be a utility layered on top of this API. Maps are a bit hard to use: you don't get auto-completion from IDEs, there are no compile-time checks, etc. > Organize options for default values > --- > > Key: SPARK-12669 > URL: https://issues.apache.org/jira/browse/SPARK-12669 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hossein Falaki > > The CSV data source in Spark SQL should be able to differentiate empty string, > null, NaN, and “N/A” (maybe data-type dependent).
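The trade-off raised in the comment can be sketched as follows (in Python for brevity; `CsvOptions` and its fields are hypothetical, not Spark's API): a string-keyed map accepts any key silently, while a typed options object rejects mistakes at construction time:

```python
from dataclasses import dataclass

# Map-style options: a typo in a key is only discovered (if ever) at read time.
map_options = {"nullValue": "N/A", "nanValue": "NaN", "emtpyValue": ""}  # typo slips through

@dataclass
class CsvOptions:
    """Typed options: unknown field names fail immediately."""
    null_value: str = ""
    nan_value: str = "NaN"
    empty_value: str = ""

typed = CsvOptions(null_value="N/A")  # IDE completion, checked field names

try:
    CsvOptions(emtpy_value="")  # the same typo now fails fast
except TypeError as err:
    print("rejected:", err)
```

A typed layer like this can sit on top of the map-based API, as the comment suggests, without changing the underlying option plumbing.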
[jira] [Created] (SPARK-12913) Reimplement all builtin aggregate functions as declarative function
Davies Liu created SPARK-12913: -- Summary: Reimplement all builtin aggregate functions as declarative function Key: SPARK-12913 URL: https://issues.apache.org/jira/browse/SPARK-12913 Project: Spark Issue Type: Improvement Components: SQL Reporter: Davies Liu As benchmarked and discussed here: https://github.com/apache/spark/pull/10786/files#r50038294. Because it benefits from codegen, a declarative aggregate function can be much faster than an imperative one, so we should re-implement all the builtin aggregate functions as declarative ones.
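A rough illustration of the distinction (in Python; Spark's DeclarativeAggregate and ImperativeAggregate are Scala classes, and the names below are only a model): an imperative aggregate mutates a buffer with arbitrary per-row code, while a declarative one is defined purely by initial/update/merge/evaluate expressions that a codegen layer can inline into a single generated loop:

```python
# Imperative style: opaque per-row update code the engine cannot inspect.
class ImperativeAvg:
    def __init__(self):
        self.total, self.count = 0.0, 0
    def update(self, value):
        self.total += value
        self.count += 1
    def evaluate(self):
        return self.total / self.count

# Declarative style: the aggregate is *data* -- expression functions over a
# buffer tuple (sum, count) that an optimizer could fuse and compile.
declarative_avg = {
    "initial": lambda: (0.0, 0),
    "update": lambda buf, v: (buf[0] + v, buf[1] + 1),
    "merge": lambda a, b: (a[0] + b[0], a[1] + b[1]),
    "evaluate": lambda buf: buf[0] / buf[1],
}

def run_declarative(agg, rows):
    buf = agg["initial"]()
    for v in rows:
        buf = agg["update"](buf, v)
    return agg["evaluate"](buf)

rows = [1.0, 2.0, 3.0, 4.0]
imp = ImperativeAvg()
for v in rows:
    imp.update(v)
assert imp.evaluate() == run_declarative(declarative_avg, rows)  # both 2.5
```

The "merge" expression is what makes partial (map-side) aggregation possible; in the declarative form it, too, is visible to the code generator rather than hidden in a method body.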