[jira] [Commented] (SPARK-13861) TPCDS query 40 returns wrong results compared to TPC official result set
[ https://issues.apache.org/jira/browse/SPARK-13861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15198526#comment-15198526 ] Xiao Li commented on SPARK-13861: - Great job! I am just wondering if only cs_sales_price has a wrong definition? > TPCDS query 40 returns wrong results compared to TPC official result set > - > > Key: SPARK-13861 > URL: https://issues.apache.org/jira/browse/SPARK-13861 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: JESSE CHEN > Labels: tpcds-result-mismatch > > Testing Spark SQL using TPC queries. Query 40 returns wrong results compared > to official result set. This is at 1GB SF (validation run). > SparkSQL missing at least one row (grep for ABBD) ; I believe 5 > rows are missing in total. > Actual results: > {noformat} > [TN,AABD,0.0,-82.060899353] > [TN,AACD,-216.54000234603882,158.0399932861328] > [TN,AAHD,186.54999542236328,0.0] > [TN,AALA,0.0,48.2254223633] > [TN,ACGC,63.67999863624573,0.0] > [TN,ACHC,102.6830517578,51.8838964844] > [TN,ACKC,128.9235150146,44.8169482422] > [TN,ACLD,205.43999433517456,-948.619930267334] > [TN,ACOB,207.32000732421875,24.88389648438] > [TN,ACPD,87.75,53.9900016784668] > [TN,ADGB,44.310001373291016,222.4800033569336] > [TN,ADKB,0.0,-471.8699951171875] > [TN,AEAD,58.2400016784668,0.0] > [TN,AEOC,19.9084741211,214.7076293945] > [TN,AFAC,271.8199977874756,163.1699981689453] > [TN,AFAD,2.349046325684,28.3169482422] > [TN,AFDC,-378.0499496459961,-303.26999282836914] > [TN,AGID,307.6099967956543,-19.29915527344] > [TN,AHDE,80.574468689,-476.7200012207031] > [TN,AHHA,8.27457763672,155.1276565552] > [TN,AHJB,39.23999857902527,0.0] > [TN,AIEC,82.3675750732,3.910858306885] > [TN,AIEE,20.39618530273,-151.08999633789062] > [TN,AIMC,24.46313354492,-150.330517578] > [TN,AJAC,49.0915258789,82.084741211] > [TN,AJCA,121.18000221252441,63.779998779296875] > [TN,AJKB,27.94534057617,8.97267028809] > [TN,ALBE,88.2599983215332,30.22542236328] > 
[TN,ALCE,93.5245776367,92.0198092651] > [TN,ALEC,64.179019165,15.1584741211] > [TN,ALNB,4.19809265137,148.27000427246094] > [TN,AMBE,28.44534057617,0.0] > [TN,AMPB,0.0,131.92999839782715] > [TN,ANFE,0.0,-137.3400115966797] > [TN,AOIB,150.40999603271484,254.288058548] > [TN,APJB,45.2745776367,334.482015991] > [TN,APLA,50.2076293945,29.150001049041748] > [TN,APLD,0.0,32.3838964844] > [TN,BAPD,93.41999816894531,145.8699951171875] > [TN,BBID,296.774577637,30.95084472656] > [TN,BDCE,-1771.0800704956055,-54.779998779296875] > [TN,BDDD,111.12000274658203,280.5899963378906] > [TN,BDJA,0.0,79.5423706055] > [TN,BEFD,0.0,3.429475479126] > [TN,BEOD,269.838964844,297.5800061225891] > [TN,BFMB,110.82999801635742,-941.4000930786133] > [TN,BFNA,47.8661035156,0.0] > [TN,BFOC,46.3415258789,83.5245776367] > [TN,BHPC,27.378392334,77.61999893188477] > [TN,BIDB,196.6199951171875,5.57171661377] > [TN,BIGB,425.3399963378906,0.0] > [TN,BIJB,209.6300048828125,0.0] > [TN,BJFE,7.32923706055,55.1584741211] > [TN,BKFA,0.0,138.14000129699707] > [TN,BKMC,27.17076293945,54.970001220703125] > [TN,BLDE,170.28999400138855,0.0] > [TN,BNHB,58.0594277954,-337.8899841308594] > [TN,BNID,54.41525878906,35.01504089355] > [TN,BNLA,0.0,168.37999629974365] > [TN,BNLD,0.0,96.4084741211] > [TN,BNMC,202.40999698638916,49.52999830245972] > [TN,BOCC,4.73019073486,69.83999633789062] > [TN,BOMB,63.66999816894531,163.49000668525696] > [TN,CAAA,121.91000366210938,0.0] > [TN,CAAD,-1107.6099338531494,0.0] > [TN,CAJC,115.8046594238,173.0519073486] > [TN,CBCD,18.94534057617,226.38000106811523] > [TN,CBFA,0.0,97.41000366210938] > [TN,CBIA,2.14104904175,84.66000366210938] > [TN,CBPB,95.44000244140625,26.6830517578] >
[jira] [Created] (SPARK-13943) The behavior of sum(booleantype) in Spark DataFrames is not intuitive
Wes McKinney created SPARK-13943: Summary: The behavior of sum(booleantype) in Spark DataFrames is not intuitive Key: SPARK-13943 URL: https://issues.apache.org/jira/browse/SPARK-13943 Project: Spark Issue Type: Bug Components: PySpark Reporter: Wes McKinney In NumPy and pandas, summing boolean data produces an integer indicating the number of True values: {code} In [1]: import numpy as np In [2]: arr = np.random.randn(1000000) In [3]: (arr > 0).sum() Out[3]: 499065 {code} In PySpark, {{sql.functions.sum(expr)}} results in an error: {code} AnalysisException: u"cannot resolve 'sum((`data0` > CAST(0 AS DOUBLE)))' due to data type mismatch: function sum requires numeric types, not BooleanType;" {code} FWIW, R is the same: {code} > sum(rnorm(1000000) > 0) [1] 499139 {code} Spark should consider emulating the behavior of R and Python in those environments. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
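The Python-side behavior the report describes can be seen without NumPy: `bool` is a subclass of `int` (`True == 1`, `False == 0`), so `sum()` over booleans counts the `True` values. A minimal plain-Python sketch of that semantics (standing in for the NumPy/pandas behavior quoted above; not PySpark code):

```python
# bool is a subclass of int in Python, so summing booleans
# counts the number of True values.
data = [0.5, -1.2, 3.3, -0.7, 2.1]
mask = [x > 0 for x in data]
count_positive = sum(mask)
print(count_positive)  # 3
```

In PySpark itself, the usual workaround for the reported error is to cast the boolean column to a numeric type before summing, along the lines of `F.sum((df.data0 > 0).cast('integer'))` — hedged, since the exact API surface depends on the Spark version.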
[jira] [Assigned] (SPARK-14011) Enable `LineLength` Java checkstyle rule
[ https://issues.apache.org/jira/browse/SPARK-14011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14011: Assignee: Apache Spark > Enable `LineLength` Java checkstyle rule > > > Key: SPARK-14011 > URL: https://issues.apache.org/jira/browse/SPARK-14011 > Project: Spark > Issue Type: Task > Components: Examples >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Minor > > [Spark Coding Style > Guide|https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide] > has a 100-character limit on lines, but it has been disabled for Java since 11/09/15. > This issue enables the *LineLength* checkstyle rule again. For that, it also > introduces *RedundantImport* and *RedundantModifier*. > {code:title=dev/checkstyle.xml|borderStyle=solid} > - > - > > > > @@ -167,5 +164,7 @@ > > > > + > + > {code}
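The diff body above did not survive the email formatting (only the hunk header remains). For reference, a Checkstyle LineLength rule is typically enabled with a module entry like the following — an illustrative fragment matching the 100-character limit described, not the exact change from this issue:

```xml
<module name="LineLength">
    <!-- Match the Spark style guide's 100-character limit. -->
    <property name="max" value="100"/>
</module>
```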
[jira] [Issue Comment Deleted] (SPARK-13963) Add binary toggle Param to ml.HashingTF
[ https://issues.apache.org/jira/browse/SPARK-13963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-13963: --- Comment: was deleted (was: Sure, assigned to you.) > Add binary toggle Param to ml.HashingTF > --- > > Key: SPARK-13963 > URL: https://issues.apache.org/jira/browse/SPARK-13963 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Nick Pentreath >Assignee: Bryan Cutler >Priority: Trivial > > It would be handy to add a binary toggle Param to {{HashingTF}}, as in the > scikit-learn one: > http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html > If set, then all non-zero counts will be set to 1.
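The binary toggle described here — mapping every non-zero term count to 1 — can be sketched in plain Python as an illustration of the intended semantics (not the ml.HashingTF implementation; the dict-of-counts representation is a stand-in for the feature vector):

```python
def binarize_counts(term_counts):
    """Map every non-zero count to 1, mimicking a 'binary' toggle
    on a term-frequency vector (dict of term -> count)."""
    return {term: 1 for term, count in term_counts.items() if count != 0}

tf = {"spark": 3, "jira": 1, "hadoop": 0}
print(binarize_counts(tf))  # {'spark': 1, 'jira': 1}
```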
[jira] [Commented] (SPARK-13865) TPCDS query 87 returns wrong results compared to TPC official result set
[ https://issues.apache.org/jira/browse/SPARK-13865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15200821#comment-15200821 ] Xiao Li commented on SPARK-13865: - The query I posted here is downloaded from the official website. It is in the zip file: "tpc-ds-tools_v2.1.0.zip". I did not see there is any variant for the query 87 > TPCDS query 87 returns wrong results compared to TPC official result set > - > > Key: SPARK-13865 > URL: https://issues.apache.org/jira/browse/SPARK-13865 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: JESSE CHEN > Labels: tpcds-result-mismatch > > Testing Spark SQL using TPC queries. Query 87 returns wrong results compared > to official result set. This is at 1GB SF (validation run). > SparkSQL returns count of 47555, answer set expects 47298. > Actual results: > {noformat} > [47555] > {noformat} > {noformat} > Expected: > +---+ > | 1 | > +---+ > | 47298 | > +---+ > {noformat} > Query used: > {noformat} > -- start query 87 in stream 0 using template query87.tpl and seed > QUALIFICATION > select count(*) > from > (select distinct c_last_name as cln1, c_first_name as cfn1, d_date as > ddate1, 1 as notnull1 >from store_sales > JOIN date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk > JOIN customer ON store_sales.ss_customer_sk = customer.c_customer_sk >where > d_month_seq between 1200 and 1200+11 >) tmp1 >left outer join > (select distinct c_last_name as cln2, c_first_name as cfn2, d_date as > ddate2, 1 as notnull2 >from catalog_sales > JOIN date_dim ON catalog_sales.cs_sold_date_sk = date_dim.d_date_sk > JOIN customer ON catalog_sales.cs_bill_customer_sk = > customer.c_customer_sk >where > d_month_seq between 1200 and 1200+11 >) tmp2 > on (tmp1.cln1 = tmp2.cln2) > and (tmp1.cfn1 = tmp2.cfn2) > and (tmp1.ddate1= tmp2.ddate2) >left outer join > (select distinct c_last_name as cln3, c_first_name as cfn3 , d_date as > ddate3, 1 as notnull3 >from web_sales > JOIN date_dim 
ON web_sales.ws_sold_date_sk = date_dim.d_date_sk > JOIN customer ON web_sales.ws_bill_customer_sk = > customer.c_customer_sk >where > d_month_seq between 1200 and 1200+11 >) tmp3 > on (tmp1.cln1 = tmp3.cln3) > and (tmp1.cfn1 = tmp3.cfn3) > and (tmp1.ddate1= tmp3.ddate3) > where > notnull2 is null and notnull3 is null > ; > -- end query 87 in stream 0 using template query87.tpl > {noformat}
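The query's intent — count distinct (last name, first name, date) combinations that appear in store sales but in neither catalog nor web sales — amounts to a double anti-join, which the two left outer joins plus `IS NULL` checks implement. A plain-Python set sketch of that logic (illustrative data, not TPC-DS):

```python
# Distinct (last_name, first_name, date) keys per sales channel.
store = {("Chen", "Jesse", "2000-01-05"), ("Li", "Xiao", "2000-02-11"),
         ("Owen", "Sean", "2000-03-01")}
catalog = {("Li", "Xiao", "2000-02-11")}
web = {("Owen", "Sean", "2000-03-01")}

# Query 87 counts store keys matched by neither left outer join,
# i.e. a double anti-join: store - catalog - web.
result = len(store - catalog - web)
print(result)  # 1
```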
[jira] [Assigned] (SPARK-13938) word2phrase feature created in ML
[ https://issues.apache.org/jira/browse/SPARK-13938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13938: Assignee: (was: Apache Spark) > word2phrase feature created in ML > - > > Key: SPARK-13938 > URL: https://issues.apache.org/jira/browse/SPARK-13938 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Steve Weng >Priority: Critical > Original Estimate: 840h > Remaining Estimate: 840h > > I implemented word2phrase (see http://arxiv.org/pdf/1310.4546.pdf) which > transforms a sentence of words into one where certain individual consecutive > words are concatenated by using a training model/estimator (e.g. "I went to > New York" becomes "I went to new_york").
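The transformation described — concatenating consecutive words that form a learned phrase — can be sketched as follows (a toy illustration of the idea from the cited paper, not the proposed ML implementation; in word2phrase the phrase set would be learned from bigram co-occurrence statistics rather than given):

```python
def apply_phrases(tokens, phrases):
    """Greedily merge consecutive token pairs that appear in the
    learned phrase set, joining them with an underscore."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

phrases = {("new", "york")}
print(apply_phrases("i went to new york".split(), phrases))
# ['i', 'went', 'to', 'new_york']
```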
[jira] [Updated] (SPARK-14010) ColumnPruning conflicts with PushPredicateThroughProject
[ https://issues.apache.org/jira/browse/SPARK-14010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-14010: --- Description: ColumnPruning will insert a Project before Filter, but > ColumnPruning conflicts with PushPredicateThroughProject > -- > > Key: SPARK-14010 > URL: https://issues.apache.org/jira/browse/SPARK-14010 > Project: Spark > Issue Type: Bug >Reporter: Davies Liu >Assignee: Davies Liu > > ColumnPruning will insert a Project before Filter, but
[jira] [Updated] (SPARK-13979) Killed executor is respawned without AWS keys in standalone spark cluster
[ https://issues.apache.org/jira/browse/SPARK-13979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen George updated SPARK-13979: - Description: I'm having a problem where respawning a failed executor during a job that reads/writes parquet on S3 causes subsequent tasks to fail because of missing AWS keys. h4. Setup: I'm using Spark 1.5.2 with Hadoop 2.7 and running experiments on a simple standalone cluster: 1 master 2 workers My application is co-located on the master machine, while the two workers are on two other machines (one worker per machine). All machines are running in EC2. I've configured my setup so that my application executes its task on two executors (one executor per worker). h4. Application: My application reads and writes parquet files on S3. I set the AWS keys on the SparkContext by doing: val sc = new SparkContext() val hadoopConf = sc.hadoopConfiguration hadoopConf.set("fs.s3n.awsAccessKeyId", "SOME_KEY") hadoopConf.set("fs.s3n.awsSecretAccessKey", "SOME_SECRET") At this point I'm done, and I go ahead and use "sc". h4. Issue: I can read and write parquet files without a problem with this setup. *BUT* if an executor dies during a job and is respawned by a worker, tasks fail with the following error: "Caused by: java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the {{fs.s3n.awsAccessKeyId}} or {{fs.s3n.awsSecretAccessKey}} properties (respectively)." h4. Basic analysis I think I've traced this down to the following: SparkHadoopUtil is initialized with an empty {{SparkConf}}. Later, classes like {{DataSourceStrategy}} simply call {{SparkHadoopUtil.get.conf}} and access the (now invalid; missing various properties) {{HadoopConfiguration}} that's built from this empty {{SparkConf}} object. 
It's unclear to me why this is done, and it seems that the code as written would cause broken results anytime callers use {{SparkHadoopUtil.get.conf}} directly. was: I'm having a problem where respawning a failed executor during a job that reads/writes parquet on S3 causes subsequent tasks to fail because of missing AWS keys. Setup: I'm using Spark 1.5.2 with Hadoop 2.7 and running experiments on a simple standalone cluster: 1 master 2 workers My application is co-located on the master machine, while the two workers are on two other machines (one worker per machine). All machines are running in EC2. I've configured my setup so that my application executes its task on two executors (one executor per worker). Application: My application reads and writes parquet files on S3. I set the AWS keys on the SparkContext by doing: val sc = new SparkContext() val hadoopConf = sc.hadoopConfiguration hadoopConf.set("fs.s3n.awsAccessKeyId", "SOME_KEY") hadoopConf.set("fs.s3n.awsSecretAccessKey", "SOME_SECRET") At this point I'm done, and I go ahead and use "sc". Issue: I can read and write parquet files without a problem with this setup. *BUT* if an executor dies during a job and is respawned by a worker, tasks fail with the following error: "Caused by: java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the {{fs.s3n.awsAccessKeyId}} or {{fs.s3n.awsSecretAccessKey}} properties (respectively)." I think I've traced this down to the following: SparkHadoopUtil is initialized with an empty {{SparkConf}}. Later, classes like {{DataSourceStrategy}} simply call {{SparkHadoopUtil.get.conf}} and access the (now invalid; missing various properties) {{HadoopConfiguration}} that's built from this empty {{SparkConf}} object. It's unclear to me why this is done, and it seems that the code as written would cause broken results anytime callers use {{SparkHadoopUtil.get.conf}} directly. 
> Killed executor is respawned without AWS keys in standalone spark cluster > - > > Key: SPARK-13979 > URL: https://issues.apache.org/jira/browse/SPARK-13979 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.2 > Environment: I'm using Spark 1.5.2 with Hadoop 2.7 and running > experiments on a simple standalone cluster: > 1 master > 2 workers > All ubuntu 14.04 with Java 8/Scala 2.10 >Reporter: Allen George > > I'm having a problem where respawning a failed executor during a job that > reads/writes parquet on S3 causes subsequent tasks to fail because of missing > AWS keys. > h4. Setup: > I'm using Spark 1.5.2 with Hadoop 2.7 and running experiments on a simple > standalone cluster: > 1 master > 2 workers > My application is co-located on the master machine, while the two workers are > on two other machines
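One common way to make such credentials survive executor restarts — a hedged suggestion, not a fix confirmed in this issue — is to put them in the application's Spark configuration instead of mutating the Hadoop configuration after the context is created: Spark copies any property prefixed with `spark.hadoop.` into the Hadoop configurations it builds, including on freshly launched executors. Using the placeholder values from the report:

```properties
# spark-defaults.conf (or equivalent --conf flags): properties with the
# spark.hadoop. prefix are copied into the Hadoop configuration.
spark.hadoop.fs.s3n.awsAccessKeyId     SOME_KEY
spark.hadoop.fs.s3n.awsSecretAccessKey SOME_SECRET
```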
[jira] [Resolved] (SPARK-13816) Add parameter checks for algorithms in Graphx
[ https://issues.apache.org/jira/browse/SPARK-13816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-13816. - Resolution: Fixed Assignee: zhengruifeng Fix Version/s: 2.0.0 > Add parameter checks for algorithms in Graphx > -- > > Key: SPARK-13816 > URL: https://issues.apache.org/jira/browse/SPARK-13816 > Project: Spark > Issue Type: Improvement > Components: GraphX >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Trivial > Fix For: 2.0.0 > > > Add parameter checks in Graphx-Algorithms: > maxIterations in Pregel > maxSteps in LabelPropagation > numIter,resetProb,tol in PageRank > maxIters,maxVal,minVal in SVDPlusPlus
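The checks being added are preconditions validated before an algorithm runs. In Spark's Scala code this is done with `require(...)`; a plain-Python analogy for the PageRank parameters listed above (illustrative names, not Spark's API):

```python
def check_pagerank_params(num_iter, reset_prob):
    """Validate PageRank-style parameters up front, in the spirit of the
    parameter checks described in this issue."""
    if num_iter <= 0:
        raise ValueError("numIter must be positive, got %d" % num_iter)
    if not 0.0 <= reset_prob <= 1.0:
        raise ValueError("resetProb must be in [0, 1], got %s" % reset_prob)
    return "ok"  # the real algorithm would run here

print(check_pagerank_params(10, 0.15))  # ok
```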
[jira] [Resolved] (SPARK-13901) We get wrong logDebug information when jumping to the next locality level.
[ https://issues.apache.org/jira/browse/SPARK-13901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-13901. --- Resolution: Fixed Fix Version/s: 1.6.2 2.0.0 Issue resolved by pull request 11719 [https://github.com/apache/spark/pull/11719] > We get wrong logDebug information when jumping to the next locality level. > --- > > Key: SPARK-13901 > URL: https://issues.apache.org/jira/browse/SPARK-13901 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 1.6.1 >Reporter: yaoyin >Assignee: yaoyin >Priority: Trivial > Fix For: 2.0.0, 1.6.2 > > > In the getAllowedLocalityLevel method of TaskSetManager, we get wrong logDebug > information when jumping to the next locality level.
[jira] [Commented] (SPARK-14014) Replace existing analysis.Catalog with SessionCatalog
[ https://issues.apache.org/jira/browse/SPARK-14014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15202318#comment-15202318 ] Apache Spark commented on SPARK-14014: -- User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/11836 > Replace existing analysis.Catalog with SessionCatalog > - > > Key: SPARK-14014 > URL: https://issues.apache.org/jira/browse/SPARK-14014 > Project: Spark > Issue Type: Bug >Reporter: Andrew Or >Assignee: Andrew Or > > As of this moment, there exist many catalogs in Spark. For Spark 2.0, we will > have two high level catalogs only: SessionCatalog and ExternalCatalog. > SessionCatalog (implemented in SPARK-13923) keeps track of temporary > functions and tables and delegates other operations to ExternalCatalog. > At the same time, there's this legacy catalog called `analysis.Catalog` that > also tracks temporary functions and tables. The goal is to get rid of this > legacy catalog and replace it with SessionCatalog, which is the new thing.
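The division of responsibility described — SessionCatalog tracking session-local temporary tables and delegating everything else to ExternalCatalog — is a delegation pattern. A minimal Python sketch of the idea (hypothetical class and method names, not Spark's actual API):

```python
class ExternalCatalog:
    """Stands in for the persistent (e.g. Hive-backed) catalog."""
    def __init__(self):
        self._tables = {}
    def create_table(self, name, schema):
        self._tables[name] = schema
    def lookup(self, name):
        return self._tables[name]

class SessionCatalog:
    """Tracks session-local temporary tables; all other lookups are
    delegated to the external catalog."""
    def __init__(self, external):
        self._external = external
        self._temp_tables = {}
    def create_temp_table(self, name, schema):
        self._temp_tables[name] = schema
    def lookup(self, name):
        # Temporary tables shadow persistent ones of the same name.
        if name in self._temp_tables:
            return self._temp_tables[name]
        return self._external.lookup(name)

ext = ExternalCatalog()
ext.create_table("sales", "schema_v1")
cat = SessionCatalog(ext)
cat.create_temp_table("sales", "temp_schema")
print(cat.lookup("sales"))  # temp_schema
```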
[jira] [Updated] (SPARK-13905) Change signature of as.data.frame() to be consistent with the R base package
[ https://issues.apache.org/jira/browse/SPARK-13905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sun Rui updated SPARK-13905: Description: (was: SparkR provides a method as.data.frame() to collect a SparkR DataFrame into a local data.frame. But it conflicts with the methods of the same name in the R base package. For example, {code} code as follows - countData <- matrix(1:100,ncol=4) condition <- factor(c("A","A","B","B")) dds <- DESeqDataSetFromMatrix(countData, DataFrame(condition), ~ condition) This works if I don't initialize the SparkR environment. If I do library(SparkR) and sqlContext <- sparkRSQL.init(sc), it gives the following error > dds <- DESeqDataSetFromMatrix(countData, as.data.frame(condition), ~ > condition) Error in DataFrame(colData, row.names = rownames(colData)) : cannot coerce class "data.frame" to a DataFrame {code} The implementation of as.data.frame() in SparkR can be improved to avoid conflict with those in the R base package.) > Change signature of as.data.frame() to be consistent with the R base package > > > Key: SPARK-13905 > URL: https://issues.apache.org/jira/browse/SPARK-13905 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 1.6.1 >Reporter: Sun Rui >
[jira] [Updated] (SPARK-13964) Feature hashing improvements
[ https://issues.apache.org/jira/browse/SPARK-13964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-13964: --- Priority: Minor (was: Major) > Feature hashing improvements > > > Key: SPARK-13964 > URL: https://issues.apache.org/jira/browse/SPARK-13964 > Project: Spark > Issue Type: Umbrella > Components: ML, MLlib >Reporter: Nick Pentreath >Priority: Minor > > Investigate improvements to Spark ML feature hashing (see e.g. > http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html#sklearn.feature_extraction.FeatureHasher).
[jira] [Updated] (SPARK-13963) Add binary toggle Param to ml.HashingTF
[ https://issues.apache.org/jira/browse/SPARK-13963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-13963: --- Assignee: Bryan Cutler > Add binary toggle Param to ml.HashingTF > --- > > Key: SPARK-13963 > URL: https://issues.apache.org/jira/browse/SPARK-13963 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Nick Pentreath >Assignee: Bryan Cutler >Priority: Trivial > > It would be handy to add a binary toggle Param to {{HashingTF}}, as in the > scikit-learn one: > http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html > If set, then all non-zero counts will be set to 1.
[jira] [Updated] (SPARK-12789) Support order by position in SQL
[ https://issues.apache.org/jira/browse/SPARK-12789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-12789: Description: This is to support order by position in SQL, e.g. {noformat} select c1, c2, c3 from tbl order by 1, 3 {noformat} should be equivalent to {noformat} select c1, c2, c3 from tbl order by c1, c3 {noformat} was: Num in Order by is treated as constant expression at the moment. I guess it would be good to enable user to specify column by index which has been supported in Hive 0.11.0 and later. > Support order by position in SQL > > > Key: SPARK-12789 > URL: https://issues.apache.org/jira/browse/SPARK-12789 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: zhichao-li >Priority: Minor > > This is to support order by position in SQL, e.g. > {noformat} > select c1, c2, c3 from tbl order by 1, 3 > {noformat} > should be equivalent to > {noformat} > select c1, c2, c3 from tbl order by c1, c3 > {noformat}
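Ordinal ORDER BY is already standard in many engines; the equivalence described above can be demonstrated with sqlite3 from the Python standard library (illustrative schema, not Spark SQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (c1 INTEGER, c2 TEXT, c3 INTEGER)")
conn.executemany("INSERT INTO tbl VALUES (?, ?, ?)",
                 [(2, "b", 9), (1, "a", 9), (2, "c", 3)])

# ORDER BY 1, 3 refers to the first and third select-list columns...
by_position = conn.execute(
    "SELECT c1, c2, c3 FROM tbl ORDER BY 1, 3").fetchall()
# ...and is equivalent to naming those columns explicitly.
by_name = conn.execute(
    "SELECT c1, c2, c3 FROM tbl ORDER BY c1, c3").fetchall()
print(by_position == by_name)  # True
```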
[jira] [Updated] (SPARK-14010) ColumnPruning conflicts with PushPredicateThroughProject
[ https://issues.apache.org/jira/browse/SPARK-14010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-14010: --- Description: ColumnPruning will insert a Project before Filter, but PushPredicateThroughProject will move the Filter before Project, making the optimizer unstable. (was: ColumnPruning will insert a Project before Filter, but ) > ColumnPruning conflicts with PushPredicateThroughProject > -- > > Key: SPARK-14010 > URL: https://issues.apache.org/jira/browse/SPARK-14010 > Project: Spark > Issue Type: Bug >Reporter: Davies Liu >Assignee: Davies Liu > > ColumnPruning will insert a Project before Filter, but > PushPredicateThroughProject will move the Filter before Project, making > the optimizer unstable.
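The instability can be pictured as two rewrite rules that undo each other, so rule iteration never reaches a fixed point. A toy Python sketch of such non-confluent rules, where a plan is just a chain of operator names (not Spark's actual rule implementations):

```python
# Toy model: a logical plan is a chain of operators, outermost first.
def column_pruning(plan):
    """Toy rule: insert a pruning Project between a Filter and a Scan."""
    out, i = [], 0
    while i < len(plan):
        if plan[i] == "Filter" and i + 1 < len(plan) and plan[i + 1] == "Scan":
            out += ["Filter", "Project", "Scan"]
            i += 2
        else:
            out.append(plan[i])
            i += 1
    return out

def push_predicate_through_project(plan):
    """Toy rule: move a Filter below an adjacent Project."""
    out, i = [], 0
    while i < len(plan):
        if plan[i] == "Filter" and i + 1 < len(plan) and plan[i + 1] == "Project":
            out += ["Project", "Filter"]
            i += 2
        else:
            out.append(plan[i])
            i += 1
    return out

plan = ["Filter", "Scan"]
for round_no in range(3):
    plan = push_predicate_through_project(column_pruning(plan))
    print(round_no, plan)  # the plan grows every round: no fixed point
```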
[jira] [Created] (SPARK-13976) do not remove sub-queries added by the user when generating SQL
Wenchen Fan created SPARK-13976: --- Summary: do not remove sub-queries added by the user when generating SQL Key: SPARK-13976 URL: https://issues.apache.org/jira/browse/SPARK-13976 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan
[jira] [Assigned] (SPARK-13951) PySpark ml.pipeline support export/import - nested Pipelines
[ https://issues.apache.org/jira/browse/SPARK-13951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13951: Assignee: Apache Spark > PySpark ml.pipeline support export/import - nested Pipelines > --- > > Key: SPARK-13951 > URL: https://issues.apache.org/jira/browse/SPARK-13951 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: Joseph K. Bradley >Assignee: Apache Spark >
[jira] [Commented] (SPARK-13957) Support group by ordinal in SQL
[ https://issues.apache.org/jira/browse/SPARK-13957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15203089#comment-15203089 ] Apache Spark commented on SPARK-13957: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/11846 > Support group by ordinal in SQL > --- > > Key: SPARK-13957 > URL: https://issues.apache.org/jira/browse/SPARK-13957 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin > > This is to support group by position in SQL, e.g. > {noformat} > select c1, c2, c3, sum(*) from tbl group by 1, 3, c4 > {noformat} > should be equivalent to > {noformat} > select c1, c2, c3, sum(*) from tbl group by c1, c3, c4 > {noformat} > We only convert integer literals (not foldable expressions). > For positions that are aggregate functions, an analysis exception should be > thrown, e.g. in postgres: > {noformat} > rxin=# select 'one', 'two', count(*) from r1 group by 1, 3; > ERROR: aggregate functions are not allowed in GROUP BY > LINE 1: select 'one', 'two', count(*) from r1 group by 1, 3; > ^ > {noformat} > This should be controlled by config option spark.sql.groupByOrdinal.
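The analyzer work described here is essentially a substitution: integer literals in GROUP BY are rewritten to the corresponding select-list expressions before resolution, with an error for out-of-range positions. A minimal Python sketch of that substitution (illustrative, not Spark's analyzer code):

```python
def resolve_group_by_ordinals(select_list, group_by):
    """Replace integer ordinals in a GROUP BY list with the
    corresponding select-list expressions (1-based, per SQL)."""
    resolved = []
    for item in group_by:
        if isinstance(item, int):  # only integer literals, not foldable exprs
            if not 1 <= item <= len(select_list):
                raise ValueError("GROUP BY position %d is out of range" % item)
            resolved.append(select_list[item - 1])
        else:
            resolved.append(item)
    return resolved

select_list = ["c1", "c2", "c3"]
print(resolve_group_by_ordinals(select_list, [1, 3, "c4"]))
# ['c1', 'c3', 'c4']
```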
[jira] [Assigned] (SPARK-13957) Support group by ordinal in SQL
[ https://issues.apache.org/jira/browse/SPARK-13957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13957: Assignee: (was: Apache Spark) > Support group by ordinal in SQL > --- > > Key: SPARK-13957 > URL: https://issues.apache.org/jira/browse/SPARK-13957 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin > > This is to support group by position in SQL, e.g. > {noformat} > select c1, c2, c3, sum(*) from tbl group by 1, 3, c4 > {noformat} > should be equivalent to > {noformat} > select c1, c2, c3, sum(*) from tbl group by c1, c3, c4 > {noformat} > We only convert integer literals (not foldable expressions). > For positions that are aggregate functions, an analysis exception should be > thrown, e.g. in postgres: > {noformat} > rxin=# select 'one', 'two', count(*) from r1 group by 1, 3; > ERROR: aggregate functions are not allowed in GROUP BY > LINE 1: select 'one', 'two', count(*) from r1 group by 1, 3; > ^ > {noformat} > This should be controlled by config option spark.sql.groupByOrdinal.
[jira] [Assigned] (SPARK-13957) Support group by ordinal in SQL
[ https://issues.apache.org/jira/browse/SPARK-13957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13957: Assignee: Apache Spark > Support group by ordinal in SQL > --- > > Key: SPARK-13957 > URL: https://issues.apache.org/jira/browse/SPARK-13957 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > > This is to support group by position in SQL, e.g. > {noformat} > select c1, c2, c3, sum(*) from tbl group by 1, 3, c4 > {noformat} > should be equivalent to > {noformat} > select c1, c2, c3, sum(*) from tbl group by c1, c3, c4 > {noformat} > We only convert integer literals (not foldable expressions). > For positions that are aggregate functions, an analysis exception should be > thrown, e.g. in postgres: > {noformat} > rxin=# select 'one', 'two', count(*) from r1 group by 1, 3; > ERROR: aggregate functions are not allowed in GROUP BY > LINE 1: select 'one', 'two', count(*) from r1 group by 1, 3; > ^ > {noformat} > This should be controlled by config option spark.sql.groupByOrdinal.
[jira] [Created] (SPARK-13946) PySpark DataFrames allow you to silently use aggregate expressions derived from different table expressions
Wes McKinney created SPARK-13946: Summary: PySpark DataFrames allow you to silently use aggregate expressions derived from different table expressions Key: SPARK-13946 URL: https://issues.apache.org/jira/browse/SPARK-13946 Project: Spark Issue Type: Bug Components: PySpark Reporter: Wes McKinney In my opinion, this code should raise an exception rather than silently discarding the predicate: {code} import numpy as np import pandas as pd df = pd.DataFrame({'foo': np.random.randn(100), 'bar': np.random.randn(100)}) sdf = sqlContext.createDataFrame(df) sdf2 = sdf[sdf.bar > 0] sdf.agg(F.count(sdf2.foo)).show() +--+ |count(foo)| +--+ | 100| +--+ {code}
[jira] [Updated] (SPARK-13932) CUBE Query with filter (HAVING) and condition (IF) raises an AnalysisException
[ https://issues.apache.org/jira/browse/SPARK-13932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien-Dung LE updated SPARK-13932: - Affects Version/s: 2.0.0 > CUBE Query with filter (HAVING) and condition (IF) raises an AnalysisException > -- > > Key: SPARK-13932 > URL: https://issues.apache.org/jira/browse/SPARK-13932 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 1.6.1, 2.0.0 >Reporter: Tien-Dung LE > > A complex aggregate query using condition in the aggregate function and GROUP > BY HAVING clause raises an exception. This issue only happens in Spark > version 1.6.+ but not in Spark 1.5.+. > Here is a typical error message {code} > org.apache.spark.sql.AnalysisException: Reference 'b' is ambiguous, could be: > b#55, b#124.; line 1 pos 178 > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287) > {code} > Here is a code snippet to re-produce the error in a spark-shell session: > {code} > import sqlContext.implicits._ > case class Toto( a: String = f"${(math.random*1e6).toLong}%06.0f", > b: Int = (math.random*1e3).toInt, > n: Int = (math.random*1e3).toInt, > m: Double = (math.random*1e3)) > val data = sc.parallelize(1 to 1e6.toInt).map(i => Toto()) > val df: org.apache.spark.sql.DataFrame = sqlContext.createDataFrame( data ) > df.registerTempTable( "toto" ) > val sqlSelect1 = "SELECT a, b, COUNT(1) AS k1, COUNT(1) AS k2, SUM(m) AS > k3, GROUPING__ID" > val sqlSelect2 = "SELECT a, b, COUNT(1) AS k1, COUNT(IF(n > 500,1,0)) AS > k2, SUM(m) AS k3, GROUPING__ID" > val sqlGroupBy = "FROM toto GROUP BY a, b GROUPING SETS ((a,b),(a),(b))" > val sqlHaving = "HAVING ((GROUPING__ID & 1) == 1) AND (b > 500)" > sqlContext.sql( s"$sqlSelect1 $sqlGroupBy $sqlHaving" ) // OK > sqlContext.sql( s"$sqlSelect2 $sqlGroupBy" ) // OK > sqlContext.sql( s"$sqlSelect2 $sqlGroupBy $sqlHaving" ) // ERROR > {code} > And here is the full log > {code} > scala> sqlContext.sql( s"$sqlSelect1 $sqlGroupBy 
$sqlHaving" ) > res12: org.apache.spark.sql.DataFrame = [a: string, b: int, k1: bigint, k2: > bigint, k3: double, GROUPING__ID: int] > scala> sqlContext.sql( s"$sqlSelect2 $sqlGroupBy" ) > res13: org.apache.spark.sql.DataFrame = [a: string, b: int, k1: bigint, k2: > bigint, k3: double, GROUPING__ID: int] > scala> sqlContext.sql( s"$sqlSelect2 $sqlGroupBy $sqlHaving" ) // ERROR > org.apache.spark.sql.AnalysisException: Reference 'b' is ambiguous, could be: > b#55, b#124.; line 1 pos 178 > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:171) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4$$anonfun$26.apply(Analyzer.scala:471) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4$$anonfun$26.apply(Analyzer.scala:471) > at > org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:471) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:467) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:316) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:316) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:265) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) > at
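The failing query above hinges on GROUPING__ID, a bitmask recording which grouping columns are aggregated away in each grouping set. A minimal, Spark-independent Python sketch of the semantics (the bit convention has varied across Hive versions, so treat the encoding here as one common choice for illustration, not Spark's exact one):

```python
rows = [("x", 1, 10), ("x", 2, 20), ("y", 1, 30)]  # (a, b, m)
cols = ["a", "b"]
grouping_sets = [("a", "b"), ("a",), ("b",)]

def aggregate(rows, keys):
    # group rows by the chosen key columns and sum m; columns not in
    # this grouping set become NULL (None) in the output key
    out = {}
    for a, b, m in rows:
        key = tuple(v if c in keys else None for c, v in (("a", a), ("b", b)))
        out[key] = out.get(key, 0) + m
    return out

def grouping_id(keys):
    # one common convention: bit i is set when grouping column i
    # is aggregated away (NULL) in this grouping set
    gid = 0
    for i, c in enumerate(cols):
        if c not in keys:
            gid |= 1 << i
    return gid

result = []
for keys in grouping_sets:
    gid = grouping_id(keys)
    for key, total in aggregate(rows, keys).items():
        result.append((*key, total, gid))
# e.g. ("x", None, 30, 2) is the a="x" subtotal with b rolled up
```

A HAVING predicate like `(GROUPING__ID & 1) == 1` then selects only rows where the first grouping column was rolled up, which under this convention is the `("b",)` grouping set.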
[jira] [Commented] (SPARK-13950) Generate code for sort merge left/right outer join
[ https://issues.apache.org/jira/browse/SPARK-13950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15198162#comment-15198162 ] Apache Spark commented on SPARK-13950: -- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/11771 > Generate code for sort merge left/right outer join > -- > > Key: SPARK-13950 > URL: https://issues.apache.org/jira/browse/SPARK-13950 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13982) SparkR - KMeans predict: Output column name of features is an unclear, automatic generated text
[ https://issues.apache.org/jira/browse/SPARK-13982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Narine Kokhlikyan updated SPARK-13982: -- Summary: SparkR - KMeans predict: Output column name of features is an unclear, automatic generated text (was: SparkR - KMeans predict: Output column name of features is an unclear, automatically generated text) > SparkR - KMeans predict: Output column name of features is an unclear, > automatic generated text > --- > > Key: SPARK-13982 > URL: https://issues.apache.org/jira/browse/SPARK-13982 > Project: Spark > Issue Type: Bug > Components: SparkR >Reporter: Narine Kokhlikyan >Priority: Minor > > Currently KMeans predict's features' output column name is set to something > like this: "vecAssembler_522ba59ea239__output", which is the default output > column name of the "VectorAssembler". > Example: > showDF(predict(model, training)) shows something like this: > DataFrame[Sepal_Length:double, Sepal_Width:double, Petal_Length:double, > Petal_Width:double,**vecAssembler_522ba59ea239__output:**vector, > prediction:int] > This name is automatically generated and very unclear from the user's perspective.
[jira] [Assigned] (SPARK-13858) TPCDS query 21 returns wrong results compared to TPC official result set
[ https://issues.apache.org/jira/browse/SPARK-13858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13858: Assignee: Apache Spark > TPCDS query 21 returns wrong results compared to TPC official result set > - > > Key: SPARK-13858 > URL: https://issues.apache.org/jira/browse/SPARK-13858 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: JESSE CHEN >Assignee: Apache Spark > Labels: tpcds-result-mismatch > > Testing Spark SQL using TPC queries. Query 21 returns wrong results compared > to official result set. This is at 1GB SF (validation run). > SparkSQL missing at least one row (grep for ABDA) ; I believe 2 > other rows are missing as well. > Actual results: > {noformat} > [null,AABD,2565,1922] > [null,AAHD,2956,2052] > [null,AALA,2042,1793] > [null,ACGC,2373,1771] > [null,ACKC,2321,1856] > [null,ACOB,1504,1397] > [null,ADKB,1820,2163] > [null,AEAD,2631,1965] > [null,AEOC,1659,1798] > [null,AFAC,1965,1705] > [null,AFAD,1769,1313] > [null,AHDE,2700,1985] > [null,AHHA,1578,1082] > [null,AIEC,1756,1804] > [null,AIMC,3603,2951] > [null,AJAC,2109,1989] > [null,AJKB,2573,3540] > [null,ALBE,3458,2992] > [null,ALCE,1720,1810] > [null,ALEC,2569,1946] > [null,ALNB,2552,1750] > [null,ANFE,2022,2269] > [null,AOIB,2982,2540] > [null,APJB,2344,2593] > [null,BAPD,2182,2787] > [null,BDCE,2844,2069] > [null,BDDD,2417,2537] > [null,BDJA,1584,1666] > [null,BEOD,2141,2649] > [null,BFCC,2745,2020] > [null,BFMB,1642,1364] > [null,BHPC,1923,1780] > [null,BIDB,1956,2836] > [null,BIGB,2023,2344] > [null,BIJB,1977,2728] > [null,BJFE,1891,2390] > [null,BLDE,1983,1797] > [null,BNID,2485,2324] > [null,BNLD,2385,2786] > [null,BOMB,2291,2092] > [null,CAAA,2233,2560] > [null,CBCD,1540,2012] > [null,CBIA,2394,2122] > [null,CBPB,1790,1661] > [null,CCMD,2654,2691] > [null,CDBC,1804,2072] > [null,CFEA,1941,1567] > [null,CGFD,2123,2265] > [null,CHPC,2933,2174] > [null,CIGD,2618,2399] > [null,CJCB,2728,2367] > 
[null,CJLA,1350,1732] > [null,CLAE,2578,2329] > [null,CLGA,1842,1588] > [null,CLLB,3418,2657] > [null,CLOB,3115,2560] > [null,CMAD,1991,2243] > [null,CMJA,1261,1855] > [null,CMLA,3288,2753] > [null,CMPD,1320,1676] > [null,CNGB,2340,2118] > [null,CNHD,3519,3348] > [null,CNPC,2561,1948] > [null,DCPC,2664,2627] > [null,DDHA,1313,1926] > [null,DDND,1109,835] > [null,DEAA,2141,1847] > [null,DEJA,3142,2723] > [null,DFKB,1470,1650] > [null,DGCC,2113,2331] > [null,DGFC,2201,2928] > [null,DHPA,2467,2133] > [null,DMBA,3085,2087] > [null,DPAB,3494,3081] > [null,EAEC,2133,2148] > [null,EAPA,1560,1275] > [null,ECGC,2815,3307] > [null,EDPD,2731,1883] > [null,EEEC,2024,1902] > [null,EEMC,2624,2387] > [null,EFFA,2047,1878] > [null,EGJA,2403,2633] > [null,EGMA,2784,2772] > [null,EGOC,2389,1753] > [null,EHFD,1940,1420] > [null,EHLB,2320,2057] > [null,EHPA,1898,1853] > [null,EIPB,2930,2326] > [null,EJAE,2582,1836] > [null,EJIB,2257,1681] > [null,EJJA,2791,1941] > [null,EJJD,3410,2405] > [null,EJNC,2472,2067] > [null,EJPD,1219,1229] > [null,EKEB,2047,1713] > [null,EMEA,2502,1897] > [null,EMKC,2362,2042] > [null,ENAC,2011,1909] > [null,ENFB,2507,2162] > [null,ENOD,3371,2709] > {noformat} > Expected results: > {noformat} > +--+--++---+ > | W_WAREHOUSE_NAME | I_ITEM_ID| INV_BEFORE | INV_AFTER | > +--+--++---+ > | Bad cards must make. | AACD | 1889 | 2168 | > | Bad cards must make. | AAHD | 2739 |
[jira] [Updated] (SPARK-7992) Hide private classes/objects in generated Java API doc
[ https://issues.apache.org/jira/browse/SPARK-7992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-7992: - Assignee: (was: Xiangrui Meng) > Hide private classes/objects in generated Java API doc > - > > Key: SPARK-7992 > URL: https://issues.apache.org/jira/browse/SPARK-7992 > Project: Spark > Issue Type: Improvement > Components: Build, Documentation >Affects Versions: 1.4.0 >Reporter: Xiangrui Meng > > After SPARK-5610, we found that private classes/objects still show up in the > generated Java API doc, e.g., under `org.apache.spark.api.r` we can see > {code} > BaseRRDD > PairwiseRRDD > RRDD > SpecialLengths > StringRRDD > {code} > We should update genjavadoc to hide those private classes/methods. The best > approach is to find a good mapping from Scala private to Java, and merge it > into the main genjavadoc repo. A WIP PR is at > https://github.com/typesafehub/genjavadoc/pull/47.
[jira] [Updated] (SPARK-13038) PySpark ml.pipeline support export/import - non-nested Pipelines
[ https://issues.apache.org/jira/browse/SPARK-13038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-13038: -- Summary: PySpark ml.pipeline support export/import - non-nested Pipelines (was: PySpark ml.pipeline support export/import) > PySpark ml.pipeline support export/import - non-nested Pipelines > > > Key: SPARK-13038 > URL: https://issues.apache.org/jira/browse/SPARK-13038 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: Yanbo Liang >Assignee: Xusen Yin >Priority: Minor > > Add export/import for all estimators and transformers (which have a Scala > implementation) under pyspark/ml/pipeline.py. Please refer to the implementation > in SPARK-13032.
[jira] [Created] (SPARK-14009) Fail the tests if any catalyst rule reaches the max number of iterations
Davies Liu created SPARK-14009: -- Summary: Fail the tests if any catalyst rule reaches the max number of iterations Key: SPARK-14009 URL: https://issues.apache.org/jira/browse/SPARK-14009 Project: Spark Issue Type: Improvement Components: SQL Reporter: Davies Liu Assignee: Davies Liu Recently some catalyst rules have become unstable (they conflict with each other) or keep adding stuff into the query plan; we should detect this early by failing the tests if any rule reaches the max number of iterations (200).
[jira] [Resolved] (SPARK-13776) Web UI is not available after ./sbin/start-master.sh
[ https://issues.apache.org/jira/browse/SPARK-13776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-13776. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11615 [https://github.com/apache/spark/pull/11615] > Web UI is not available after ./sbin/start-master.sh > > > Key: SPARK-13776 > URL: https://issues.apache.org/jira/browse/SPARK-13776 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.6.0 > Environment: Solaris 11.3, Oracle SPARC T-5 8 with 1024 hardware > threads >Reporter: Erik O'Shaughnessy >Priority: Minor > Fix For: 2.0.0 > > > The Apache Spark Web UI fails to become available after starting a Spark > master in stand-alone mode: > $ ./sbin/start-master.sh > The log file contains the following: > {quote} > cat spark-hadoop-org.apache.spark.deploy.master.Master-1-t5-8-002.out > Spark Command: /usr/java/bin/java -cp > /usr/local/spark-1.6.0_nohadoop/conf/:/usr/local/spark-1.6.0_nohadoop/assembly/target/scala-2.10/spark-assembly-1.6.0-hadoop2.2.0.jar:/usr/local/spark-1.6.0_nohadoop/lib_managed/jars/datanucleus-api-jdo-3.2.6.jar:/usr/local/spark-1.6.0_nohadoop/lib_managed/jars/datanucleus-rdbms-3.2.9.jar:/usr/local/spark-1.6.0_nohadoop/lib_managed/jars/datanucleus-core-3.2.10.jar > -Xms1g -Xmx1g org.apache.spark.deploy.master.Master --ip t5-8-002 --port > 7077 --webui-port 8080 > > 16/01/27 12:00:42 WARN AbstractConnector: insufficient threads configured for > SelectChannelConnector@0.0.0.0:8080 > 16/01/27 12:00:42 WARN AbstractConnector: insufficient threads configured for > SelectChannelConnector@t5-8-002:6066 > {quote} > I did some poking around and it seems that message is coming from Jetty and > indicates a mismatch between Jetty's default maxThreads configuration and the > actual number of CPUs available on the hardware (1024). I was not able to > find a way to successfully change Jetty's configuration at run-time. 
> Our workaround was to disable CPUs until the WARN messages did not occur in > the log file, which was when NCPUs = 504. > I don't know for certain that this isn't a known problem in Jetty from > looking at their bug reports, but I wasn't able to locate a Jetty issue that > described this problem. > While not specifically an Apache Spark problem, I thought documenting it > would at least be helpful.
[jira] [Commented] (SPARK-13461) Duplicated example code merge and cleanup
[ https://issues.apache.org/jira/browse/SPARK-13461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15203076#comment-15203076 ] Xusen Yin commented on SPARK-13461: --- Yes we'll delete it. > Duplicated example code merge and cleanup > - > > Key: SPARK-13461 > URL: https://issues.apache.org/jira/browse/SPARK-13461 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin >Priority: Minor > Labels: starter > > Merge duplicated code after we finish the example code substitution. > Duplications include: > * JavaTrainValidationSplitExample > * TrainValidationSplitExample > * Random data generation in mllib-statistics.md needs to remove "-" in each > line. > * Others can be added here ...
[jira] [Commented] (SPARK-13937) PySpark ML JavaWrapper, variable _java_obj should not be static
[ https://issues.apache.org/jira/browse/SPARK-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15197902#comment-15197902 ] Apache Spark commented on SPARK-13937: -- User 'BryanCutler' has created a pull request for this issue: https://github.com/apache/spark/pull/11767 > PySpark ML JavaWrapper, variable _java_obj should not be static > --- > > Key: SPARK-13937 > URL: https://issues.apache.org/jira/browse/SPARK-13937 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Reporter: Bryan Cutler >Priority: Minor > > In PySpark ML wrapper.py, the abstract class {{JavaWrapper}} has a static > variable {{_java_obj}}. This is meant to hold an instance of a companion > Java object. It seems as though it was made static accidentally because it > is never used, and all assignments done in derived classes are done to a > member variable with {{self._java_obj}}. This does not cause any problems > with the current functionality, but it should be changed so as not to cause > any confusion and misuse in the future.
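The pitfall described in the report — a class-level attribute that derived classes only ever shadow through `self` — is easy to demonstrate in plain Python (the class names below are illustrative stand-ins, not Spark's actual code):

```python
class JavaWrapper:
    _java_obj = None  # class-level attribute, shared by all instances

class Model(JavaWrapper):
    def __init__(self, obj):
        # assignment through self creates an *instance* attribute that
        # shadows the class attribute rather than replacing it
        self._java_obj = obj

m = Model("java-handle")
assert m._java_obj == "java-handle"    # reads the instance attribute
assert JavaWrapper._java_obj is None   # class attribute is untouched
assert Model._java_obj is None         # still None on the subclass too
```

Because the class attribute is never read once every subclass shadows it, keeping it at class level serves no purpose and only invites confusion, which is the report's point.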
[jira] [Closed] (SPARK-13821) TPC-DS Query 20 fails to compile
[ https://issues.apache.org/jira/browse/SPARK-13821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Roy Cecil closed SPARK-13821. - Resolution: Not A Problem > TPC-DS Query 20 fails to compile > > > Key: SPARK-13821 > URL: https://issues.apache.org/jira/browse/SPARK-13821 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 > Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo) > Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 > 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux >Reporter: Roy Cecil > > TPC-DS Query 20 Fails to compile with the follwing Error Message > {noformat} > Parsing error: NoViableAltException(10@[127:1: selectItem : ( ( > tableAllColumns )=> tableAllColumns -> ^( TOK_SELEXPR tableAllColumns ) | ( > expression ( ( ( KW_AS )? identifier ) | ( KW_AS LPAREN identifier ( COMMA > identifier )* RPAREN ) )? ) -> ^( TOK_SELEXPR expression ( identifier )* ) > );]) > at > org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser$DFA17.specialStateTransition(HiveParser_SelectClauseParser.java:11835) > at org.antlr.runtime.DFA.predict(DFA.java:80) > at > org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectItem(HiveParser_SelectClauseParser.java:2853) > at > org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectList(HiveParser_SelectClauseParser.java:1401) > at > org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:1128) > Parsing error: NoViableAltException(10@[127:1: selectItem : ( ( > tableAllColumns )=> tableAllColumns -> ^( TOK_SELEXPR tableAllColumns ) | ( > expression ( ( ( KW_AS )? identifier ) | ( KW_AS LPAREN identifier ( COMMA > identifier )* RPAREN ) )? 
) -> ^( TOK_SELEXPR expression ( identifier )* ) > );]) > at > org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser$DFA17.specialStateTransition(HiveParser_SelectClauseParser.java:11835) > at org.antlr.runtime.DFA.predict(DFA.java:80) > at > org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectItem(HiveParser_SelectClauseParser.java:2853) > at > org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectList(HiveParser_SelectClauseParser.java:1401) > at > org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:1128) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13761) Deprecate validateParams
[ https://issues.apache.org/jira/browse/SPARK-13761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15200097#comment-15200097 ] Apache Spark commented on SPARK-13761: -- User 'jkbradley' has created a pull request for this issue: https://github.com/apache/spark/pull/11790 > Deprecate validateParams > > > Key: SPARK-13761 > URL: https://issues.apache.org/jira/browse/SPARK-13761 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Assignee: yuhao yang >Priority: Minor > Fix For: 2.0.0 > > > Deprecate validateParams() method here: > [https://github.com/apache/spark/blob/035d3acdf3c1be5b309a861d5c5beb803b946b5e/mllib/src/main/scala/org/apache/spark/ml/param/params.scala#L553] > Move all functionality in overridden methods to transformSchema(). > Check docs to make sure they indicate complex Param interaction checks should > be done in transformSchema.
[jira] [Comment Edited] (SPARK-13935) Other clients' connections hang up when someone does a huge load
[ https://issues.apache.org/jira/browse/SPARK-13935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15197554#comment-15197554 ] Tao Wang edited comment on SPARK-13935 at 3/16/16 3:51 PM: --- [~marmbrus] [~liancheng] [~chenghao] was (Author: wangtao): [~marmbrus][~liancheng][~chenghao] > Other clients' connections hang up when someone does a huge load > --- > > Key: SPARK-13935 > URL: https://issues.apache.org/jira/browse/SPARK-13935 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.2, 1.6.0, 1.6.1 >Reporter: Tao Wang >Priority: Critical > > We run a SQL statement like "insert overwrite table store_returns partition > (sr_returned_date) select xx" using beeline; it then blocks other > beeline connections while invoking the Hive method via > "ClientWrapper.loadDynamicPartitions". > The reason is that "withHiveState" will lock "clientLoader". Sadly, when a new > client comes, it will invoke "setConf" methods which are also synchronized on > "clientLoader". > So the problem is that if the first SQL statement takes a very long time to run, > no other client can connect to the thrift server successfully. > We tested on release 1.5.1; not sure if the latest release has the same issue.
[jira] [Updated] (SPARK-13983) HiveThriftServer2 cannot get "--hiveconf" or "--hivevar" variables since version 1.6 (both multi-session and single session)
[ https://issues.apache.org/jira/browse/SPARK-13983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-13983: --- Assignee: Cheng Lian > HiveThriftServer2 can not get "--hiveconf" or ''--hivevar" variables since > 1.6 version (both multi-session and single session) > -- > > Key: SPARK-13983 > URL: https://issues.apache.org/jira/browse/SPARK-13983 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 1.6.1 > Environment: ubuntu, spark 1.6.0 standalone, spark 1.6.1 standalone > (tried spark branch-1.6 snapshot as well) > compiled with scala 2.10.5 and hadoop 2.6 > (-Phadoop-2.6 -Psparkr -Phive -Phive-thriftserver) >Reporter: Teng Qiu >Assignee: Cheng Lian > > HiveThriftServer2 should be able to get "\--hiveconf" or ''\-\-hivevar" > variables from JDBC client, either from command line parameter of beeline, > such as > {{beeline --hiveconf spark.sql.shuffle.partitions=3 --hivevar > db_name=default}} > or from JDBC connection string, like > {{jdbc:hive2://localhost:1?spark.sql.shuffle.partitions=3#db_name=default}} > this worked in spark version 1.5.x, but after upgraded to 1.6, it doesn't > work. 
> to reproduce this issue, try to connect to HiveThriftServer2 with beeline: > {code} > bin/beeline -u jdbc:hive2://localhost:1 \ > --hiveconf spark.sql.shuffle.partitions=3 \ > --hivevar db_name=default > {code} > or > {code} > bin/beeline -u > jdbc:hive2://localhost:1?spark.sql.shuffle.partitions=3#db_name=default > {code} > will get following results: > {code} > 0: jdbc:hive2://localhost:1> set spark.sql.shuffle.partitions; > +---++--+ > | key | value | > +---++--+ > | spark.sql.shuffle.partitions | 200| > +---++--+ > 1 row selected (0.192 seconds) > 0: jdbc:hive2://localhost:1> use ${db_name}; > Error: org.apache.spark.sql.AnalysisException: cannot recognize input near > '$' '{' 'db_name' in switch database statement; line 1 pos 4 (state=,code=0) > {code} > - > but this bug does not affect current versions of spark-sql CLI, following > commands works: > {code} > bin/spark-sql --master local[2] \ > --hiveconf spark.sql.shuffle.partitions=3 \ > --hivevar db_name=default > spark-sql> set spark.sql.shuffle.partitions > spark.sql.shuffle.partitions 3 > Time taken: 1.037 seconds, Fetched 1 row(s) > spark-sql> use ${db_name}; > OK > Time taken: 1.697 seconds > {code} > so I think it may caused by this change: > https://github.com/apache/spark/pull/8909 ( [SPARK-10810] [SPARK-10902] [SQL] > Improve session management in SQL ) > perhaps by calling {{hiveContext.newSession}}, the variables from > {{sessionConf}} were not loaded into the new session? > (https://github.com/apache/spark/pull/8909/files#diff-8f8b7f4172e8a07ff20a4dbbbcc57b1dR69) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
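The two parameter channels in the connection string follow the HiveServer2 URL convention shown in the report: the `?` segment carries hiveconf settings and the `#` fragment carries hivevar variables. A rough sketch of how such a URL splits into the two maps (a simplification for illustration; the real driver also handles `;`-separated session variables, databases, and escaping, and the host/port below are hypothetical):

```python
def parse_hive2_url(url):
    # split off the '#hivevar' fragment first, then the '?hiveconf' query
    base, _, frag = url.partition("#")
    base, _, query = base.partition("?")

    def kv(s):
        # "k1=v1&k2=v2" -> {"k1": "v1", "k2": "v2"}
        return dict(p.split("=", 1) for p in s.split("&") if p)

    return {"hiveconf": kv(query), "hivevar": kv(frag)}

conf = parse_hive2_url(
    "jdbc:hive2://localhost:10000?spark.sql.shuffle.partitions=3#db_name=default")
assert conf["hiveconf"] == {"spark.sql.shuffle.partitions": "3"}
assert conf["hivevar"] == {"db_name": "default"}
```

The bug report is that these parsed values stopped being propagated into the new session's configuration after the session-management change, not that the URL fails to parse.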
[jira] [Updated] (SPARK-12789) Support order by position in SQL
[ https://issues.apache.org/jira/browse/SPARK-12789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-12789: Summary: Support order by position in SQL (was: Support order by position) > Support order by position in SQL > > > Key: SPARK-12789 > URL: https://issues.apache.org/jira/browse/SPARK-12789 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: zhichao-li >Priority: Minor > > A number in ORDER BY is treated as a constant expression at the moment. It > would be good to let users specify columns by index, which has been > supported in Hive 0.11.0 and later.
[jira] [Commented] (SPARK-13960) JAR/File HTTP Server doesn't respect "spark.driver.host" and there is no "spark.fileserver.host" option
[ https://issues.apache.org/jira/browse/SPARK-13960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15200690#comment-15200690 ] Ilya Ostrovskiy commented on SPARK-13960: - exporting the SPARK_LOCAL_IP environment variable appears to workaround this issue, however, setting SPARK_PUBLIC_DNS does not work, despite the documentation stating that the latter is "[the h]ostname your Spark program will advertise to other machines." > JAR/File HTTP Server doesn't respect "spark.driver.host" and there is no > "spark.fileserver.host" option > --- > > Key: SPARK-13960 > URL: https://issues.apache.org/jira/browse/SPARK-13960 > Project: Spark > Issue Type: Bug > Components: Spark Core, Spark Submit >Affects Versions: 1.6.1 > Environment: Any system with more than one IP address >Reporter: Ilya Ostrovskiy > > There is no option to specify which hostname/IP address the jar/file server > listens on, and rather than using "spark.driver.host" if specified, the > jar/file server will listen on the system's primary IP address. This is an > issue when submitting an application in client mode on a machine with two > NICs connected to two different networks. > Steps to reproduce: > 1) Have a cluster in a remote network, whose master is on 192.168.255.10 > 2) Have a machine at another location, with a "primary" IP address of > 192.168.1.2, connected to the "remote network" as well, with the IP address > 192.168.255.250. Let's call this the "client machine". > 3) Ensure every machine in the spark cluster at the remote location can ping > 192.168.255.250 and reach the client machine via that address. 
> 4) On the client: > {noformat} > spark-submit --deploy-mode client --conf "spark.driver.host=192.168.255.250" > --master spark://192.168.255.10:7077 --class > > {noformat} > 5) Navigate to http://192.168.255.250:4040/ and ensure that executors from > the remote cluster have found the driver on the client machine > 6) Navigate to http://192.168.255.250:4040/environment/, and scroll to the > bottom > 7) Observe that the JAR you specified in Step 4 will be listed under > http://192.168.1.2:/jars/.jar > 8) Enjoy this stack trace periodically appearing on the client machine when > the nodes in the remote cluster cant connect to 192.168.1.2 to get your JAR > {noformat} > 16/03/17 03:25:55 WARN TaskSetManager: Lost task 1.2 in stage 0.0 (TID 5, > 192.168.255.11): java.net.SocketTimeoutException: connect timed out > at java.net.PlainSocketImpl.socketConnect(Native Method) > at > java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) > at > java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) > at > java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) > at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) > at java.net.Socket.connect(Socket.java:589) > at sun.net.NetworkClient.doConnect(NetworkClient.java:175) > at sun.net.www.http.HttpClient.openServer(HttpClient.java:432) > at sun.net.www.http.HttpClient.openServer(HttpClient.java:527) > at sun.net.www.http.HttpClient.(HttpClient.java:211) > at sun.net.www.http.HttpClient.New(HttpClient.java:308) > at sun.net.www.http.HttpClient.New(HttpClient.java:326) > at > sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169) > at > sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105) > at > sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999) > at > sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933) > at 
org.apache.spark.util.Utils$.doFetchFile(Utils.scala:588) > at org.apache.spark.util.Utils$.fetchFile(Utils.scala:381) > at > org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:405) > at > org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:397) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) > at > scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226) > at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39) > at scala.collection.mutable.HashMap.foreach(HashMap.scala:98) > at >
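The root cause is which local address the embedded file server binds to: the system's primary interface rather than the configured spark.driver.host. The reachable address is simply the bind address handed to the listening socket, as this small Python sketch (unrelated to Spark's actual server code) shows:

```python
import socket

def bind_server(host):
    # binding to a specific address makes the server reachable only
    # through that interface; binding to "" (i.e. 0.0.0.0) would
    # listen on all interfaces instead
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind((host, 0))  # port 0: let the OS pick a free port
    s.listen(1)
    return s

srv = bind_server("127.0.0.1")
addr, port = srv.getsockname()
assert addr == "127.0.0.1"  # advertised address matches the bind address
srv.close()
```

The request here is for an option analogous to spark.driver.host that controls this bind (and advertised) address for the jar/file server, so multi-homed client machines can pick the interface the cluster can actually reach.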
[jira] [Updated] (SPARK-12789) Support order by position in SQL
[ https://issues.apache.org/jira/browse/SPARK-12789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-12789: Description: This is to support order by position in SQL, e.g. {noformat} select c1, c2, c3 from tbl order by 1 desc, 3 {noformat} should be equivalent to {noformat} select c1, c2, c3 from tbl order by c1 desc, c3 {noformat} We only convert integer literals (not foldable expressions). We should make sure this also works with select *. was: This is to support order by position in SQL, e.g. {noformat} select c1, c2, c3 from tbl order by 1 desc, 3 {noformat} should be equivalent to {noformat} select c1, c2, c3 from tbl order by c1 desc, c3 {noformat} We should make sure this also works with select *. > Support order by position in SQL > > > Key: SPARK-12789 > URL: https://issues.apache.org/jira/browse/SPARK-12789 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: zhichao-li >Priority: Minor > > This is to support order by position in SQL, e.g. > {noformat} > select c1, c2, c3 from tbl order by 1 desc, 3 > {noformat} > should be equivalent to > {noformat} > select c1, c2, c3 from tbl order by c1 desc, c3 > {noformat} > We only convert integer literals (not foldable expressions). > We should make sure this also works with select *.
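Resolving ORDER BY ordinals amounts to rewriting integer literals into the corresponding SELECT-list columns before analysis. A sketch of that mapping (illustrative only, not Spark's analyzer rule):

```python
def resolve_order_by(select_cols, order_items):
    # order_items: list of (item, is_desc); replace 1-based integer
    # positions with column names, pass other expressions through
    out = []
    for item, desc in order_items:
        col = select_cols[item - 1] if isinstance(item, int) else item
        out.append((col, desc))
    return out

cols = ["c1", "c2", "c3"]
# ORDER BY 1 DESC, 3  ->  ORDER BY c1 DESC, c3
assert resolve_order_by(cols, [(1, True), (3, False)]) == [
    ("c1", True), ("c3", False)]
```

The description's restriction to integer literals (not foldable expressions) matters because an expression like `1+0` folds to 1 but should still be treated as a constant, not a position.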
[jira] [Created] (SPARK-13961) spark.ml ChiSqSelector should support other numeric types for label
Nick Pentreath created SPARK-13961: -- Summary: spark.ml ChiSqSelector should support other numeric types for label Key: SPARK-13961 URL: https://issues.apache.org/jira/browse/SPARK-13961 Project: Spark Issue Type: Sub-task Components: ML Reporter: Nick Pentreath Assignee: Benjamin Fradet Priority: Minor
[jira] [Commented] (SPARK-13928) Move org.apache.spark.Logging into org.apache.spark.internal.Logging
[ https://issues.apache.org/jira/browse/SPARK-13928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15197584#comment-15197584 ] Apache Spark commented on SPARK-13928: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/11764 > Move org.apache.spark.Logging into org.apache.spark.internal.Logging > > > Key: SPARK-13928 > URL: https://issues.apache.org/jira/browse/SPARK-13928 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Reynold Xin > > Logging was made private in Spark 2.0. If we move it, then users would be > able to create a Logging trait themselves to avoid changing their own code. > Alternatively, we can also provide in a compatibility package that adds > logging. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-13821) TPC-DS Query 20 fails to compile
[ https://issues.apache.org/jira/browse/SPARK-13821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15201506#comment-15201506 ] Roy Cecil edited comment on SPARK-13821 at 3/18/16 2:09 PM: Dilip, Removed the extra comma from the query and it compiles. Since I am really comparing standard SQL, I just want to ensure that this is not a violation of the ANSI standard. Let me explore a little bit more. was (Author: roycecil): Dilip, Removed the query and it compiles. Since I am really comparing standard SQL, I just want to ensure that this is not a violation of the ANSI standard. Let me explore a little bit more. > TPC-DS Query 20 fails to compile > > > Key: SPARK-13821 > URL: https://issues.apache.org/jira/browse/SPARK-13821 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 > Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo) > Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 > 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux >Reporter: Roy Cecil > > TPC-DS Query 20 fails to compile with the following error message: > {noformat} > Parsing error: NoViableAltException(10@[127:1: selectItem : ( ( > tableAllColumns )=> tableAllColumns -> ^( TOK_SELEXPR tableAllColumns ) | ( > expression ( ( ( KW_AS )? identifier ) | ( KW_AS LPAREN identifier ( COMMA > identifier )* RPAREN ) )? 
) -> ^( TOK_SELEXPR expression ( identifier )* ) > );]) > at > org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser$DFA17.specialStateTransition(HiveParser_SelectClauseParser.java:11835) > at org.antlr.runtime.DFA.predict(DFA.java:80) > at > org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectItem(HiveParser_SelectClauseParser.java:2853) > at > org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectList(HiveParser_SelectClauseParser.java:1401) > at > org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:1128) > Parsing error: NoViableAltException(10@[127:1: selectItem : ( ( > tableAllColumns )=> tableAllColumns -> ^( TOK_SELEXPR tableAllColumns ) | ( > expression ( ( ( KW_AS )? identifier ) | ( KW_AS LPAREN identifier ( COMMA > identifier )* RPAREN ) )? ) -> ^( TOK_SELEXPR expression ( identifier )* ) > );]) > at > org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser$DFA17.specialStateTransition(HiveParser_SelectClauseParser.java:11835) > at org.antlr.runtime.DFA.predict(DFA.java:80) > at > org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectItem(HiveParser_SelectClauseParser.java:2853) > at > org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectList(HiveParser_SelectClauseParser.java:1401) > at > org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:1128) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13993) PySpark ml.feature.RFormula/RFormulaModel support export/import
[ https://issues.apache.org/jira/browse/SPARK-13993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13993: Assignee: Apache Spark > PySpark ml.feature.RFormula/RFormulaModel support export/import > --- > > Key: SPARK-13993 > URL: https://issues.apache.org/jira/browse/SPARK-13993 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: Xusen Yin >Assignee: Apache Spark >Priority: Minor > > Add save/load for RFormula and its model. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13733) Support initial weight distribution in personalized PageRank
[ https://issues.apache.org/jira/browse/SPARK-13733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15198327#comment-15198327 ] Gayathri Murali commented on SPARK-13733: - [~mengxr] Should the rest of the vertices also be set to resetProb(which is 0.25 initial weight) ? > Support initial weight distribution in personalized PageRank > > > Key: SPARK-13733 > URL: https://issues.apache.org/jira/browse/SPARK-13733 > Project: Spark > Issue Type: New Feature > Components: GraphX >Reporter: Xiangrui Meng > > It would be nice to support personalized PageRank with an initial weight > distribution besides a single vertex. It should be easy to modify the current > implementation to add this support. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
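The extension requested in SPARK-13733 — teleporting back to an initial weight distribution rather than a single source vertex — amounts to a one-line change in the power-iteration update. Below is a toy sketch (plain Python, not GraphX's Pregel implementation; `reset_prob=0.15` is the conventional default, and every vertex is assumed to have at least one outgoing edge):

```python
def personalized_pagerank(edges, n, reset_dist, reset_prob=0.15, iters=50):
    """Power iteration for personalized PageRank where teleportation
    follows an arbitrary reset distribution (reset_dist) instead of
    concentrating all mass on one source vertex.

    Assumes every vertex has out-degree >= 1 (no dangling nodes).
    """
    out_deg = [0] * n
    for src, _ in edges:
        out_deg[src] += 1
    rank = list(reset_dist)  # start from the initial weight distribution
    for _ in range(iters):
        contrib = [0.0] * n
        for src, dst in edges:
            contrib[dst] += rank[src] / out_deg[src]
        # single-source PPR is the special case where reset_dist is one-hot
        rank = [reset_prob * reset_dist[v] + (1.0 - reset_prob) * contrib[v]
                for v in range(n)]
    return rank

# Teleport mass split across vertices 0 and 1 rather than one source.
ranks = personalized_pagerank([(0, 1), (1, 2), (2, 0)], 3, [0.5, 0.5, 0.0])
```

Because the reset distribution sums to 1 and every vertex forwards all of its rank, the returned ranks also sum to 1.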
[jira] [Resolved] (SPARK-13034) PySpark ml.classification support export/import
[ https://issues.apache.org/jira/browse/SPARK-13034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-13034. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11707 [https://github.com/apache/spark/pull/11707] > PySpark ml.classification support export/import > --- > > Key: SPARK-13034 > URL: https://issues.apache.org/jira/browse/SPARK-13034 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: Yanbo Liang >Priority: Minor > Fix For: 2.0.0 > > > Add export/import for all estimators and transformers(which have Scala > implementation) under pyspark/ml/classification.py. Please refer the > implementation at SPARK-13032. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14005) Make RDD more compatible with Scala's collection
[ https://issues.apache.org/jira/browse/SPARK-14005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15203056#comment-15203056 ] zhengruifeng commented on SPARK-14005: -- ok, plz close this jira. > Make RDD more compatible with Scala's collection > - > > Key: SPARK-14005 > URL: https://issues.apache.org/jira/browse/SPARK-14005 > Project: Spark > Issue Type: Question > Components: Spark Core >Reporter: zhengruifeng >Priority: Trivial > > How about implementing some more methods for RDD to make it more compatible > with Scala's collection? > Such as: > nonEmpty, slice, takeRight, contains, last, reverse -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13968) Use MurmurHash3 for hashing String features
[ https://issues.apache.org/jira/browse/SPARK-13968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15202003#comment-15202003 ] Joseph K. Bradley commented on SPARK-13968: --- I'm going to close this in favor of the older ticket. I'll make the old ticket a subtask. But I agree it'd be good to switch. > Use MurmurHash3 for hashing String features > --- > > Key: SPARK-13968 > URL: https://issues.apache.org/jira/browse/SPARK-13968 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Nick Pentreath >Assignee: Yanbo Liang >Priority: Minor > > Typically feature hashing is done on strings, i.e. feature names (or in the > case of raw feature indexes, either the string representation of the > numerical index can be used, or the index used "as-is" and not hashed). > It is common to use a well-distributed hash function such as MurmurHash3. > This is the case in e.g. > [Scikit-learn|http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html#sklearn.feature_extraction.FeatureHasher]. > Currently Spark's {{HashingTF}} uses the object's hash code. Look at using > MurmurHash3 (at least for {{String}} which is the common case). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13629) Add binary toggle Param to CountVectorizer
[ https://issues.apache.org/jira/browse/SPARK-13629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15201993#comment-15201993 ] Joseph K. Bradley commented on SPARK-13629: --- [~mlnick] Thanks for handling these count/hashing improvements! This PR + the other JIRAs sound good to me. > Add binary toggle Param to CountVectorizer > -- > > Key: SPARK-13629 > URL: https://issues.apache.org/jira/browse/SPARK-13629 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley >Assignee: yuhao yang >Priority: Minor > Fix For: 2.0.0 > > > It would be handy to add a binary toggle Param to CountVectorizer, as in the > scikit-learn one: > [http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html] > If set, then all non-zero counts will be set to 1. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
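The binary toggle described above is a small post-processing step on the count vector: when set, every non-zero count is clamped to 1. A toy sketch (not Spark's CountVectorizer, which operates on DataFrames and a fitted vocabulary):

```python
from collections import Counter

def count_vectorize(docs, vocab, binary=False):
    """Toy bag-of-words vectorizer illustrating the binary toggle:
    with binary=True every non-zero term count becomes 1, as in
    scikit-learn's CountVectorizer(binary=True)."""
    index = {term: i for i, term in enumerate(vocab)}
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for term, count in Counter(doc).items():
            if term in index:  # out-of-vocabulary terms are dropped
                row[index[term]] = 1 if binary else count
        rows.append(row)
    return rows

docs = [["spark", "spark", "ml"]]
print(count_vectorize(docs, ["spark", "ml"]))               # [[2, 1]]
print(count_vectorize(docs, ["spark", "ml"], binary=True))  # [[1, 1]]
```

The binary form is useful for models that expect presence/absence features (e.g. Bernoulli naive Bayes) rather than frequencies.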
[jira] [Assigned] (SPARK-11319) PySpark silently accepts null values in non-nullable DataFrame fields.
[ https://issues.apache.org/jira/browse/SPARK-11319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11319: Assignee: (was: Apache Spark) > PySpark silently accepts null values in non-nullable DataFrame fields. > -- > > Key: SPARK-11319 > URL: https://issues.apache.org/jira/browse/SPARK-11319 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Reporter: Kevin Cox > > Running the following code with a null value in a non-nullable column > silently works. This makes the code incredibly hard to trust. > {code} > In [2]: from pyspark.sql.types import * > In [3]: sqlContext.createDataFrame([(None,)], StructType([StructField("a", > TimestampType(), False)])).collect() > Out[3]: [Row(a=None)] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13948) MiMa Check should catch if the visibility change to `private`
[ https://issues.apache.org/jira/browse/SPARK-13948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-13948: --- Component/s: Project Infra > MiMa Check should catch if the visibility change to `private` > -- > > Key: SPARK-13948 > URL: https://issues.apache.org/jira/browse/SPARK-13948 > Project: Spark > Issue Type: Bug > Components: Build, Project Infra >Reporter: Dongjoon Hyun >Assignee: Josh Rosen >Priority: Critical > Fix For: 2.0.0 > > > `GenerateMIMAIgnore.scala` makes `.generated-mima-class-excludes` from the > current code having `private` class. As a result, it ignores the case : > visibility goes from public into private. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13865) TPCDS query 87 returns wrong results compared to TPC official result set
[ https://issues.apache.org/jira/browse/SPARK-13865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15200516#comment-15200516 ] Xiao Li commented on SPARK-13865: - This is the same as the https://issues.apache.org/jira/browse/SPARK-13859. The query used in this JIRA is different from the original one in TPCDS, as posted below. Therefore, the results are different from the official result. [~jfc...@us.ibm.com] You need to rerun it by using the standard query. Thanks! Spark does support Except. You do not need to change it. {code} select count(*) from ((select distinct c_last_name, c_first_name, d_date from store_sales, date_dim, customer where store_sales.ss_sold_date_sk = date_dim.d_date_sk and store_sales.ss_customer_sk = customer.c_customer_sk and d_month_seq between [DMS] and [DMS]+11) except (select distinct c_last_name, c_first_name, d_date from catalog_sales, date_dim, customer where catalog_sales.cs_sold_date_sk = date_dim.d_date_sk and catalog_sales.cs_bill_customer_sk = customer.c_customer_sk and d_month_seq between [DMS] and [DMS]+11) except (select distinct c_last_name, c_first_name, d_date from web_sales, date_dim, customer where web_sales.ws_sold_date_sk = date_dim.d_date_sk and web_sales.ws_bill_customer_sk = customer.c_customer_sk and d_month_seq between [DMS] and [DMS]+11) ) cool_cust ; {code} > TPCDS query 87 returns wrong results compared to TPC official result set > - > > Key: SPARK-13865 > URL: https://issues.apache.org/jira/browse/SPARK-13865 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: JESSE CHEN > Labels: tpcds-result-mismatch > > Testing Spark SQL using TPC queries. Query 87 returns wrong results compared > to official result set. This is at 1GB SF (validation run). > SparkSQL returns count of 47555, answer set expects 47298. 
> Actual results: > {noformat} > [47555] > {noformat} > {noformat} > Expected: > +---+ > | 1 | > +---+ > | 47298 | > +---+ > {noformat} > Query used: > {noformat} > -- start query 87 in stream 0 using template query87.tpl and seed > QUALIFICATION > select count(*) > from > (select distinct c_last_name as cln1, c_first_name as cfn1, d_date as > ddate1, 1 as notnull1 >from store_sales > JOIN date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk > JOIN customer ON store_sales.ss_customer_sk = customer.c_customer_sk >where > d_month_seq between 1200 and 1200+11 >) tmp1 >left outer join > (select distinct c_last_name as cln2, c_first_name as cfn2, d_date as > ddate2, 1 as notnull2 >from catalog_sales > JOIN date_dim ON catalog_sales.cs_sold_date_sk = date_dim.d_date_sk > JOIN customer ON catalog_sales.cs_bill_customer_sk = > customer.c_customer_sk >where > d_month_seq between 1200 and 1200+11 >) tmp2 > on (tmp1.cln1 = tmp2.cln2) > and (tmp1.cfn1 = tmp2.cfn2) > and (tmp1.ddate1= tmp2.ddate2) >left outer join > (select distinct c_last_name as cln3, c_first_name as cfn3 , d_date as > ddate3, 1 as notnull3 >from web_sales > JOIN date_dim ON web_sales.ws_sold_date_sk = date_dim.d_date_sk > JOIN customer ON web_sales.ws_bill_customer_sk = > customer.c_customer_sk >where > d_month_seq between 1200 and 1200+11 >) tmp3 > on (tmp1.cln1 = tmp3.cln3) > and (tmp1.cfn1 = tmp3.cfn3) > and (tmp1.ddate1= tmp3.ddate3) > where > notnull2 is null and notnull3 is null > ; > -- end query 87 in stream 0 using template query87.tpl > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13972) hive tests should fail if SQL generation failed
[ https://issues.apache.org/jira/browse/SPARK-13972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-13972: --- Assignee: Wenchen Fan > hive tests should fail if SQL generation failed > --- > > Key: SPARK-13972 > URL: https://issues.apache.org/jira/browse/SPARK-13972 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13776) Web UI is not available after ./sbin/start-master.sh
[ https://issues.apache.org/jira/browse/SPARK-13776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-13776: -- Assignee: Shixiong Zhu > Web UI is not available after ./sbin/start-master.sh > > > Key: SPARK-13776 > URL: https://issues.apache.org/jira/browse/SPARK-13776 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.6.0 > Environment: Solaris 11.3, Oracle SPARC T-5 8 with 1024 hardware > threads >Reporter: Erik O'Shaughnessy >Assignee: Shixiong Zhu >Priority: Minor > Fix For: 2.0.0 > > > The Apache Spark Web UI fails to become available after starting a Spark > master in stand-alone mode: > $ ./sbin/start-master.sh > The log file contains the following: > {quote} > cat spark-hadoop-org.apache.spark.deploy.master.Master-1-t5-8-002.out > Spark Command: /usr/java/bin/java -cp > /usr/local/spark-1.6.0_nohadoop/conf/:/usr/local/spark-1.6.0_nohadoop/assembly/target/scala-2.10/spark-assembly-1.6.0-hadoop2.2.0.jar:/usr/local/spark-1.6.0_nohadoop/lib_managed/jars/datanucleus-api-jdo-3.2.6.jar:/usr/local/spark-1.6.0_nohadoop/lib_managed/jars/datanucleus-rdbms-3.2.9.jar:/usr/local/spark-1.6.0_nohadoop/lib_managed/jars/datanucleus-core-3.2.10.jar > -Xms1g -Xmx1g org.apache.spark.deploy.master.Master --ip t5-8-002 --port > 7077 --webui-port 8080 > > 16/01/27 12:00:42 WARN AbstractConnector: insufficient threads configured for > SelectChannelConnector@0.0.0.0:8080 > 16/01/27 12:00:42 WARN AbstractConnector: insufficient threads configured for > SelectChannelConnector@t5-8-002:6066 > {quote} > I did some poking around and it seems that message is coming from Jetty and > indicates a mismatch between Jetty's default maxThreads configuration and the > actual number of CPUs available on the hardware (1024). I was not able to > find a way to successfully change Jetty's configuration at run-time. 
> Our workaround was to disable CPUs until the WARN messages did not occur in > the log file, which was when NCPUs = 504. > I don't know for certain that this isn't a known problem in Jetty from > looking at their bug reports, but I wasn't able to locate a Jetty issue that > described this problem. > While not specifically an Apache Spark problem, I thought documenting it > would at least be helpful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10788) Decision Tree duplicates bins for unordered categorical features
[ https://issues.apache.org/jira/browse/SPARK-10788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-10788. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 9474 [https://github.com/apache/spark/pull/9474] > Decision Tree duplicates bins for unordered categorical features > > > Key: SPARK-10788 > URL: https://issues.apache.org/jira/browse/SPARK-10788 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Assignee: Seth Hendrickson >Priority: Minor > Fix For: 2.0.0 > > > Decision trees in spark.ml (RandomForest.scala) communicate twice as much > data as needed for unordered categorical features. Here's an example. > Say there are 3 categories A, B, C. We consider 3 splits: > * A vs. B, C > * A, B vs. C > * A, C vs. B > Currently, we collect statistics for each of the 6 subsets of categories (3 * > 2 = 6). However, we could instead collect statistics for the 3 subsets on > the left-hand side of the 3 possible splits: A and A,B and A,C. If we also > have stats for the entire node, then we can compute the stats for the 3 > subsets on the right-hand side of the splits. In pseudomath: {{stats(B,C) = > stats(A,B,C) - stats(A)}}. > We should eliminate these extra bins within the spark.ml implementation since > the spark.mllib implementation will be removed before long (and will instead > call into spark.ml). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
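The pseudomath in the issue, {{stats(B,C) = stats(A,B,C) - stats(A)}}, works because the split statistics are additive per-class label counts: subtracting the left-hand subset's counts from the node totals recovers the right-hand subset without collecting it. A minimal sketch with made-up label counts:

```python
def subtract_stats(node_stats, left_stats):
    """Recover right-hand-side split statistics from node totals,
    per the identity stats(B,C) = stats(A,B,C) - stats(A).
    Works for any additive statistic, e.g. per-class label counts."""
    return [total - left for total, left in zip(node_stats, left_stats)]

node = [30, 20]  # stats(A,B,C): class counts for the whole node
left = [10, 5]   # stats(A): counts on the left of the split A vs. B,C
print(subtract_stats(node, left))  # [20, 15] == stats(B,C)
```

This is why collecting only the 3 left-hand subsets (plus the node totals, which are aggregated anyway) halves the communicated bins for a 3-category unordered feature.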
[jira] [Commented] (SPARK-12719) SQL generation support for generators (including UDTF)
[ https://issues.apache.org/jira/browse/SPARK-12719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15200419#comment-15200419 ] Apache Spark commented on SPARK-12719: -- User 'yy2016' has created a pull request for this issue: https://github.com/apache/spark/pull/11795 > SQL generation support for generators (including UDTF) > -- > > Key: SPARK-12719 > URL: https://issues.apache.org/jira/browse/SPARK-12719 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian > > {{HiveCompatibilitySuite}} can be useful for bootstrapping test coverage. > Please refer to SPARK-11012 for more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13461) Duplicated example code merge and cleanup
[ https://issues.apache.org/jira/browse/SPARK-13461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15203041#comment-15203041 ] Gabor Liptak commented on SPARK-13461: -- [~yinxusen] {{examples/src/main/scala/org/apache/spark/examples/ml/TrainValidationSplitExample.scala}} doesn't seem to be referenced. Do you see it simply deleted? Thanks > Duplicated example code merge and cleanup > - > > Key: SPARK-13461 > URL: https://issues.apache.org/jira/browse/SPARK-13461 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin >Priority: Minor > Labels: starter > > Merge duplicated code after we finishing the example code substitution. > Duplications include: > * JavaTrainValidationSplitExample > * TrainValidationSplitExample > * Random data generation in mllib-statistics.md need to remove "-" in each > line. > * Others can be added here ... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13969) Extend input format that feature hashing can handle
[ https://issues.apache.org/jira/browse/SPARK-13969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15202007#comment-15202007 ] Joseph K. Bradley commented on SPARK-13969: --- I think HashingTF could be extended to handle this in two steps: * Handle more input types [SPARK-11107] * Accept multiple input columns [SPARK-8418] > Extend input format that feature hashing can handle > --- > > Key: SPARK-13969 > URL: https://issues.apache.org/jira/browse/SPARK-13969 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Nick Pentreath >Priority: Minor > > Currently {{HashingTF}} works like {{CountVectorizer}} (the equivalent in > scikit-learn is {{HashingVectorizer}}). That is, it works on a sequence of > strings and computes term frequencies. > The use cases for feature hashing extend to arbitrary feature values (binary, > count or real-valued). For example, scikit-learn's {{FeatureHasher}} can > accept a sequence of (feature_name, value) pairs (e.g. a map, list). In this > way, feature hashing can operate as both "one-hot encoder" and "vector > assembler" at the same time. > Investigate adding a more generic feature hasher (that in turn can be used by > {{HashingTF}}). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
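A generic feature hasher of the kind described — consuming (feature_name, value) pairs and acting as one-hot encoder and vector assembler at once — can be sketched as follows. MD5 stands in for a well-distributed hash (MurmurHash3 is not in the Python standard library), and the bucket count is illustrative:

```python
import hashlib

def hash_features(pairs, num_buckets=16):
    """Hash (feature_name, value) pairs into a fixed-width vector.

    A categorical feature is passed as e.g. ("city=NY", 1.0), giving
    one-hot behavior; a numeric feature as ("age", 33.0). MD5 is a
    stand-in for a well-distributed hash such as MurmurHash3."""
    vec = [0.0] * num_buckets
    for name, value in pairs:
        digest = hashlib.md5(name.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") % num_buckets
        vec[bucket] += value  # colliding features simply accumulate
    return vec

vec = hash_features([("age", 33.0), ("city=NY", 1.0)])
```

Since only additions are performed, the total mass of the vector equals the sum of the input values regardless of collisions, and the same input always maps to the same buckets.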
[jira] [Updated] (SPARK-13960) HTTP-based JAR Server doesn't respect spark.driver.host and there is no "spark.fileserver.host" option
[ https://issues.apache.org/jira/browse/SPARK-13960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ilya Ostrovskiy updated SPARK-13960: Description: There is no option to specify which hostname/IP address the jar/file server listens on, and rather than using "spark.driver.host" if specified, the jar/file server will listen on the system's primary IP address. This is an issue when submitting an application in client mode on a machine with two NICs connected to two different networks. Steps to reproduce: 1) Have a cluster in a remote network, whose master is on 192.168.255.10 2) Have a machine at another location, with a "primary" IP address of 192.168.1.2, connected to the "remote network" as well, with the IP address 192.168.255.250. Let's call this the "client machine". 3) Ensure every machine in the spark cluster at the remote location can ping 192.168.255.250 and reach the client machine via that address. 4) On the client: {noformat} spark-submit --deploy-mode client --conf "spark.driver.host=192.168.255.250" --master spark://192.168.255.10:7077 --class {noformat} 5) Navigate to http://192.168.255.250:4040/ and ensure that executors from the remote cluster have found the driver on the client machine 6) Navigate to http://192.168.255.250:4040/environment/, and scroll to the bottom 7) Observe that the JAR you specified in Step 4 will be listed under http://192.168.1.2:/jars/.jar 8) Enjoy this stack trace periodically appearing on the client machine when the nodes in the remote cluster can't connect to 192.168.1.2 to get your JAR {noformat} 16/03/17 03:25:55 WARN TaskSetManager: Lost task 1.2 in stage 0.0 (TID 5, 192.168.255.11): java.net.SocketTimeoutException: connect timed out at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) at 
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) at java.net.Socket.connect(Socket.java:589) at sun.net.NetworkClient.doConnect(NetworkClient.java:175) at sun.net.www.http.HttpClient.openServer(HttpClient.java:432) at sun.net.www.http.HttpClient.openServer(HttpClient.java:527) at sun.net.www.http.HttpClient.(HttpClient.java:211) at sun.net.www.http.HttpClient.New(HttpClient.java:308) at sun.net.www.http.HttpClient.New(HttpClient.java:326) at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169) at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105) at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999) at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933) at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:588) at org.apache.spark.util.Utils$.fetchFile(Utils.scala:381) at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:405) at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:397) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39) at scala.collection.mutable.HashMap.foreach(HashMap.scala:98) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:397) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {noformat} was: There is no option to specify which hostname/IP address the jar/file server listens on, and rather than using "spark.driver.host" if specified, the jar/file server will listen on the system's primary IP address. This is an issue when submitting an application in client mode on a machine with two NICs connected to two different networks. Steps to reproduce: 1) Have a cluster in a remote network, whose master is on 192.168.255.10 2) Have a machine at another
[jira] [Commented] (SPARK-13968) Use MurmurHash3 for hashing String features
[ https://issues.apache.org/jira/browse/SPARK-13968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15200254#comment-15200254 ] Nick Pentreath commented on SPARK-13968: Sure, I will assign to you. But I'd like to get some thoughts from [~mengxr] and [~josephkb] about this and the umbrella for feature hashing improvements (especially around the API / transformer behaviour) before starting work on these tickets. > Use MurmurHash3 for hashing String features > --- > > Key: SPARK-13968 > URL: https://issues.apache.org/jira/browse/SPARK-13968 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Nick Pentreath >Priority: Minor > > Typically feature hashing is done on strings, i.e. feature names (or in the > case of raw feature indexes, either the string representation of the > numerical index can be used, or the index used "as-is" and not hashed). > It is common to use a well-distributed hash function such as MurmurHash3. > This is the case in e.g. > [Scikit-learn|http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html#sklearn.feature_extraction.FeatureHasher]. > Currently Spark's {{HashingTF}} uses the object's hash code. Look at using > MurmurHash3 (at least for {{String}} which is the common case). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13938) word2phrase feature created in ML
Steve Weng created SPARK-13938: -- Summary: word2phrase feature created in ML Key: SPARK-13938 URL: https://issues.apache.org/jira/browse/SPARK-13938 Project: Spark Issue Type: New Feature Components: ML Reporter: Steve Weng Priority: Critical I implemented word2phrase (see http://arxiv.org/pdf/1310.4546.pdf) which transforms a sentence of words into one where certain individual consecutive words are concatenated by using a training model/estimator (e.g. "I went to New York" becomes "I went to new_york"). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
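The word2phrase idea from the cited paper (Mikolov et al., 2013) scores each bigram as (count(ab) - delta) / (count(a) * count(b)) over a training corpus and concatenates bigrams scoring above a threshold. A toy sketch of that estimator/transformer split — the constants here are illustrative, not the paper's defaults or this feature's actual API:

```python
from collections import Counter

def train_phrases(sentences, delta=1, threshold=0.2):
    """Learn bigrams to merge: score(a,b) = (count(ab) - delta) /
    (count(a) * count(b)); keep those scoring above threshold."""
    uni, bi = Counter(), Counter()
    for s in sentences:
        uni.update(s)
        bi.update(zip(s, s[1:]))
    return {pair for pair, c in bi.items()
            if (c - delta) / (uni[pair[0]] * uni[pair[1]]) > threshold}

def apply_phrases(sentence, phrases):
    """Greedily replace learned bigrams with a single '_'-joined token."""
    out, i = [], 0
    while i < len(sentence):
        if i + 1 < len(sentence) and (sentence[i], sentence[i + 1]) in phrases:
            out.append(sentence[i] + "_" + sentence[i + 1])
            i += 2
        else:
            out.append(sentence[i])
            i += 1
    return out

sents = [["i", "went", "to", "new", "york"],
         ["she", "flew", "to", "new", "york"],
         ["he", "walked", "to", "school"],
         ["they", "drove", "to", "work"]]
phrases = train_phrases(sents)
print(apply_phrases(["i", "went", "to", "new", "york"], phrases))
# ['i', 'went', 'to', 'new_york']
```

The delta discount suppresses phrases built from very rare words, and frequent-but-incidental pairs like ("to", "new") score low because their unigram counts inflate the denominator.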
[jira] [Assigned] (SPARK-13958) Executor OOM due to unbounded growth of pointer array in Sorter
[ https://issues.apache.org/jira/browse/SPARK-13958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13958: Assignee: (was: Apache Spark) > Executor OOM due to unbounded growth of pointer array in Sorter > --- > > Key: SPARK-13958 > URL: https://issues.apache.org/jira/browse/SPARK-13958 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 1.6.1 >Reporter: Sital Kedia > > While running a job we saw that the executors are OOMing because in > UnsafeExternalSorter's growPointerArrayIfNecessary function, we are just > growing the pointer array indefinitely. > https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java#L292 > This is a regression introduced in PR- > https://github.com/apache/spark/pull/11095 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10574) HashingTF should use MurmurHash3
[ https://issues.apache.org/jira/browse/SPARK-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-10574: -- Assignee: Yanbo Liang > HashingTF should use MurmurHash3 > > > Key: SPARK-10574 > URL: https://issues.apache.org/jira/browse/SPARK-10574 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Affects Versions: 1.5.0 >Reporter: Simeon Simeonov >Assignee: Yanbo Liang >Priority: Critical > Labels: HashingTF, hashing, mllib > > {{HashingTF}} uses the Scala native hashing {{##}} implementation. There are > two significant problems with this. > First, per the [Scala > documentation|http://www.scala-lang.org/api/2.10.4/index.html#scala.Any] for > {{hashCode}}, the implementation is platform specific. This means that > feature vectors created on one platform may be different than vectors created > on another platform. This can create significant problems when a model > trained offline is used in another environment for online prediction. The > problem is made harder by the fact that following a hashing transform > features lose human-tractable meaning and a problem such as this may be > extremely difficult to track down. > Second, the native Scala hashing function performs badly on longer strings, > exhibiting [200-500% higher collision > rates|https://gist.github.com/ssimeonov/eb12fcda75615e4a8d46] than, for > example, > [MurmurHash3|http://www.scala-lang.org/api/2.10.4/#scala.util.hashing.MurmurHash3$] > which is also included in the standard Scala libraries and is the hashing > choice of fast learners such as Vowpal Wabbit, scikit-learn and others. If > Spark users apply {{HashingTF}} only to very short, dictionary-like strings > the hashing function choice will not be a big problem but why have an > implementation in MLlib with this limitation when there is a better > implementation readily available in the standard Scala library? > Switching to MurmurHash3 solves both problems. 
If there is agreement that > this is a good change, I can prepare a PR. > Note that changing the hash function would mean that models saved with a > previous version would have to be re-trained. This introduces a problem > that's orthogonal to breaking changes in APIs: breaking changes related to > artifacts, e.g., a saved model, produced by a previous version. Is there a > policy or best practice currently in effect about this? If not, perhaps we > should come up with a few simple rules about how we communicate these in > release notes, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13951) PySpark ml.pipeline support export/import - nested Pipelines
Joseph K. Bradley created SPARK-13951: - Summary: PySpark ml.pipeline support export/import - nested Pipelines Key: SPARK-13951 URL: https://issues.apache.org/jira/browse/SPARK-13951 Project: Spark Issue Type: Sub-task Components: ML, PySpark Reporter: Joseph K. Bradley
[jira] [Created] (SPARK-13988) Large history files block new applications from showing up in History UI.
Parth Brahmbhatt created SPARK-13988: Summary: Large history files block new applications from showing up in History UI. Key: SPARK-13988 URL: https://issues.apache.org/jira/browse/SPARK-13988 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.6.1 Reporter: Parth Brahmbhatt Some of our Spark users complain that their applications were not showing up in the history server UI. Our analysis suggests that this is a side effect of some applications' event logs being too big. This is especially true for Spark ML applications that may have a lot of iterations, but it is applicable to other kinds of Spark jobs too. For example, on my local machine just running the following generates an event log of size 80MB. {code} ./spark-shell --master yarn --deploy-mode client --conf spark.eventLog.enabled=true --conf spark.eventLog.dir=hdfs://localhost:9000/tmp/spark-events val words = sc.textFile("test.txt") for(i <- 1 to 1) words.count sc.close {code} For one of our users this file was as big as 12GB. He was running logistic regression using Spark ML. Given that each application generates its own event log and event logs are processed serially in a single thread, one huge application can result in a lot of users not being able to view their applications on the main UI. To overcome this issue I propose to make the replay execution multi-threaded so a single large event log won't block other applications from being rendered into the UI. This still cannot solve the issue completely if there are too many large event logs, but the alternatives I have considered (read chunks from the beginning and end to get the Application Start and End events, or modify the event log format so it has this info in a header or footer) are all more intrusive. In addition, there are several other things we can do to improve the History Server implementation.
* During the log checker phase, to identify application start and end times, the replaying thread processes the whole event log and throws away all the info apart from the application start and end events. This is a pretty huge waste given that as soon as a user clicks on the application we reprocess the same event log to get job/task details. We should either optimize the first level of parsing so it reads some chunks from the beginning and end to identify the application-level details, or better yet cache the job/task-level details when we process the file for the first time.
* On the job details page there is no pagination and we only show the last 1000 job events when there are > 1000 job events. Granted, when users have more than 1K jobs they probably won't page through them, but not even having that option is a bad experience. Also, if that page were paginated we could probably do away with partial processing of the event log until the user wants to view the next page. This can help in cases where processing really large files causes OOM issues, as we will only be processing a subset of the file.
* On startup, the history server reprocesses the whole event log. For the top-level application details, we could persist the processing results from the last run in a more compact and searchable format to improve the bootstrap time. This is briefly mentioned in SPARK-6951.
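The multi-threaded replay proposal can be sketched with a thread pool: one task per event log, so a single huge log occupies one worker while the others keep surfacing smaller applications. This is a toy model, not the history server's code; `parse_app_summary` and the in-memory `logs` dict stand in for replaying a real event log file.

```python
from concurrent.futures import ThreadPoolExecutor

def parse_app_summary(path, logs):
    # Stand-in for replaying one event log end-to-end; in the history
    # server this is where a 12GB log would monopolize a single thread.
    events = logs[path]
    return {"app": path, "start": events[0], "end": events[-1]}

def scan_event_logs(paths, logs, workers=4):
    # One task per log: a huge log only blocks one worker, not the queue.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: parse_app_summary(p, logs), paths))
```

As the ticket notes, this does not help if every log is large, but it removes the head-of-line blocking a single serial replay thread creates.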
[jira] [Updated] (SPARK-10574) HashingTF should use MurmurHash3
[ https://issues.apache.org/jira/browse/SPARK-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-10574: -- Issue Type: Sub-task (was: Improvement) Parent: SPARK-13964 > HashingTF should use MurmurHash3 > > > Key: SPARK-10574 > URL: https://issues.apache.org/jira/browse/SPARK-10574 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Affects Versions: 1.5.0 >Reporter: Simeon Simeonov >Priority: Critical > Labels: HashingTF, hashing, mllib > > {{HashingTF}} uses the Scala native hashing {{##}} implementation. There are > two significant problems with this. > First, per the [Scala > documentation|http://www.scala-lang.org/api/2.10.4/index.html#scala.Any] for > {{hashCode}}, the implementation is platform specific. This means that > feature vectors created on one platform may be different than vectors created > on another platform. This can create significant problems when a model > trained offline is used in another environment for online prediction. The > problem is made harder by the fact that following a hashing transform > features lose human-tractable meaning and a problem such as this may be > extremely difficult to track down. > Second, the native Scala hashing function performs badly on longer strings, > exhibiting [200-500% higher collision > rates|https://gist.github.com/ssimeonov/eb12fcda75615e4a8d46] than, for > example, > [MurmurHash3|http://www.scala-lang.org/api/2.10.4/#scala.util.hashing.MurmurHash3$] > which is also included in the standard Scala libraries and is the hashing > choice of fast learners such as Vowpal Wabbit, scikit-learn and others. If > Spark users apply {{HashingTF}} only to very short, dictionary-like strings > the hashing function choice will not be a big problem but why have an > implementation in MLlib with this limitation when there is a better > implementation readily available in the standard Scala library? > Switching to MurmurHash3 solves both problems. 
If there is agreement that > this is a good change, I can prepare a PR. > Note that changing the hash function would mean that models saved with a > previous version would have to be re-trained. This introduces a problem > that's orthogonal to breaking changes in APIs: breaking changes related to > artifacts, e.g., a saved model, produced by a previous version. Is there a > policy or best practice currently in effect about this? If not, perhaps we > should come up with a few simple rules about how we communicate these in > release notes, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13886) ArrayType of BinaryType not supported in Row.equals method
[ https://issues.apache.org/jira/browse/SPARK-13886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15198763#comment-15198763 ] MahmoudHanafy commented on SPARK-13886: --- I think List extends Seq! In this case, how can you differentiate between: 1- ArrayType(ByteType) => Seq[Byte] 2- BinaryType => Array[Byte] > ArrayType of BinaryType not supported in Row.equals method > --- > > Key: SPARK-13886 > URL: https://issues.apache.org/jira/browse/SPARK-13886 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: MahmoudHanafy >Priority: Minor > > There are multiple types that are supported by Spark SQL. One of them is > ArrayType(Seq), whose elements can be of any type, > so it can be BinaryType (Array\[Byte\]). > In the equals method of the Row class, there is no handling for ArrayType of > BinaryType. > So for example: > {code:xml} > val a = Row( Seq( Array(1.toByte) ) ) > val b = Row( Seq( Array(1.toByte) ) ) > a.equals(b) // this will return false > {code} > Also, this doesn't work for MapType of BinaryType. > {code:xml} > val a = Row( Map(1 -> Array(1.toByte) ) ) > val b = Row( Map(1 -> Array(1.toByte) ) ) > a.equals(b) // this will return false > {code} > Question 1: Can the key in MapType be of BinaryType? > Question 2: Isn't there another way to handle BinaryType by using a Scala type > instead of Array? > I want to contribute by fixing this issue.
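The fix the reporter wants amounts to structural equality that compares byte arrays by content and recurses into sequences and maps. Python's `bytes` already compare by value, so the sketch below only mirrors the recursion a Scala `Row.equals` would need (where `Array[Byte]` compares by reference); the function name is hypothetical.

```python
def deep_equals(a, b):
    """Structural equality: byte arrays by content, recursing into
    sequences and maps -- the cases the report says Row.equals misses
    for ArrayType/MapType of BinaryType."""
    if isinstance(a, (bytes, bytearray)) and isinstance(b, (bytes, bytearray)):
        return bytes(a) == bytes(b)
    if isinstance(a, dict) and isinstance(b, dict):
        return (a.keys() == b.keys()
                and all(deep_equals(a[k], b[k]) for k in a))
    if isinstance(a, (list, tuple)) and isinstance(b, (list, tuple)):
        return len(a) == len(b) and all(deep_equals(x, y) for x, y in zip(a, b))
    return a == b
```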
[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API
[ https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15203027#comment-15203027 ] Cody Koeninger commented on SPARK-12177: Unless I'm misunderstanding your point, those changes are all in my fork already. Keeping a message handler for MessageAndMetadata doesn't make sense. Backwards compatibility with the existing direct stream isn't really workable. > Update KafkaDStreams to new Kafka 0.9 Consumer API > -- > > Key: SPARK-12177 > URL: https://issues.apache.org/jira/browse/SPARK-12177 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.6.0 >Reporter: Nikita Tarasenko > Labels: consumer, kafka > > Kafka 0.9 is already released and it introduces a new consumer API that is not > compatible with the old one, so I added a new consumer API. I made separate > classes in package org.apache.spark.streaming.kafka.v09 with the changed API. I > didn't remove the old classes, for backward compatibility. Users will not need > to change their old Spark applications when they upgrade to the new Spark version. > Please review my changes.
[jira] [Assigned] (SPARK-13997) Use Hadoop 2.0 default value for compression in data sources
[ https://issues.apache.org/jira/browse/SPARK-13997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13997: Assignee: (was: Apache Spark) > Use Hadoop 2.0 default value for compression in data sources > > > Key: SPARK-13997 > URL: https://issues.apache.org/jira/browse/SPARK-13997 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon >Priority: Trivial > > Currently, JSON, TEXT and CSV data sources use the {{CompressionCodecs}} class to > set compression configurations via {{option("compress", "codec")}}. > I made this use the Hadoop 1.x default value (block-level compression). However, > the default value in Hadoop 2.x is record-level compression, as described in > [mapred-site.xml|https://hadoop.apache.org/docs/r2.7.1/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml]. > Since Spark drops Hadoop 1.x support, it makes sense to use the Hadoop 2.x default > values. > According to [Hadoop: The Definitive Guide, 3rd > edition|https://www.safaribooksonline.com/library/view/hadoop-the-definitive/9781449328917/ch04.html], > these look like configurations for the unit of compression (record or block).
[jira] [Commented] (SPARK-13865) TPCDS query 87 returns wrong results compared to TPC official result set
[ https://issues.apache.org/jira/browse/SPARK-13865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15200637#comment-15200637 ] JESSE CHEN commented on SPARK-13865: This may be a TPC toolkit issue. I will be looking into this with John on my team, who is one of the TPC board members. > TPCDS query 87 returns wrong results compared to TPC official result set > - > > Key: SPARK-13865 > URL: https://issues.apache.org/jira/browse/SPARK-13865 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: JESSE CHEN > Labels: tpcds-result-mismatch > > Testing Spark SQL using TPC queries. Query 87 returns wrong results compared > to official result set. This is at 1GB SF (validation run). > SparkSQL returns count of 47555, answer set expects 47298. > Actual results: > {noformat} > [47555] > {noformat} > {noformat} > Expected: > +---+ > | 1 | > +---+ > | 47298 | > +---+ > {noformat} > Query used: > {noformat} > -- start query 87 in stream 0 using template query87.tpl and seed > QUALIFICATION > select count(*) > from > (select distinct c_last_name as cln1, c_first_name as cfn1, d_date as > ddate1, 1 as notnull1 >from store_sales > JOIN date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk > JOIN customer ON store_sales.ss_customer_sk = customer.c_customer_sk >where > d_month_seq between 1200 and 1200+11 >) tmp1 >left outer join > (select distinct c_last_name as cln2, c_first_name as cfn2, d_date as > ddate2, 1 as notnull2 >from catalog_sales > JOIN date_dim ON catalog_sales.cs_sold_date_sk = date_dim.d_date_sk > JOIN customer ON catalog_sales.cs_bill_customer_sk = > customer.c_customer_sk >where > d_month_seq between 1200 and 1200+11 >) tmp2 > on (tmp1.cln1 = tmp2.cln2) > and (tmp1.cfn1 = tmp2.cfn2) > and (tmp1.ddate1= tmp2.ddate2) >left outer join > (select distinct c_last_name as cln3, c_first_name as cfn3 , d_date as > ddate3, 1 as notnull3 >from web_sales > JOIN date_dim ON web_sales.ws_sold_date_sk = date_dim.d_date_sk
> JOIN customer ON web_sales.ws_bill_customer_sk = > customer.c_customer_sk >where > d_month_seq between 1200 and 1200+11 >) tmp3 > on (tmp1.cln1 = tmp3.cln3) > and (tmp1.cfn1 = tmp3.cfn3) > and (tmp1.ddate1= tmp3.ddate3) > where > notnull2 is null and notnull3 is null > ; > -- end query 87 in stream 0 using template query87.tpl > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
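Logically, the query above counts distinct store-sales (last name, first name, date) tuples that match in neither catalog nor web sales: the two left outer joins plus the final `notnull2 is null and notnull3 is null` filter are an anti-join. A toy set-algebra restatement (illustrative data, not TPCDS):

```python
def query87_count(store, catalog, web):
    # Distinct store-sales tuples with no match in catalog or web sales,
    # i.e. the anti-join expressed by the two left outer joins plus the
    # "notnull2 is null and notnull3 is null" predicate in query 87.
    return len(set(store) - set(catalog) - set(web))
```

A result mismatch like 47555 vs. 47298 would then come down to which side produces extra (or missing) distinct tuples before the set difference.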
[jira] [Created] (SPARK-13995) Constraints should take care of Cast
Liang-Chi Hsieh created SPARK-13995: --- Summary: Constraints should take care of Cast Key: SPARK-13995 URL: https://issues.apache.org/jira/browse/SPARK-13995 Project: Spark Issue Type: Bug Components: SQL Reporter: Liang-Chi Hsieh We infer relative constraints from logical plan's expressions. However, we don't consider Cast expression now. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13967) Add binary toggle Param to PySpark CountVectorizer
Nick Pentreath created SPARK-13967: -- Summary: Add binary toggle Param to PySpark CountVectorizer Key: SPARK-13967 URL: https://issues.apache.org/jira/browse/SPARK-13967 Project: Spark Issue Type: New Feature Components: ML Reporter: Nick Pentreath Priority: Minor See SPARK-13629 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12719) SQL generation support for generators (including UDTF)
[ https://issues.apache.org/jira/browse/SPARK-12719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15199852#comment-15199852 ] Apache Spark commented on SPARK-12719: -- User 'yy2016' has created a pull request for this issue: https://github.com/apache/spark/pull/11787 > SQL generation support for generators (including UDTF) > -- > > Key: SPARK-12719 > URL: https://issues.apache.org/jira/browse/SPARK-12719 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian > > {{HiveCompatibilitySuite}} can be useful for bootstrapping test coverage. > Please refer to SPARK-11012 for more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13960) HTTP-based JAR Server doesn't respect spark.driver.host and there is no "spark.fileserver.host" option
[ https://issues.apache.org/jira/browse/SPARK-13960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ilya Ostrovskiy updated SPARK-13960: Description: There is no option to specify which hostname/IP address the jar/file server listens on, and rather than using "spark.driver.host" if specified, the jar/file server will listen on the system's primary IP address. This is an issue when submitting an application in client mode on a machine with two NICs connected to two different networks. Steps to reproduce: 1) Have a cluster in a remote network, whose master is on 192.168.255.10 2) Have a machine at another location, with a "primary" IP address of 192.168.1.2, connected to the "remote network" as well, with the IP address 192.168.255.250. Let's call this the "client machine". 3) Ensure every machine in the spark cluster at the remote location can ping 192.168.255.250 and reach the client machine via that address. 4) On the client: {noformat} spark-submit --deploy-mode client --conf "spark.driver.host=192.168.255.250" --master spark://192.168.255.10:7077 --class {noformat} 5) Navigate to http://192.168.255.250:4040/ and ensure that executors from the remote cluster have found the driver on the client machine 6) Navigate to http://192.168.255.250:4040/environment/, and scroll to the bottom 7) Observe that the JAR you specified in Step 4 will be listed under http://192.168.1.2:/jars/.jar was: There is no option to specify which hostname/IP address the jar/file server listens on, and rather than using "spark.driver.host" if specified, the jar/file server will listen on the system's primary IP address. This is an issue when submitting an application in client mode on a machine with two NICs connected to two different networks. 
Steps to reproduce: 1) Have a cluster in a remote network, whose master is on 192.168.255.10 2) Have a machine at another location, with a "primary" IP address of "192.168.1.2", connected to the "remote network" as well, with the IP address "192.168.255.250". Let's call this the "client machine". 3) Ensure every machine in the spark cluster at the remote location can ping "192.168.255.250" and reach the client machine via that address. 4) On the client: {noformat} spark-submit --deploy-mode client --conf "spark.driver.host=192.168.255.250" --master spark://192.168.255.10:7077 --class {noformat} 5) Navigate to "http://192.168.255.250:4040/" and ensure that executors from the remote cluster have found the driver on the client machine 6) Navigate to "http://192.168.255.250:4040/environment/", and scroll to the bottom 7) Observe that the JAR you specified in Step 4 will be listed under "http://192.168.1.2:/jars/.jar" 8) Grok source and documentation to see if there's any way to change that 9) Submit this issue > HTTP-based JAR Server doesn't respect spark.driver.host and there is no > "spark.fileserver.host" option > -- > > Key: SPARK-13960 > URL: https://issues.apache.org/jira/browse/SPARK-13960 > Project: Spark > Issue Type: Bug > Components: Spark Core, Spark Submit >Affects Versions: 1.6.1 > Environment: Any system with more than one IP address >Reporter: Ilya Ostrovskiy > > There is no option to specify which hostname/IP address the jar/file server > listens on, and rather than using "spark.driver.host" if specified, the > jar/file server will listen on the system's primary IP address. This is an > issue when submitting an application in client mode on a machine with two > NICs connected to two different networks. 
> Steps to reproduce: > 1) Have a cluster in a remote network, whose master is on 192.168.255.10 > 2) Have a machine at another location, with a "primary" IP address of > 192.168.1.2, connected to the "remote network" as well, with the IP address > 192.168.255.250. Let's call this the "client machine". > 3) Ensure every machine in the spark cluster at the remote location can ping > 192.168.255.250 and reach the client machine via that address. > 4) On the client: > {noformat} > spark-submit --deploy-mode client --conf "spark.driver.host=192.168.255.250" > --master spark://192.168.255.10:7077 --class > > {noformat} > 5) Navigate to http://192.168.255.250:4040/ and ensure that executors from > the remote cluster have found the driver on the client machine > 6) Navigate to http://192.168.255.250:4040/environment/, and scroll to the > bottom > 7) Observe that the JAR you specified in Step 4 will be listed under > http://192.168.1.2:/jars/.jar -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail:
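The behavior the report asks for boils down to binding the file server's listening socket to a configured address rather than the machine's primary interface. A minimal sketch (hypothetical function name; Spark's actual HTTP file server is Java/Scala):

```python
import socket

def start_file_server(bind_host, port=0):
    # Bind explicitly to the configured address (e.g. spark.driver.host,
    # or a hypothetical spark.fileserver.host) instead of letting the OS
    # pick the system's primary interface.
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind((bind_host, port))  # port=0 lets the OS choose a free port
    srv.listen(1)
    return srv
```

With two NICs, binding to the address the remote cluster can actually reach is what makes the advertised JAR URLs resolvable from the executors.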
[jira] [Assigned] (SPARK-13992) Add support for off-heap caching
[ https://issues.apache.org/jira/browse/SPARK-13992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13992: Assignee: Josh Rosen (was: Apache Spark) > Add support for off-heap caching > > > Key: SPARK-13992 > URL: https://issues.apache.org/jira/browse/SPARK-13992 > Project: Spark > Issue Type: New Feature >Reporter: Josh Rosen >Assignee: Josh Rosen > > We should add support for caching serialized data off-heap within the same > process (i.e. using direct buffers or sun.misc.unsafe). > I'll expand this JIRA later with more detail (filing now as a placeholder). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13967) Add binary toggle Param to PySpark CountVectorizer
[ https://issues.apache.org/jira/browse/SPARK-13967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15201339#comment-15201339 ] Nick Pentreath commented on SPARK-13967: [~yuhaoyan] or [~bryanc] would you like to take this? > Add binary toggle Param to PySpark CountVectorizer > -- > > Key: SPARK-13967 > URL: https://issues.apache.org/jira/browse/SPARK-13967 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Nick Pentreath >Priority: Minor > > See SPARK-13629 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13989) Remove non-vectorized/unsafe-row parquet record reader
[ https://issues.apache.org/jira/browse/SPARK-13989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-13989. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11799 [https://github.com/apache/spark/pull/11799] > Remove non-vectorized/unsafe-row parquet record reader > -- > > Key: SPARK-13989 > URL: https://issues.apache.org/jira/browse/SPARK-13989 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Sameer Agarwal >Priority: Minor > Fix For: 2.0.0 > > > Clean up the new parquet record reader by removing the non-vectorized parquet > reader code from `UnsafeRowParquetRecordReader`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14016) Support high-precision decimals in vectorized parquet reader
Sameer Agarwal created SPARK-14016: -- Summary: Support high-precision decimals in vectorized parquet reader Key: SPARK-14016 URL: https://issues.apache.org/jira/browse/SPARK-14016 Project: Spark Issue Type: Sub-task Reporter: Sameer Agarwal -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13986) Make `DeveloperApi`-annotated things public
[ https://issues.apache.org/jira/browse/SPARK-13986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15200489#comment-15200489 ] Apache Spark commented on SPARK-13986: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/11797 > Make `DeveloperApi`-annotated things public > --- > > Key: SPARK-13986 > URL: https://issues.apache.org/jira/browse/SPARK-13986 > Project: Spark > Issue Type: Improvement > Components: MLlib, Spark Core >Reporter: Dongjoon Hyun >Priority: Minor > > Spark uses `@DeveloperApi` annotation, but sometimes it seems to conflict > with its visibility. This issue proposes to fix those conflict. The following > is the example. > {code:title=JobResult.scala|borderStyle=solid} > @DeveloperApi > sealed trait JobResult > @DeveloperApi > case object JobSucceeded extends JobResult > @DeveloperApi > -private[spark] case class JobFailed(exception: Exception) extends JobResult > +case class JobFailed(exception: Exception) extends JobResult > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API
[ https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15203018#comment-15203018 ] Eugene Miretsky commented on SPARK-12177: - The new Kafka Java Consumer is using Deserializer instead of Decoder. The difference is not too big (extra type safety, and Deserializer::deserialize accepts a topic and a byte payload, while Decoder::fromBytes accepts only a byte payload), but still it would be nice to align with the new Kafka consumer. Would it make sense to replace Decoder with Deserializer in the new DirectStream? This would require getting rid of MessageAndMetadata, and hence breaking backwards compatibility with the existing DirectStream, but I guess it will have to be done at some point. > Update KafkaDStreams to new Kafka 0.9 Consumer API > -- > > Key: SPARK-12177 > URL: https://issues.apache.org/jira/browse/SPARK-12177 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.6.0 >Reporter: Nikita Tarasenko > Labels: consumer, kafka > > Kafka 0.9 is already released and it introduces a new consumer API that is not > compatible with the old one, so I added a new consumer API. I made separate > classes in package org.apache.spark.streaming.kafka.v09 with the changed API. I > didn't remove the old classes, for backward compatibility. Users will not need > to change their old Spark applications when they upgrade to the new Spark version. > Please review my changes.
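The interface difference under discussion is small enough to sketch: the new consumer's `Deserializer.deserialize(topic, data)` receives the topic alongside the payload, whereas the old `Decoder.fromBytes(data)` receives only bytes. The Python shapes below are illustrative stand-ins for the Java interfaces, including an adapter showing why the migration is mechanical for payload-only decoders:

```python
class Deserializer:
    """Shape of the Kafka 0.9 new-consumer interface (topic-aware)."""
    def deserialize(self, topic, data):
        raise NotImplementedError

class StringDeserializer(Deserializer):
    def deserialize(self, topic, data):
        return data.decode("utf-8")

class DecoderAdapter(Deserializer):
    """Wrap an old-style payload-only decoder so it satisfies the new
    interface; the topic argument is simply ignored."""
    def __init__(self, from_bytes):
        self.from_bytes = from_bytes

    def deserialize(self, topic, data):
        return self.from_bytes(data)
```

What the adapter cannot recover is a `MessageAndMetadata`-style message handler, which is the backwards-compatibility break the comments above are weighing.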
[jira] [Assigned] (SPARK-13976) do not remove sub-queries added by user when generate SQL
[ https://issues.apache.org/jira/browse/SPARK-13976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13976: Assignee: Apache Spark > do not remove sub-queries added by user when generate SQL > - > > Key: SPARK-13976 > URL: https://issues.apache.org/jira/browse/SPARK-13976 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13940) Predicate Transitive Closure Transformation
Alex Antonov created SPARK-13940: Summary: Predicate Transitive Closure Transformation Key: SPARK-13940 URL: https://issues.apache.org/jira/browse/SPARK-13940 Project: Spark Issue Type: Improvement Components: Optimizer Affects Versions: 1.6.0 Reporter: Alex Antonov A relatively simple transformation is missing from Catalyst's arsenal - generation of transitive predicates. For instance, if you have got the following query: {code} select * from table1 t1 join table2 t2 on t1.a = t2.b where t1.a = 42 {code} then it is a fair assumption that t2.b also equals 42, hence an additional predicate could be generated. The additional predicate could in turn be pushed down through the join and improve performance of the whole query by filtering out the data before joining it. Such a transformation exists in Oracle DB, called transitive closure, which hopefully explains the title of this Jira.
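The rewrite can be modeled as equality propagation over a union-find of columns connected by equi-join conditions: any constant filter on one column is copied to every column in its equivalence class, and the copies can then be pushed below the join. A toy sketch (hypothetical names; not Catalyst code):

```python
def infer_transitive_filters(equi_joins, filters):
    """equi_joins: (colA, colB) equality join conditions.
    filters: {col: constant} predicates. Returns filters extended with
    constants propagated across the join equalities."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    # Union columns linked by equality joins.
    for a, b in equi_joins:
        parent[find(a)] = find(b)

    inferred = dict(filters)
    for col, val in filters.items():
        root = find(col)
        for other in list(parent):
            if find(other) == root:
                inferred.setdefault(other, val)
    return inferred
```

For the example in the report, the filter `t1.a = 42` plus the join condition `t1.a = t2.b` yields the extra predicate `t2.b = 42`, which can be applied to table2 before the join.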
[jira] [Updated] (SPARK-13937) PySpark ML JavaWrapper, variable _java_obj should not be static
[ https://issues.apache.org/jira/browse/SPARK-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-13937: -- Priority: Trivial (was: Minor) > PySpark ML JavaWrapper, variable _java_obj should not be static > --- > > Key: SPARK-13937 > URL: https://issues.apache.org/jira/browse/SPARK-13937 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Reporter: Bryan Cutler >Priority: Trivial > Fix For: 2.0.0 > > > In PySpark ML wrapper.py, the abstract class {{JavaWrapper}} has a static > variable {{_java_obj}}. This is meant to hold an instance of a companion > Java object. It seems as though it was made static accidentally because it > is never used, and all assignments done in derived classes are done to a > member variable with {{self._java_obj}}. This does not cause any problems > with the current functionality, but it should be changed so as not to cause > any confusion and misuse in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13938) word2phrase feature created in ML
[ https://issues.apache.org/jira/browse/SPARK-13938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-13938: -- [~s4weng] "Critical" is inappropriate here. Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark first. It's great you're implementing things on Spark, but they generally don't belong in Spark itself. I'm going to close this but you can start by providing your package via spark-packages.org > word2phrase feature created in ML > - > > Key: SPARK-13938 > URL: https://issues.apache.org/jira/browse/SPARK-13938 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Steve Weng >Priority: Critical > Original Estimate: 840h > Remaining Estimate: 840h > > I implemented word2phrase (see http://arxiv.org/pdf/1310.4546.pdf) which > transforms a sentence of words into one where certain individual consecutive > words are concatenated by using a training model/estimator (e.g. "I went to > New York" becomes "I went to new_york"). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-913) log the size of each shuffle block in block manager
[ https://issues.apache.org/jira/browse/SPARK-913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-913: -- Assignee: Apache Spark > log the size of each shuffle block in block manager > --- > > Key: SPARK-913 > URL: https://issues.apache.org/jira/browse/SPARK-913 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Reporter: Reynold Xin >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13973) `ipython notebook` is going away...
Bogdan Pirvu created SPARK-13973: Summary: `ipython notebook` is going away... Key: SPARK-13973 URL: https://issues.apache.org/jira/browse/SPARK-13973 Project: Spark Issue Type: Improvement Components: PySpark Environment: spark-1.6.1-bin-hadoop2.6 Anaconda2-2.5.0-Linux-x86_64 Reporter: Bogdan Pirvu Priority: Trivial Starting {{pyspark}} with following environment variables: {code:none} export IPYTHON=1 export IPYTHON_OPTS="notebook --no-browser" {code} yields this warning {code:none} [TerminalIPythonApp] WARNING | Subcommand `ipython notebook` is deprecated and will be removed in future versions. [TerminalIPythonApp] WARNING | You likely want to use `jupyter notebook`... continue in 5 sec. Press Ctrl-C to quit now. {code} Changing line 52 from {code:none} PYSPARK_DRIVER_PYTHON="ipython" {code} to {code:none} PYSPARK_DRIVER_PYTHON="jupyter" {code} in https://github.com/apache/spark/blob/master/bin/pyspark works for me to solve this issue, but I'm not sure if it's sustainable as I'm not familiar with the rest of the code... This is the relevant part of my Python environment: {code:none} ipython 4.1.2 py27_0 ipython-genutils 0.1.0 ipython_genutils 0.1.0 py27_0 ipywidgets 4.1.1 py27_0 ... jupyter 1.0.0 py27_1 jupyter-client 4.2.1 jupyter-console 4.1.1 jupyter-core 4.1.0 jupyter_client 4.2.1 py27_0 jupyter_console 4.1.1 py27_0 jupyter_core 4.1.0 py27_0 {code}
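For reference, `bin/pyspark` also honors the `PYSPARK_DRIVER_PYTHON` and `PYSPARK_DRIVER_PYTHON_OPTS` environment variables, so a user-side alternative to patching the script (assuming a Jupyter install on the PATH) would be:

```shell
# Use Jupyter as the PySpark driver without editing bin/pyspark,
# replacing the deprecated IPYTHON/IPYTHON_OPTS pair above.
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser"
./bin/pyspark
```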
[jira] [Created] (SPARK-13974) sub-query names do not need to be globally unique while generate SQL
Wenchen Fan created SPARK-13974: --- Summary: sub-query names do not need to be globally unique while generate SQL Key: SPARK-13974 URL: https://issues.apache.org/jira/browse/SPARK-13974 Project: Spark Issue Type: Improvement Components: SQL Reporter: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13360) pyspark related environment variable is not propagated to driver in yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-13360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-13360. Resolution: Fixed Assignee: Jeff Zhang Fix Version/s: 2.0.0 > pyspark related environment variable is not propagated to driver in > yarn-cluster mode > > > Key: SPARK-13360 > URL: https://issues.apache.org/jira/browse/SPARK-13360 > Project: Spark > Issue Type: Bug > Components: PySpark, YARN >Affects Versions: 1.6.0 >Reporter: Jeff Zhang >Assignee: Jeff Zhang > Fix For: 2.0.0 > > > Such as PYSPARK_DRIVER_PYTHON, PYSPARK_PYTHON, PYTHONHASHSEED.
[jira] [Updated] (SPARK-14001) support multi-children Union in SQLBuilder
[ https://issues.apache.org/jira/browse/SPARK-14001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-14001: --- Assignee: Wenchen Fan > support multi-children Union in SQLBuilder > -- > > Key: SPARK-14001 > URL: https://issues.apache.org/jira/browse/SPARK-14001 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13877) Consider removing Kafka modules from Spark / Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-13877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15200306#comment-15200306 ] Hari Shreedharan commented on SPARK-13877: -- You could have separate repos and separate releases, and keep the same package names, simply by doing sub-projects. Can you explain what the overhead is and what tools you are concerned about? > Consider removing Kafka modules from Spark / Spark Streaming > > > Key: SPARK-13877 > URL: https://issues.apache.org/jira/browse/SPARK-13877 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, Streaming >Affects Versions: 1.6.1 >Reporter: Hari Shreedharan > > Based on the discussion in the PR for SPARK-13843 > ([here|https://github.com/apache/spark/pull/11672#issuecomment-196553283]), > we should consider moving the Kafka modules out of Spark as well. > Providing newer functionality (like security) has become painful while > maintaining compatibility with older versions of Kafka. Moving this out > allows more flexibility, allowing users to mix and match Kafka and Spark > versions.
[jira] [Created] (SPARK-13993) PySpark ml.feature.RFormula/RFormulaModel support export/import
Xusen Yin created SPARK-13993: - Summary: PySpark ml.feature.RFormula/RFormulaModel support export/import Key: SPARK-13993 URL: https://issues.apache.org/jira/browse/SPARK-13993 Project: Spark Issue Type: Sub-task Components: ML, PySpark Reporter: Xusen Yin Priority: Minor Add save/load for RFormula and its model. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13937) PySpark ML JavaWrapper, variable _java_obj should not be static
[ https://issues.apache.org/jira/browse/SPARK-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13937: Assignee: (was: Apache Spark) > PySpark ML JavaWrapper, variable _java_obj should not be static > --- > > Key: SPARK-13937 > URL: https://issues.apache.org/jira/browse/SPARK-13937 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Reporter: Bryan Cutler >Priority: Minor > > In PySpark ML wrapper.py, the abstract class {{JavaWrapper}} has a static > variable {{_java_obj}}. This is meant to hold an instance of a companion > Java object. It seems as though it was made static accidentally because it > is never used, and all assignments done in derived classes are done to a > member variable with {{self._java_obj}}. This does not cause any problems > with the current functionality, but it should be changed so as not to cause > any confusion and misuse in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13955) Spark in yarn mode fails
[ https://issues.apache.org/jira/browse/SPARK-13955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15199198#comment-15199198 ] Sean Owen commented on SPARK-13955: --- Is this likely? The YARN tests succeed. There isn't detail here, like what you are running. > Spark in yarn mode fails > > > Key: SPARK-13955 > URL: https://issues.apache.org/jira/browse/SPARK-13955 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Jeff Zhang > > It seems the spark assembly jar is not uploaded to the AM. This may be a known issue > in the process of SPARK-11157; creating this ticket to track it. > [~vanzin] > {noformat} > 16/03/17 11:58:59 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive > is set, falling back to uploading libraries under SPARK_HOME. > 16/03/17 11:58:59 INFO Client: Uploading resource > file:/Users/jzhang/github/spark/lib/apache-rat-0.10.jar -> > hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0004/apache-rat-0.10.jar > 16/03/17 11:58:59 INFO Client: Uploading resource > file:/Users/jzhang/github/spark/lib/apache-rat-0.11.jar -> > hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0004/apache-rat-0.11.jar > 16/03/17 11:59:00 INFO Client: Uploading resource > file:/private/var/folders/dp/hmchg5dd3vbcvds26q91spdwgp/T/spark-36cacbad-ca5b-482b-8ca8-607499acaaba/__spark_conf__4427292248554277597.zip > -> > hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0004/__spark_conf__4427292248554277597.zip > {noformat} > message in AM container > {noformat} > Error: Could not find or load main class > org.apache.spark.deploy.yarn.ExecutorLauncher > {noformat}
[jira] [Assigned] (SPARK-13719) Bad JSON record raises java.lang.ClassCastException
[ https://issues.apache.org/jira/browse/SPARK-13719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13719: Assignee: (was: Apache Spark) > Bad JSON record raises java.lang.ClassCastException > > > Key: SPARK-13719 > URL: https://issues.apache.org/jira/browse/SPARK-13719 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2, 1.6.0 > Environment: OS X, Linux >Reporter: dmtran >Priority: Minor > > I have defined a JSON schema, using org.apache.spark.sql.types.StructType, > that expects this kind of record : > {noformat} > { > "request": { > "user": { > "id": 123 > } > } > } > {noformat} > There's a bad record in my dataset, that defines field "user" as an array, > instead of a JSON object : > {noformat} > { > "request": { > "user": [] > } > } > {noformat} > The following exception is raised because of that bad record : > {noformat} > Exception in thread "main" org.apache.spark.SparkException: Job aborted due > to stage failure: Task 7 in stage 0.0 failed 4 times, most recent failure: > Lost task 7.3 in stage 0.0 (TID 10, 192.168.1.170): > java.lang.ClassCastException: org.apache.spark.sql.types.GenericArrayData > cannot be cast to org.apache.spark.sql.catalyst.InternalRow > at > org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getStruct(rows.scala:50) > at > org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getStruct(rows.scala:247) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:67) > at > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:67) > at > org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:117) > at > 
org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:115) > at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:97) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {noformat} > Here's a code snippet that reproduces the exception : > {noformat} > import org.apache.spark.SparkContext > import org.apache.spark.rdd.RDD > import org.apache.spark.sql.{SQLContext, DataFrame} > import org.apache.spark.sql.hive.HiveContext > import org.apache.spark.sql.types.{StringType, StructField, StructType} > object Snippet { 
> def main(args : Array[String]): Unit = { > val sc = new SparkContext() > implicit val sqlContext = new HiveContext(sc) > val rdd: RDD[String] = sc.parallelize(Seq(badRecord)) > val df: DataFrame = sqlContext.read.schema(schema).json(rdd) > import sqlContext.implicits._ > df.select("request.user.id") > .filter($"id".isNotNull) > .count() > } > val badRecord = > s"""{ > | "request": { > |"user": [] > | } > |}""".stripMargin.replaceAll("\n", " ") // Convert the multiline > string to a signe line string > val schema = > StructType( > StructField("request",
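As an application-level guard (not a fix for the cast itself), malformed records can be filtered out before the JSON reader sees them; a minimal, hypothetical shape check in plain Python illustrates the idea:

```python
import json

def matches_shape(record, shape):
    """Return True if `record` (a parsed JSON value) structurally matches
    `shape`, where a shape is a nested dict of field name -> sub-shape and
    a leaf is the expected Python type."""
    if isinstance(shape, dict):
        return isinstance(record, dict) and all(
            k in record and matches_shape(record[k], v) for k, v in shape.items()
        )
    return isinstance(record, shape)

expected = {"request": {"user": {"id": int}}}

good = json.loads('{"request": {"user": {"id": 123}}}')
bad = json.loads('{"request": {"user": []}}')   # the bad record from the report

print(matches_shape(good, expected))  # True
print(matches_shape(bad, expected))   # False: "user" is a list, not an object
```

In Spark terms this would mean a `filter` over the string RDD before `sqlContext.read.json`, dropping records whose parsed shape does not match the declared schema.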
[jira] [Commented] (SPARK-13864) TPCDS query 74 returns wrong results compared to TPC official result set
[ https://issues.apache.org/jira/browse/SPARK-13864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15198602#comment-15198602 ] Xiao Li commented on SPARK-13864: - This is the same issue as SPARK-13862. I think we can close this. Thanks! > TPCDS query 74 returns wrong results compared to TPC official result set > - > > Key: SPARK-13864 > URL: https://issues.apache.org/jira/browse/SPARK-13864 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: JESSE CHEN > Labels: tpcds-result-mismatch > > Testing Spark SQL using TPC queries. Query 74 returns wrong results compared > to official result set. This is at 1GB SF (validation run). > Spark SQL has right answer but in wrong order (and there is an 'order by' in > the query). > Actual results: > {noformat} > [BLEIBAAA,Paula,Wakefield] > [DFIEBAAA,John,Gray] > [OCLBBAAA,null,null] > [PKBCBAAA,Andrea,White] > [EJDL,Alice,Wright] > [FACE,Priscilla,Miller] > [LFKK,Ignacio,Miller] > [LJNCBAAA,George,Gamez] > [LIOP,Derek,Allen] > [EADJ,Ruth,Carroll] > [JGMM,Richard,Larson] > [PKIK,Wendy,Horvath] > [FJHF,Larissa,Roy] > [EPOG,Felisha,Mendes] > [EKJL,Aisha,Carlson] > [HNFH,Rebecca,Wilson] > [IBFCBAAA,Ruth,Grantham] > [OPDL,Ann,Pence] > [NIPL,Eric,Lawrence] > [OCIC,Zachary,Pennington] > [OFLC,James,Taylor] > [GEHI,Tyler,Miller] > [CADP,Cristobal,Thomas] > [JIAL,Santos,Gutierrez] > [PMMBBAAA,Paul,Jordan] > [DIIO,David,Carroll] > [DFKABAAA,Latoya,Craft] > [HMOI,Grace,Henderson] > [PPIBBAAA,Candice,Lee] > [JONHBAAA,Warren,Orozco] > [GNDA,Terry,Mcdowell] > [CIJM,Elizabeth,Thomas] > [DIJGBAAA,Ruth,Sanders] > [NFBDBAAA,Vernice,Fernandez] > [IDKF,Michael,Mack] > [IMHB,Kathy,Knowles] > [LHMC,Brooke,Nelson] > [CFCGBAAA,Marcus,Sanders] > [NJHCBAAA,Christopher,Schreiber] > [PDFB,Terrance,Banks] > [ANFA,Philip,Banks] > [IADEBAAA,Diane,Aldridge] > [ICHF,Linda,Mccoy] > [CFEN,Christopher,Dawson] > [KOJJ,Gracie,Mendoza] > [FOJA,Don,Castillo] > [FGPG,Albert,Wadsworth] > [KJBK,Georgia,Scott] > 
[EKFP,Annika,Chin] > [IBAEBAAA,Sandra,Wilson] > [MFFL,Margret,Gray] > [KNAK,Gladys,Banks] > [CJDI,James,Kerr] > [OBADBAAA,Elizabeth,Burnham] > [AMGD,Kenneth,Harlan] > [HJLA,Audrey,Beltran] > [AOPFBAAA,Jerry,Fields] > [CNAGBAAA,Virginia,May] > [HGOABAAA,Sonia,White] > [KBCABAAA,Debra,Bell] > [NJAG,Allen,Hood] > [MMOBBAAA,Margaret,Smith] > [NGDBBAAA,Carlos,Jewell] > [FOGI,Michelle,Greene] > [JEKFBAAA,Norma,Burkholder] > [OCAJ,Jenna,Staton] > [PFCL,Felicia,Neville] > [DLHBBAAA,Henry,Bertrand] > [DBEFBAAA,Bennie,Bowers] > [DCKO,Robert,Gonzalez] > [KKGE,Katie,Dunbar] > [GFMDBAAA,Kathleen,Gibson] > [IJEM,Charlie,Cummings] > [KJBL,Kerry,Davis] > [JKBN,Julie,Kern] > [MDCA,Louann,Hamel] > [EOAK,Molly,Benjamin] > [IBHH,Jennifer,Ballard] > [PJEN,Ashley,Norton] > [KLHHBAAA,Manuel,Castaneda] > [IMHHBAAA,Lillian,Davidson] > [GHPBBAAA,Nick,Mendez] > [BNBB,Irma,Smith] > [FBAH,Michael,Williams] > [PEHEBAAA,Edith,Molina] > [FMHI,Emilio,Darling] > [KAEC,Milton,Mackey] > [OCDJ,Nina,Sanchez] > [FGIG,Eduardo,Miller] > [FHACBAAA,null,null] > [HMJN,Ryan,Baptiste] > [HHCABAAA,William,Stewart] > {noformat} > Expected results: > {noformat} > +--+-++ > | CUSTOMER_ID | CUSTOMER_FIRST_NAME | CUSTOMER_LAST_NAME | > +--+-++ > | AMGD | Kenneth | Harlan | > | ANFA | Philip | Banks | > | AOPFBAAA | Jerry | Fields | > | BLEIBAAA | Paula | Wakefield | > | BNBB | Irma| Smith | > | CADP | Cristobal | Thomas | > | CFCGBAAA | Marcus | Sanders| > | CFEN |
[jira] [Updated] (SPARK-12719) SQL generation support for generators (including UDTF)
[ https://issues.apache.org/jira/browse/SPARK-12719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-12719: --- Assignee: Wenchen Fan > SQL generation support for generators (including UDTF) > -- > > Key: SPARK-12719 > URL: https://issues.apache.org/jira/browse/SPARK-12719 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Wenchen Fan > > {{HiveCompatibilitySuite}} can be useful for bootstrapping test coverage. > Please refer to SPARK-11012 for more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13865) TPCDS query 87 returns wrong results compared to TPC official result set
[ https://issues.apache.org/jira/browse/SPARK-13865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15200886#comment-15200886 ] JESSE CHEN commented on SPARK-13865: You rock! > TPCDS query 87 returns wrong results compared to TPC official result set > - > > Key: SPARK-13865 > URL: https://issues.apache.org/jira/browse/SPARK-13865 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: JESSE CHEN > Labels: tpcds-result-mismatch > > Testing Spark SQL using TPC queries. Query 87 returns wrong results compared > to official result set. This is at 1GB SF (validation run). > SparkSQL returns count of 47555, answer set expects 47298. > Actual results: > {noformat} > [47555] > {noformat} > {noformat} > Expected: > +---+ > | 1 | > +---+ > | 47298 | > +---+ > {noformat} > Query used: > {noformat} > -- start query 87 in stream 0 using template query87.tpl and seed > QUALIFICATION > select count(*) > from > (select distinct c_last_name as cln1, c_first_name as cfn1, d_date as > ddate1, 1 as notnull1 >from store_sales > JOIN date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk > JOIN customer ON store_sales.ss_customer_sk = customer.c_customer_sk >where > d_month_seq between 1200 and 1200+11 >) tmp1 >left outer join > (select distinct c_last_name as cln2, c_first_name as cfn2, d_date as > ddate2, 1 as notnull2 >from catalog_sales > JOIN date_dim ON catalog_sales.cs_sold_date_sk = date_dim.d_date_sk > JOIN customer ON catalog_sales.cs_bill_customer_sk = > customer.c_customer_sk >where > d_month_seq between 1200 and 1200+11 >) tmp2 > on (tmp1.cln1 = tmp2.cln2) > and (tmp1.cfn1 = tmp2.cfn2) > and (tmp1.ddate1= tmp2.ddate2) >left outer join > (select distinct c_last_name as cln3, c_first_name as cfn3 , d_date as > ddate3, 1 as notnull3 >from web_sales > JOIN date_dim ON web_sales.ws_sold_date_sk = date_dim.d_date_sk > JOIN customer ON web_sales.ws_bill_customer_sk = > customer.c_customer_sk >where > d_month_seq between 
1200 and 1200+11 >) tmp3 > on (tmp1.cln1 = tmp3.cln3) > and (tmp1.cfn1 = tmp3.cfn3) > and (tmp1.ddate1= tmp3.ddate3) > where > notnull2 is null and notnull3 is null > ; > -- end query 87 in stream 0 using template query87.tpl > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13886) ArrayType of BinaryType not supported in Row.equals method
[ https://issues.apache.org/jira/browse/SPARK-13886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15198785#comment-15198785 ] Rishabh Bhardwaj commented on SPARK-13886: -- If we go through the implementation of `a.equals(b)` in Row, this comparison boils down to comparing `Array(1.toByte) == Array(1.toByte)`, and since Scala uses Java arrays, which have JVM reference equality, this comparison returns false. This is not the case if you use List. This is explained here in detail: http://goo.gl/1zVjnx Correct me if I am going off track here. > ArrayType of BinaryType not supported in Row.equals method > --- > > Key: SPARK-13886 > URL: https://issues.apache.org/jira/browse/SPARK-13886 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: MahmoudHanafy >Priority: Minor > > There are multiple types that are supported by Spark SQL. One of them is > ArrayType(Seq), which can be of any element type, > so it can be BinaryType(Array\[Byte\]). > In the equals method in the Row class, there is no handling for ArrayType of > BinaryType. > So for example: > {code:xml} > val a = Row( Seq( Array(1.toByte) ) ) > val b = Row( Seq( Array(1.toByte) ) ) > a.equals(b) // this will return false > {code} > Also, this doesn't work for MapType of BinaryType. > {code:xml} > val a = Row( Map(1 -> Array(1.toByte) ) ) > val b = Row( Map(1 -> Array(1.toByte) ) ) > a.equals(b) // this will return false > {code} > Question1: Can the key in MapType be of BinaryType ? > Question2: Isn't there another way to handle BinaryType by using a Scala type > instead of Array ? > I want to contribute by fixing this issue.
[jira] [Created] (SPARK-13942) Remove Shark-related docs and visibility for 2.x
Dongjoon Hyun created SPARK-13942: - Summary: Remove Shark-related docs and visibility for 2.x Key: SPARK-13942 URL: https://issues.apache.org/jira/browse/SPARK-13942 Project: Spark Issue Type: Task Components: Documentation, Spark Core Reporter: Dongjoon Hyun Priority: Minor `Shark` was merged into `Spark SQL` in [July 2014|https://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html]. The following seem to be the only remaining legacy. *Migration Guide* {code:title=sql-programming-guide.md|borderStyle=solid} - ## Migration Guide for Shark Users - ... - ### Scheduling - ... - ### Reducer number - ... - ### Caching {code} *SparkEnv visibility and comments* {code:title=SparkEnv.scala|borderStyle=solid} - * - * NOTE: This is not intended for external use. This is exposed for Shark and may be made private - * in a future release. */ @DeveloperApi -class SparkEnv ( +private[spark] class SparkEnv ( {code} For Spark 2.x, we had better clean up those docs and comments in any case. However, the visibility of the `SparkEnv` class might be controversial. As a first attempt, this issue proposes to change both according to the note (`This is exposed for Shark`). During the review process, the visibility change might be dropped.
[jira] [Commented] (SPARK-14001) support multi-children Union in SQLBuilder
[ https://issues.apache.org/jira/browse/SPARK-14001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15201191#comment-15201191 ] Apache Spark commented on SPARK-14001: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/11818 > support multi-children Union in SQLBuilder > -- > > Key: SPARK-14001 > URL: https://issues.apache.org/jira/browse/SPARK-14001 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13664) Simplify and Speedup HadoopFSRelation
[ https://issues.apache.org/jira/browse/SPARK-13664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13664: Assignee: Michael Armbrust (was: Apache Spark) > Simplify and Speedup HadoopFSRelation > - > > Key: SPARK-13664 > URL: https://issues.apache.org/jira/browse/SPARK-13664 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Michael Armbrust >Assignee: Michael Armbrust >Priority: Blocker > Fix For: 2.0.0 > > > A majority of Spark SQL queries likely run though {{HadoopFSRelation}}, > however there are currently several complexity and performance problems with > this code path: > - The class mixes the concerns of file management, schema reconciliation, > scan building, bucketing, partitioning, and writing data. > - For very large tables, we are broadcasting the entire list of files to > every executor. [SPARK-11441] > - For partitioned tables, we always do an extra projection. This results > not only in a copy, but undoes much of the performance gains that we are > going to get from vectorized reads. > This is an umbrella ticket to track a set of improvements to this codepath. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org