[jira] [Commented] (SPARK-13861) TPCDS query 40 returns wrong results compared to TPC official result set

2016-03-19 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15198526#comment-15198526
 ] 

Xiao Li commented on SPARK-13861:
-

Great job! I am just wondering: is cs_sales_price the only column with a wrong definition? 

> TPCDS query 40 returns wrong results compared to TPC official result set 
> -
>
> Key: SPARK-13861
> URL: https://issues.apache.org/jira/browse/SPARK-13861
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: JESSE CHEN
>  Labels: tpcds-result-mismatch
>
> Testing Spark SQL using TPC queries. Query 40 returns wrong results compared 
> to official result set. This is at 1GB SF (validation run).
> SparkSQL is missing at least one row (grep for ABBD); I believe 5 
> rows are missing in total.
> Actual results:
> {noformat}
> [TN,AABD,0.0,-82.060899353]
> [TN,AACD,-216.54000234603882,158.0399932861328]
> [TN,AAHD,186.54999542236328,0.0]
> [TN,AALA,0.0,48.2254223633]
> [TN,ACGC,63.67999863624573,0.0]
> [TN,ACHC,102.6830517578,51.8838964844]
> [TN,ACKC,128.9235150146,44.8169482422]
> [TN,ACLD,205.43999433517456,-948.619930267334]
> [TN,ACOB,207.32000732421875,24.88389648438]
> [TN,ACPD,87.75,53.9900016784668]
> [TN,ADGB,44.310001373291016,222.4800033569336]
> [TN,ADKB,0.0,-471.8699951171875]
> [TN,AEAD,58.2400016784668,0.0]
> [TN,AEOC,19.9084741211,214.7076293945]
> [TN,AFAC,271.8199977874756,163.1699981689453]
> [TN,AFAD,2.349046325684,28.3169482422]
> [TN,AFDC,-378.0499496459961,-303.26999282836914]
> [TN,AGID,307.6099967956543,-19.29915527344]
> [TN,AHDE,80.574468689,-476.7200012207031]
> [TN,AHHA,8.27457763672,155.1276565552]
> [TN,AHJB,39.23999857902527,0.0]
> [TN,AIEC,82.3675750732,3.910858306885]
> [TN,AIEE,20.39618530273,-151.08999633789062]
> [TN,AIMC,24.46313354492,-150.330517578]
> [TN,AJAC,49.0915258789,82.084741211]
> [TN,AJCA,121.18000221252441,63.779998779296875]
> [TN,AJKB,27.94534057617,8.97267028809]
> [TN,ALBE,88.2599983215332,30.22542236328]
> [TN,ALCE,93.5245776367,92.0198092651]
> [TN,ALEC,64.179019165,15.1584741211]
> [TN,ALNB,4.19809265137,148.27000427246094]
> [TN,AMBE,28.44534057617,0.0]
> [TN,AMPB,0.0,131.92999839782715]
> [TN,ANFE,0.0,-137.3400115966797]
> [TN,AOIB,150.40999603271484,254.288058548]
> [TN,APJB,45.2745776367,334.482015991]
> [TN,APLA,50.2076293945,29.150001049041748]
> [TN,APLD,0.0,32.3838964844]
> [TN,BAPD,93.41999816894531,145.8699951171875]
> [TN,BBID,296.774577637,30.95084472656]
> [TN,BDCE,-1771.0800704956055,-54.779998779296875]
> [TN,BDDD,111.12000274658203,280.5899963378906]
> [TN,BDJA,0.0,79.5423706055]
> [TN,BEFD,0.0,3.429475479126]
> [TN,BEOD,269.838964844,297.5800061225891]
> [TN,BFMB,110.82999801635742,-941.4000930786133]
> [TN,BFNA,47.8661035156,0.0]
> [TN,BFOC,46.3415258789,83.5245776367]
> [TN,BHPC,27.378392334,77.61999893188477]
> [TN,BIDB,196.6199951171875,5.57171661377]
> [TN,BIGB,425.3399963378906,0.0]
> [TN,BIJB,209.6300048828125,0.0]
> [TN,BJFE,7.32923706055,55.1584741211]
> [TN,BKFA,0.0,138.14000129699707]
> [TN,BKMC,27.17076293945,54.970001220703125]
> [TN,BLDE,170.28999400138855,0.0]
> [TN,BNHB,58.0594277954,-337.8899841308594]
> [TN,BNID,54.41525878906,35.01504089355]
> [TN,BNLA,0.0,168.37999629974365]
> [TN,BNLD,0.0,96.4084741211]
> [TN,BNMC,202.40999698638916,49.52999830245972]
> [TN,BOCC,4.73019073486,69.83999633789062]
> [TN,BOMB,63.66999816894531,163.49000668525696]
> [TN,CAAA,121.91000366210938,0.0]
> [TN,CAAD,-1107.6099338531494,0.0]
> [TN,CAJC,115.8046594238,173.0519073486]
> [TN,CBCD,18.94534057617,226.38000106811523]
> [TN,CBFA,0.0,97.41000366210938]
> [TN,CBIA,2.14104904175,84.66000366210938]
> [TN,CBPB,95.44000244140625,26.6830517578]
> 

[jira] [Created] (SPARK-13943) The behavior of sum(booleantype) in Spark DataFrames is not intuitive

2016-03-19 Thread Wes McKinney (JIRA)
Wes McKinney created SPARK-13943:


 Summary: The behavior of sum(booleantype) in Spark DataFrames is 
not intuitive
 Key: SPARK-13943
 URL: https://issues.apache.org/jira/browse/SPARK-13943
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Reporter: Wes McKinney


In NumPy and pandas, summing boolean data produces an integer indicating the 
number of True values:

{code}
In [1]: import numpy as np

In [2]: arr = np.random.randn(1000000)

In [3]: (arr > 0).sum()
Out[3]: 499065
{code}

In PySpark, {{sql.functions.sum(expr)}} results in an error:

{code}
AnalysisException: u"cannot resolve 'sum((`data0` > CAST(0 AS DOUBLE)))' due to 
data type mismatch: function sum requires numeric types, not BooleanType;"
{code}

FWIW, R is the same:

{code}
> sum(rnorm(1000000) > 0)
[1] 499139
{code}

Spark should consider emulating the behavior of R and Python in those 
environments. 
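
In the meantime, a minimal workaround sketch (assuming a DataFrame {{df}} with a numeric 
column {{data0}}, as in the error above): cast the boolean expression to an integer before summing.

{code}
from pyspark.sql import functions as F

# Cast the boolean predicate to int so sum() counts the True values.
df.agg(F.sum((df.data0 > 0).cast("int")).alias("n_positive")).show()
{code}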



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14011) Enable `LineLength` Java checkstyle rule

2016-03-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14011:


Assignee: Apache Spark

> Enable `LineLength` Java checkstyle rule
> 
>
> Key: SPARK-14011
> URL: https://issues.apache.org/jira/browse/SPARK-14011
> Project: Spark
>  Issue Type: Task
>  Components: Examples
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Minor
>
> [Spark Coding Style 
> Guide|https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide]
>  has a 100-character limit on lines, but it has been disabled for Java since 11/09/15.
> This issue enables the *LineLength* checkstyle rule again. It also 
> introduces the *RedundantImport* and *RedundantModifier* rules.
> {code:title=dev/checkstyle.xml|borderStyle=solid}
> -        <!--
>          <module name="LineLength">
>              <property name="max" value="100"/>
>          </module>
> -        -->
> @@ -167,5 +164,7 @@
> +        <module name="RedundantImport"/>
> +        <module name="RedundantModifier"/>
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-13963) Add binary toggle Param to ml.HashingTF

2016-03-19 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-13963:
---
Comment: was deleted

(was: Sure, assigned to you.)

> Add binary toggle Param to ml.HashingTF
> ---
>
> Key: SPARK-13963
> URL: https://issues.apache.org/jira/browse/SPARK-13963
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Nick Pentreath
>Assignee: Bryan Cutler
>Priority: Trivial
>
> It would be handy to add a binary toggle Param to {{HashingTF}}, as in the 
> scikit-learn one: 
> http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html
> If set, then all non-zero counts will be set to 1.
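
For reference, a rough scikit-learn sketch of the analogous toggle (parameters from the 
scikit-learn docs linked above, not from Spark):

{code}
from sklearn.feature_extraction.text import HashingVectorizer

# binary=True clamps every non-zero term count to 1, which is the behavior
# proposed for the ml.HashingTF Param.
vec = HashingVectorizer(n_features=16, binary=True, norm=None)
print(vec.transform(["spark spark spark hashing"]).toarray())
{code}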



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13865) TPCDS query 87 returns wrong results compared to TPC official result set

2016-03-19 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15200821#comment-15200821
 ] 

Xiao Li commented on SPARK-13865:
-

The query I posted here was downloaded from the official website; it is in the 
zip file "tpc-ds-tools_v2.1.0.zip". I did not see any variant for 
query 87.

> TPCDS query 87 returns wrong results compared to TPC official result set 
> -
>
> Key: SPARK-13865
> URL: https://issues.apache.org/jira/browse/SPARK-13865
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: JESSE CHEN
>  Labels: tpcds-result-mismatch
>
> Testing Spark SQL using TPC queries. Query 87 returns wrong results compared 
> to official result set. This is at 1GB SF (validation run).
> SparkSQL returns a count of 47555; the answer set expects 47298.
> Actual results:
> {noformat}
> [47555]
> {noformat}
> {noformat}
> Expected:
> +-------+
> |   1   |
> +-------+
> | 47298 |
> +-------+
> {noformat}
> Query used:
> {noformat}
> -- start query 87 in stream 0 using template query87.tpl and seed 
> QUALIFICATION
> select count(*) 
> from 
>  (select distinct c_last_name as cln1, c_first_name as cfn1, d_date as 
> ddate1, 1 as notnull1
>from store_sales
> JOIN date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk
> JOIN customer ON store_sales.ss_customer_sk = customer.c_customer_sk
>where
>  d_month_seq between 1200 and 1200+11
>) tmp1
>left outer join
>   (select distinct c_last_name as cln2, c_first_name as cfn2, d_date as 
> ddate2, 1 as notnull2
>from catalog_sales
> JOIN date_dim ON catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
> JOIN customer ON catalog_sales.cs_bill_customer_sk = 
> customer.c_customer_sk
>where 
>  d_month_seq between 1200 and 1200+11
>) tmp2 
>   on (tmp1.cln1 = tmp2.cln2)
>   and (tmp1.cfn1 = tmp2.cfn2)
>   and (tmp1.ddate1= tmp2.ddate2)
>left outer join
>   (select distinct c_last_name as cln3, c_first_name as cfn3 , d_date as 
> ddate3, 1 as notnull3
>from web_sales
> JOIN date_dim ON web_sales.ws_sold_date_sk = date_dim.d_date_sk
> JOIN customer ON web_sales.ws_bill_customer_sk = 
> customer.c_customer_sk
>where 
>  d_month_seq between 1200 and 1200+11
>) tmp3 
>   on (tmp1.cln1 = tmp3.cln3)
>   and (tmp1.cfn1 = tmp3.cfn3)
>   and (tmp1.ddate1= tmp3.ddate3)
> where  
> notnull2 is null and notnull3 is null  
> ;
> -- end query 87 in stream 0 using template query87.tpl
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13938) word2phrase feature created in ML

2016-03-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13938:


Assignee: (was: Apache Spark)

> word2phrase feature created in ML
> -
>
> Key: SPARK-13938
> URL: https://issues.apache.org/jira/browse/SPARK-13938
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Steve Weng
>Priority: Critical
>   Original Estimate: 840h
>  Remaining Estimate: 840h
>
> I implemented word2phrase (see http://arxiv.org/pdf/1310.4546.pdf), which 
> transforms a sentence of words into one where certain consecutive 
> words are concatenated, using a training model/estimator (e.g. "I went to 
> New York" becomes "I went to new_york").



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14010) ColumnPruning is conflict with PushPredicateThroughProject

2016-03-19 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-14010:
---
Description: ColumnPruning will insert a Project before Filter, but 

> ColumnPruning is conflict with PushPredicateThroughProject
> --
>
> Key: SPARK-14010
> URL: https://issues.apache.org/jira/browse/SPARK-14010
> Project: Spark
>  Issue Type: Bug
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> ColumnPruning will insert a Project before Filter, but 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13979) Killed executor is respawned without AWS keys in standalone spark cluster

2016-03-19 Thread Allen George (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allen George updated SPARK-13979:
-
Description: 
I'm having a problem where respawning a failed executor during a job that 
reads/writes parquet on S3 causes subsequent tasks to fail because of missing 
AWS keys.

h4. Setup:

I'm using Spark 1.5.2 with Hadoop 2.7 and running experiments on a simple 
standalone cluster:

1 master
2 workers

My application is co-located on the master machine, while the two workers are 
on two other machines (one worker per machine). All machines are running in 
EC2. I've configured my setup so that my application executes its task on two 
executors (one executor per worker).

h4. Application:

My application reads and writes parquet files on S3. I set the AWS keys on the 
SparkContext by doing:

val sc = new SparkContext()
val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3n.awsAccessKeyId", "SOME_KEY")
hadoopConf.set("fs.s3n.awsSecretAccessKey", "SOME_SECRET")

At this point I'm done, and I go ahead and use "sc".

h4. Issue:

I can read and write parquet files without a problem with this setup. *BUT* if 
an executor dies during a job and is respawned by a worker, tasks fail with the 
following error:

"Caused by: java.lang.IllegalArgumentException: AWS Access Key ID and Secret 
Access Key must be specified as the username or password (respectively) of a 
s3n URL, or by setting the {{fs.s3n.awsAccessKeyId}} or 
{{fs.s3n.awsSecretAccessKey}} properties (respectively)."

h4. Basic analysis

I think I've traced this down to the following:

SparkHadoopUtil is initialized with an empty {{SparkConf}}. Later, classes like 
{{DataSourceStrategy}} simply call {{SparkHadoopUtil.get.conf}} and access the 
(now invalid; missing various properties) {{HadoopConfiguration}} that's built 
from this empty {{SparkConf}} object. It's unclear to me why this is done, and 
it seems that the code as written would cause broken results anytime callers 
use {{SparkHadoopUtil.get.conf}} directly.

  was:
I'm having a problem where respawning a failed executor during a job that 
reads/writes parquet on S3 causes subsequent tasks to fail because of missing 
AWS keys.

Setup:

I'm using Spark 1.5.2 with Hadoop 2.7 and running experiments on a simple 
standalone cluster:

1 master
2 workers

My application is co-located on the master machine, while the two workers are 
on two other machines (one worker per machine). All machines are running in 
EC2. I've configured my setup so that my application executes its task on two 
executors (one executor per worker).

Application:

My application reads and writes parquet files on S3. I set the AWS keys on the 
SparkContext by doing:

val sc = new SparkContext()
val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3n.awsAccessKeyId", "SOME_KEY")
hadoopConf.set("fs.s3n.awsSecretAccessKey", "SOME_SECRET")

At this point I'm done, and I go ahead and use "sc".

Issue:

I can read and write parquet files without a problem with this setup. *BUT* if 
an executor dies during a job and is respawned by a worker, tasks fail with the 
following error:

"Caused by: java.lang.IllegalArgumentException: AWS Access Key ID and Secret 
Access Key must be specified as the username or password (respectively) of a 
s3n URL, or by setting the {{fs.s3n.awsAccessKeyId}} or 
{{fs.s3n.awsSecretAccessKey}} properties (respectively)."

I think I've traced this down to the following:

SparkHadoopUtil is initialized with an empty {{SparkConf}}. Later, classes like 
{{DataSourceStrategy}} simply call {{SparkHadoopUtil.get.conf}} and access the 
(now invalid; missing various properties) {{HadoopConfiguration}} that's built 
from this empty {{SparkConf}} object. It's unclear to me why this is done, and 
it seems that the code as written would cause broken results anytime callers 
use {{SparkHadoopUtil.get.conf}} directly.


> Killed executor is respawned without AWS keys in standalone spark cluster
> -
>
> Key: SPARK-13979
> URL: https://issues.apache.org/jira/browse/SPARK-13979
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2
> Environment: I'm using Spark 1.5.2 with Hadoop 2.7 and running 
> experiments on a simple standalone cluster:
> 1 master
> 2 workers
> All ubuntu 14.04 with Java 8/Scala 2.10
>Reporter: Allen George
>
> I'm having a problem where respawning a failed executor during a job that 
> reads/writes parquet on S3 causes subsequent tasks to fail because of missing 
> AWS keys.
> h4. Setup:
> I'm using Spark 1.5.2 with Hadoop 2.7 and running experiments on a simple 
> standalone cluster:
> 1 master
> 2 workers
> My application is co-located on the master machine, while the two workers are 
> on two other machines 

[jira] [Resolved] (SPARK-13816) Add parameter checks for algorithms in Graphx

2016-03-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-13816.
-
   Resolution: Fixed
 Assignee: zhengruifeng
Fix Version/s: 2.0.0

> Add parameter checks for algorithms in Graphx 
> --
>
> Key: SPARK-13816
> URL: https://issues.apache.org/jira/browse/SPARK-13816
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Trivial
> Fix For: 2.0.0
>
>
> Add parameter checks in Graphx-Algorithms:
> maxIterations in Pregel 
> maxSteps in LabelPropagation
> numIter,resetProb,tol in PageRank
> maxIters,maxVal,minVal in SVDPlusPlus



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13901) We get wrong logdebug information when jump to the next locality level.

2016-03-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-13901.
---
   Resolution: Fixed
Fix Version/s: 1.6.2
   2.0.0

Issue resolved by pull request 11719
[https://github.com/apache/spark/pull/11719]

> We get wrong logdebug information when jump to the next locality level.
> ---
>
> Key: SPARK-13901
> URL: https://issues.apache.org/jira/browse/SPARK-13901
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 1.6.1
>Reporter: yaoyin
>Assignee: yaoyin
>Priority: Trivial
> Fix For: 2.0.0, 1.6.2
>
>
> In getAllowedLocalityLevel method of TaskSetManager,we get wrong logDebug 
> information when jump to the next locality level.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14014) Replace existing analysis.Catalog with SessionCatalog

2016-03-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15202318#comment-15202318
 ] 

Apache Spark commented on SPARK-14014:
--

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/11836

> Replace existing analysis.Catalog with SessionCatalog
> -
>
> Key: SPARK-14014
> URL: https://issues.apache.org/jira/browse/SPARK-14014
> Project: Spark
>  Issue Type: Bug
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> As of this moment, there exist many catalogs in Spark. For Spark 2.0, we will 
> have two high level catalogs only: SessionCatalog and ExternalCatalog. 
> SessionCatalog (implemented in SPARK-13923) keeps track of temporary 
> functions and tables and delegates other operations to ExternalCatalog.
> At the same time, there's this legacy catalog called `analysis.Catalog` that 
> also tracks temporary functions and tables. The goal is to get rid of this 
> legacy catalog and replace it with SessionCatalog, which is the new thing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13905) Change signature of as.data.frame() to be consistent with the R base package

2016-03-19 Thread Sun Rui (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sun Rui updated SPARK-13905:

Description: (was: SparkR provides a method as.data.frame() to collect 
a SparkR DataFrame into a local data.frame. But it conflicts with the method of 
the same name in the R base package.

For example,
{code}
code as follows - 
countData <- matrix(1:100,ncol=4)
condition <- factor(c("A","A","B","B"))
dds <- DESeqDataSetFromMatrix(countData, DataFrame(condition), ~ condition)

This works if I don't initialize the SparkR environment. 
If I do library(SparkR) and sqlContext <- sparkRSQL.init(sc), it gives the 
following error:

> dds <- DESeqDataSetFromMatrix(countData, as.data.frame(condition), ~ 
> condition)
Error in DataFrame(colData, row.names = rownames(colData)) : 
  cannot coerce class "data.frame" to a DataFrame
{code}

The implementation of as.data.frame() in SparkR can be improved to avoid 
conflict with those in the R base package.)

> Change signature of as.data.frame() to be consistent with the R base package
> 
>
> Key: SPARK-13905
> URL: https://issues.apache.org/jira/browse/SPARK-13905
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 1.6.1
>Reporter: Sun Rui
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13964) Feature hashing improvements

2016-03-19 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-13964:
---
Priority: Minor  (was: Major)

> Feature hashing improvements
> 
>
> Key: SPARK-13964
> URL: https://issues.apache.org/jira/browse/SPARK-13964
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Reporter: Nick Pentreath
>Priority: Minor
>
> Investigate improvements to Spark ML feature hashing (see e.g. 
> http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html#sklearn.feature_extraction.FeatureHasher).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13963) Add binary toggle Param to ml.HashingTF

2016-03-19 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-13963:
---
Assignee: Bryan Cutler

> Add binary toggle Param to ml.HashingTF
> ---
>
> Key: SPARK-13963
> URL: https://issues.apache.org/jira/browse/SPARK-13963
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Nick Pentreath
>Assignee: Bryan Cutler
>Priority: Trivial
>
> It would be handy to add a binary toggle Param to {{HashingTF}}, as in the 
> scikit-learn one: 
> http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html
> If set, then all non-zero counts will be set to 1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12789) Support order by position in SQL

2016-03-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-12789:

Description: 
This is to support order by position in SQL, e.g.

{noformat}
select c1, c2, c3 from tbl order by 1, 3
{noformat}

should be equivalent to

{noformat}
select c1, c2, c3 from tbl order by c1, c3
{noformat}




  was: A number in ORDER BY is treated as a constant expression at the moment. I guess 
it would be good to let users specify a column by index, which has been 
supported in Hive 0.11.0 and later. 


> Support order by position in SQL
> 
>
> Key: SPARK-12789
> URL: https://issues.apache.org/jira/browse/SPARK-12789
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: zhichao-li
>Priority: Minor
>
> This is to support order by position in SQL, e.g.
> {noformat}
> select c1, c2, c3 from tbl order by 1, 3
> {noformat}
> should be equivalent to
> {noformat}
> select c1, c2, c3 from tbl order by c1, c3
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14010) ColumnPruning is conflict with PushPredicateThroughProject

2016-03-19 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-14010:
---
Description: ColumnPruning will insert a Project before Filter, but 
PushPredicateThroughProject will move the Filter before Project, they make the 
optimizer not stable.  (was: ColumnPruning will insert a Project before Filter, 
but )

> ColumnPruning is conflict with PushPredicateThroughProject
> --
>
> Key: SPARK-14010
> URL: https://issues.apache.org/jira/browse/SPARK-14010
> Project: Spark
>  Issue Type: Bug
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> ColumnPruning will insert a Project before Filter, but 
> PushPredicateThroughProject will move the Filter before Project, they make 
> the optimizer not stable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13976) do not remove sub-queries added by user when generate SQL

2016-03-19 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-13976:
---

 Summary: do not remove sub-queries added by user when generate SQL
 Key: SPARK-13976
 URL: https://issues.apache.org/jira/browse/SPARK-13976
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13951) PySpark ml.pipeline support export/import - nested Piplines

2016-03-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13951:


Assignee: Apache Spark

> PySpark ml.pipeline support export/import - nested Piplines
> ---
>
> Key: SPARK-13951
> URL: https://issues.apache.org/jira/browse/SPARK-13951
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13957) Support group by ordinal in SQL

2016-03-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15203089#comment-15203089
 ] 

Apache Spark commented on SPARK-13957:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/11846

> Support group by ordinal in SQL
> ---
>
> Key: SPARK-13957
> URL: https://issues.apache.org/jira/browse/SPARK-13957
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>
> This is to support group by ordinal (position) in SQL, e.g.
> {noformat}
> select c1, c2, c3, sum(*) from tbl group by 1, 3, c4
> {noformat}
> should be equivalent to
> {noformat}
> select c1, c2, c3, sum(*) from tbl group by c1, c3, c4
> {noformat}
> We only convert integer literals (not foldable expressions).
> For positions that are aggregate functions, an analysis exception should be 
> thrown, e.g. in postgres:
> {noformat}
> rxin=# select 'one', 'two', count(*) from r1 group by 1, 3;
> ERROR:  aggregate functions are not allowed in GROUP BY
> LINE 1: select 'one', 'two', count(*) from r1 group by 1, 3;
>  ^
> {noformat}
> This should be controlled by config option spark.sql.groupByOrdinal.
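
A hypothetical usage sketch once the feature lands, using the config name from the 
description (the exact API and default may differ):

{code}
sqlContext.setConf("spark.sql.groupByOrdinal", "true")
sqlContext.sql("SELECT c1, c2, count(*) FROM tbl GROUP BY 1, 2").show()
{code}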



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13957) Support group by ordinal in SQL

2016-03-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13957:


Assignee: (was: Apache Spark)

> Support group by ordinal in SQL
> ---
>
> Key: SPARK-13957
> URL: https://issues.apache.org/jira/browse/SPARK-13957
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>
> This is to support group by ordinal (position) in SQL, e.g.
> {noformat}
> select c1, c2, c3, sum(*) from tbl group by 1, 3, c4
> {noformat}
> should be equivalent to
> {noformat}
> select c1, c2, c3, sum(*) from tbl group by c1, c3, c4
> {noformat}
> We only convert integer literals (not foldable expressions).
> For positions that are aggregate functions, an analysis exception should be 
> thrown, e.g. in postgres:
> {noformat}
> rxin=# select 'one', 'two', count(*) from r1 group by 1, 3;
> ERROR:  aggregate functions are not allowed in GROUP BY
> LINE 1: select 'one', 'two', count(*) from r1 group by 1, 3;
>  ^
> {noformat}
> This should be controlled by config option spark.sql.groupByOrdinal.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13957) Support group by ordinal in SQL

2016-03-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13957:


Assignee: Apache Spark

> Support group by ordinal in SQL
> ---
>
> Key: SPARK-13957
> URL: https://issues.apache.org/jira/browse/SPARK-13957
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> This is to support group by ordinal (position) in SQL, e.g.
> {noformat}
> select c1, c2, c3, sum(*) from tbl group by 1, 3, c4
> {noformat}
> should be equivalent to
> {noformat}
> select c1, c2, c3, sum(*) from tbl group by c1, c3, c4
> {noformat}
> We only convert integer literals (not foldable expressions).
> For positions that are aggregate functions, an analysis exception should be 
> thrown, e.g. in postgres:
> {noformat}
> rxin=# select 'one', 'two', count(*) from r1 group by 1, 3;
> ERROR:  aggregate functions are not allowed in GROUP BY
> LINE 1: select 'one', 'two', count(*) from r1 group by 1, 3;
>  ^
> {noformat}
> This should be controlled by config option spark.sql.groupByOrdinal.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13946) PySpark DataFrames allows you to silently use aggregate expressions derived from different table expressions

2016-03-19 Thread Wes McKinney (JIRA)
Wes McKinney created SPARK-13946:


 Summary: PySpark DataFrames allows you to silently use aggregate 
expressions derived from different table expressions
 Key: SPARK-13946
 URL: https://issues.apache.org/jira/browse/SPARK-13946
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Reporter: Wes McKinney


In my opinion, this code should raise an exception rather than silently 
discarding the predicate:

{code}
import numpy as np
import pandas as pd
from pyspark.sql import functions as F

df = pd.DataFrame({'foo': np.random.randn(100),
                   'bar': np.random.randn(100)})

sdf = sqlContext.createDataFrame(df)

sdf2 = sdf[sdf.bar > 0]
sdf.agg(F.count(sdf2.foo)).show()

+----------+
|count(foo)|
+----------+
|       100|
+----------+
{code}
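
For comparison, a sketch of how the intended conditional count could be expressed against 
a single DataFrame so no predicate is silently dropped (same {{sdf}} as above):

{code}
from pyspark.sql import functions as F

# count() ignores nulls, so count(when(...)) only counts rows with bar > 0.
sdf.agg(F.count(F.when(sdf.bar > 0, sdf.foo))).show()
{code}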



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13932) CUBE Query with filter (HAVING) and condition (IF) raises an AnalysisException

2016-03-19 Thread Tien-Dung LE (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien-Dung LE updated SPARK-13932:
-
Affects Version/s: 2.0.0

> CUBE Query with filter (HAVING) and condition (IF) raises an AnalysisException
> --
>
> Key: SPARK-13932
> URL: https://issues.apache.org/jira/browse/SPARK-13932
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 2.0.0
>Reporter: Tien-Dung LE
>
> A complex aggregate query using a condition in the aggregate function and a 
> GROUP BY ... HAVING clause raises an exception. This issue only happens in Spark 
> version 1.6.+ but not in Spark 1.5.+.
> Here is a typical error message {code}
> org.apache.spark.sql.AnalysisException: Reference 'b' is ambiguous, could be: 
> b#55, b#124.; line 1 pos 178
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
> {code}
> Here is a code snippet to re-produce the error in a spark-shell session:
> {code}
> import sqlContext.implicits._
> case class Toto(  a: String = f"${(math.random*1e6).toLong}%06.0f",
>   b: Int = (math.random*1e3).toInt,
>   n: Int = (math.random*1e3).toInt,
>   m: Double = (math.random*1e3))
> val data = sc.parallelize(1 to 1e6.toInt).map(i => Toto())
> val df: org.apache.spark.sql.DataFrame = sqlContext.createDataFrame( data )
> df.registerTempTable( "toto" )
> val sqlSelect1   = "SELECT a, b, COUNT(1) AS k1, COUNT(1) AS k2, SUM(m) AS 
> k3, GROUPING__ID"
> val sqlSelect2   = "SELECT a, b, COUNT(1) AS k1, COUNT(IF(n > 500,1,0)) AS 
> k2, SUM(m) AS k3, GROUPING__ID"
> val sqlGroupBy  = "FROM toto GROUP BY a, b GROUPING SETS ((a,b),(a),(b))"
> val sqlHaving   = "HAVING ((GROUPING__ID & 1) == 1) AND (b > 500)"
> sqlContext.sql( s"$sqlSelect1 $sqlGroupBy $sqlHaving" ) // OK
> sqlContext.sql( s"$sqlSelect2 $sqlGroupBy" ) // OK
> sqlContext.sql( s"$sqlSelect2 $sqlGroupBy $sqlHaving" ) // ERROR
> {code}
> And here is the full log
> {code}
> scala> sqlContext.sql( s"$sqlSelect1 $sqlGroupBy $sqlHaving" )
> res12: org.apache.spark.sql.DataFrame = [a: string, b: int, k1: bigint, k2: 
> bigint, k3: double, GROUPING__ID: int]
> scala> sqlContext.sql( s"$sqlSelect2 $sqlGroupBy" )
> res13: org.apache.spark.sql.DataFrame = [a: string, b: int, k1: bigint, k2: 
> bigint, k3: double, GROUPING__ID: int]
> scala> sqlContext.sql( s"$sqlSelect2 $sqlGroupBy $sqlHaving" ) // ERROR
> org.apache.spark.sql.AnalysisException: Reference 'b' is ambiguous, could be: 
> b#55, b#124.; line 1 pos 178
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:171)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4$$anonfun$26.apply(Analyzer.scala:471)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4$$anonfun$26.apply(Analyzer.scala:471)
>   at 
> org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:471)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:467)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:316)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:316)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:265)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at 

[jira] [Commented] (SPARK-13950) Generate code for sort merge left/right outer join

2016-03-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15198162#comment-15198162
 ] 

Apache Spark commented on SPARK-13950:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/11771

> Generate code for sort merge left/right outer join
> --
>
> Key: SPARK-13950
> URL: https://issues.apache.org/jira/browse/SPARK-13950
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13982) SparkR - KMeans predict: Output column name of features is an unclear, automatic genetared text

2016-03-19 Thread Narine Kokhlikyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Narine Kokhlikyan updated SPARK-13982:
--
Summary: SparkR - KMeans predict: Output column name of features is an 
unclear, automatic genetared text  (was: SparkR - KMeans predict: Output column 
name of features is an unclear, automatically genetared text)

> SparkR - KMeans predict: Output column name of features is an unclear, 
> automatic genetared text
> ---
>
> Key: SPARK-13982
> URL: https://issues.apache.org/jira/browse/SPARK-13982
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Narine Kokhlikyan
>Priority: Minor
>
> Currently, the features output column name of KMeans predict is set to something 
> like "vecAssembler_522ba59ea239__output", which is the default output 
> column name of the VectorAssembler.
> Example: 
> showDF(predict(model, training)) shows something like this:
> DataFrame[Sepal_Length:double, Sepal_Width:double, Petal_Length:double, 
> Petal_Width:double,**vecAssembler_522ba59ea239__output:**vector, 
> prediction:int]
> This name is automatically generated and unclear from the user's perspective.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13858) TPCDS query 21 returns wrong results compared to TPC official result set

2016-03-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13858:


Assignee: Apache Spark

> TPCDS query 21 returns wrong results compared to TPC official result set 
> -
>
> Key: SPARK-13858
> URL: https://issues.apache.org/jira/browse/SPARK-13858
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: JESSE CHEN
>Assignee: Apache Spark
>  Labels: tpcds-result-mismatch
>
> Testing Spark SQL using TPC queries. Query 21 returns wrong results compared 
> to official result set. This is at 1GB SF (validation run).
> SparkSQL is missing at least one row (grep for ABDA); I believe 2 
> other rows are missing as well.
> Actual results:
> {noformat}
> [null,AABD,2565,1922]
> [null,AAHD,2956,2052]
> [null,AALA,2042,1793]
> [null,ACGC,2373,1771]
> [null,ACKC,2321,1856]
> [null,ACOB,1504,1397]
> [null,ADKB,1820,2163]
> [null,AEAD,2631,1965]
> [null,AEOC,1659,1798]
> [null,AFAC,1965,1705]
> [null,AFAD,1769,1313]
> [null,AHDE,2700,1985]
> [null,AHHA,1578,1082]
> [null,AIEC,1756,1804]
> [null,AIMC,3603,2951]
> [null,AJAC,2109,1989]
> [null,AJKB,2573,3540]
> [null,ALBE,3458,2992]
> [null,ALCE,1720,1810]
> [null,ALEC,2569,1946]
> [null,ALNB,2552,1750]
> [null,ANFE,2022,2269]
> [null,AOIB,2982,2540]
> [null,APJB,2344,2593]
> [null,BAPD,2182,2787]
> [null,BDCE,2844,2069]
> [null,BDDD,2417,2537]
> [null,BDJA,1584,1666]
> [null,BEOD,2141,2649]
> [null,BFCC,2745,2020]
> [null,BFMB,1642,1364]
> [null,BHPC,1923,1780]
> [null,BIDB,1956,2836]
> [null,BIGB,2023,2344]
> [null,BIJB,1977,2728]
> [null,BJFE,1891,2390]
> [null,BLDE,1983,1797]
> [null,BNID,2485,2324]
> [null,BNLD,2385,2786]
> [null,BOMB,2291,2092]
> [null,CAAA,2233,2560]
> [null,CBCD,1540,2012]
> [null,CBIA,2394,2122]
> [null,CBPB,1790,1661]
> [null,CCMD,2654,2691]
> [null,CDBC,1804,2072]
> [null,CFEA,1941,1567]
> [null,CGFD,2123,2265]
> [null,CHPC,2933,2174]
> [null,CIGD,2618,2399]
> [null,CJCB,2728,2367]
> [null,CJLA,1350,1732]
> [null,CLAE,2578,2329]
> [null,CLGA,1842,1588]
> [null,CLLB,3418,2657]
> [null,CLOB,3115,2560]
> [null,CMAD,1991,2243]
> [null,CMJA,1261,1855]
> [null,CMLA,3288,2753]
> [null,CMPD,1320,1676]
> [null,CNGB,2340,2118]
> [null,CNHD,3519,3348]
> [null,CNPC,2561,1948]
> [null,DCPC,2664,2627]
> [null,DDHA,1313,1926]
> [null,DDND,1109,835]
> [null,DEAA,2141,1847]
> [null,DEJA,3142,2723]
> [null,DFKB,1470,1650]
> [null,DGCC,2113,2331]
> [null,DGFC,2201,2928]
> [null,DHPA,2467,2133]
> [null,DMBA,3085,2087]
> [null,DPAB,3494,3081]
> [null,EAEC,2133,2148]
> [null,EAPA,1560,1275]
> [null,ECGC,2815,3307]
> [null,EDPD,2731,1883]
> [null,EEEC,2024,1902]
> [null,EEMC,2624,2387]
> [null,EFFA,2047,1878]
> [null,EGJA,2403,2633]
> [null,EGMA,2784,2772]
> [null,EGOC,2389,1753]
> [null,EHFD,1940,1420]
> [null,EHLB,2320,2057]
> [null,EHPA,1898,1853]
> [null,EIPB,2930,2326]
> [null,EJAE,2582,1836]
> [null,EJIB,2257,1681]
> [null,EJJA,2791,1941]
> [null,EJJD,3410,2405]
> [null,EJNC,2472,2067]
> [null,EJPD,1219,1229]
> [null,EKEB,2047,1713]
> [null,EMEA,2502,1897]
> [null,EMKC,2362,2042]
> [null,ENAC,2011,1909]
> [null,ENFB,2507,2162]
> [null,ENOD,3371,2709]
> {noformat}
> Expected results:
> {noformat}
> +----------------------+-----------+------------+-----------+
> | W_WAREHOUSE_NAME     | I_ITEM_ID | INV_BEFORE | INV_AFTER |
> +----------------------+-----------+------------+-----------+
> | Bad cards must make. | AACD      |       1889 |      2168 |
> | Bad cards must make. | AAHD      |       2739 |   

[jira] [Updated] (SPARK-7992) Hide private classes/objects in in generated Java API doc

2016-03-19 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-7992:
-
Assignee: (was: Xiangrui Meng)

> Hide private classes/objects in in generated Java API doc
> -
>
> Key: SPARK-7992
> URL: https://issues.apache.org/jira/browse/SPARK-7992
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Documentation
>Affects Versions: 1.4.0
>Reporter: Xiangrui Meng
>
> After SPARK-5610, we found that private classes/objects still show up in the 
> generated Java API doc, e.g., under `org.apache.spark.api.r` we can see
> {code}
> BaseRRDD
> PairwiseRRDD
> RRDD
> SpecialLengths
> StringRRDD
> {code}
> We should update genjavadoc to hide those private classes/methods. The best 
> approach is to find a good mapping from Scala private to Java, and merge it 
> into the main genjavadoc repo. A WIP PR is at 
> https://github.com/typesafehub/genjavadoc/pull/47.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13038) PySpark ml.pipeline support export/import - non-nested Pipelines

2016-03-19 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-13038:
--
Summary: PySpark ml.pipeline support export/import - non-nested Pipelines  
(was: PySpark ml.pipeline support export/import)

> PySpark ml.pipeline support export/import - non-nested Pipelines
> 
>
> Key: SPARK-13038
> URL: https://issues.apache.org/jira/browse/SPARK-13038
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Assignee: Xusen Yin
>Priority: Minor
>
> Add export/import for all estimators and transformers(which have Scala 
> implementation) under pyspark/ml/pipeline.py. Please refer the implementation 
> at SPARK-13032.
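
A minimal sketch of the expected Python usage, mirroring the Scala MLWritable/MLReadable 
pattern; the exact Python API is what this task adds, so treat the calls below as assumptions:

{code}
from pyspark.ml.feature import Binarizer

bz = Binarizer(threshold=0.5, inputCol="features", outputCol="binarized")
bz.save("/tmp/binarizer")                # export
bz2 = Binarizer.load("/tmp/binarizer")   # import
{code}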



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14009) Fail the tests if any catalyst rule reaches the max number of iterations

2016-03-19 Thread Davies Liu (JIRA)
Davies Liu created SPARK-14009:
--

 Summary: Fail the tests if any catalyst rule reaches the max number 
of iterations
 Key: SPARK-14009
 URL: https://issues.apache.org/jira/browse/SPARK-14009
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Davies Liu
Assignee: Davies Liu


Recently some Catalyst rules have become unstable (they conflict with each other or keep 
adding stuff to the query plan); we should detect this early by failing the 
tests if any rule reaches the max number of iterations (200).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13776) Web UI is not available after ./sbin/start-master.sh

2016-03-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-13776.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11615
[https://github.com/apache/spark/pull/11615]

> Web UI is not available after ./sbin/start-master.sh
> 
>
> Key: SPARK-13776
> URL: https://issues.apache.org/jira/browse/SPARK-13776
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.6.0
> Environment: Solaris 11.3, Oracle SPARC T-5 8 with 1024 hardware 
> threads
>Reporter: Erik O'Shaughnessy
>Priority: Minor
> Fix For: 2.0.0
>
>
> The Apache Spark Web UI fails to become available after starting a Spark 
> master in stand-alone mode:
> $ ./sbin/start-master.sh
> The log file contains the following:
> {quote}
> cat spark-hadoop-org.apache.spark.deploy.master.Master-1-t5-8-002.out
> Spark Command: /usr/java/bin/java -cp 
> /usr/local/spark-1.6.0_nohadoop/conf/:/usr/local/spark-1.6.0_nohadoop/assembly/target/scala-2.10/spark-assembly-1.6.0-hadoop2.2.0.jar:/usr/local/spark-1.6.0_nohadoop/lib_managed/jars/datanucleus-api-jdo-3.2.6.jar:/usr/local/spark-1.6.0_nohadoop/lib_managed/jars/datanucleus-rdbms-3.2.9.jar:/usr/local/spark-1.6.0_nohadoop/lib_managed/jars/datanucleus-core-3.2.10.jar
>  -Xms1g -Xmx1g org.apache.spark.deploy.master.Master --ip t5-8-002 --port 
> 7077 --webui-port 8080
> 
> 16/01/27 12:00:42 WARN AbstractConnector: insufficient threads configured for 
> SelectChannelConnector@0.0.0.0:8080
> 16/01/27 12:00:42 WARN AbstractConnector: insufficient threads configured for 
> SelectChannelConnector@t5-8-002:6066
> {quote}
> I did some poking around and it seems that message is coming from Jetty and 
> indicates a mismatch between Jetty's default maxThreads configuration and the 
> actual number of CPUs available on the hardware (1024). I was not able to 
> find a way to successfully change Jetty's configuration at run-time. 
> Our workaround was to disable CPUs until the WARN messages did not occur in 
> the log file, which was when NCPUs = 504. 
> I don't know for certain that this isn't a known problem in Jetty from 
> looking at their bug reports, but I wasn't able to locate a Jetty issue that 
> described this problem.
> While not specifically an Apache Spark problem, I thought documenting it 
> would at least be helpful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13461) Duplicated example code merge and cleanup

2016-03-19 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15203076#comment-15203076
 ] 

Xusen Yin commented on SPARK-13461:
---

Yes we'll delete it.

> Duplicated example code merge and cleanup
> -
>
> Key: SPARK-13461
> URL: https://issues.apache.org/jira/browse/SPARK-13461
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>Priority: Minor
>  Labels: starter
>
> Merge duplicated code after we finish the example code substitution.
> Duplications include:
> * JavaTrainValidationSplitExample 
> * TrainValidationSplitExample
> * Random data generation in mllib-statistics.md need to remove "-" in each 
> line.
> * Others can be added here ...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13937) PySpark ML JavaWrapper, variable _java_obj should not be static

2016-03-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15197902#comment-15197902
 ] 

Apache Spark commented on SPARK-13937:
--

User 'BryanCutler' has created a pull request for this issue:
https://github.com/apache/spark/pull/11767

> PySpark ML JavaWrapper, variable _java_obj should not be static
> ---
>
> Key: SPARK-13937
> URL: https://issues.apache.org/jira/browse/SPARK-13937
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Reporter: Bryan Cutler
>Priority: Minor
>
> In PySpark ML wrapper.py, the abstract class {{JavaWrapper}} has a static 
> variable {{_java_obj}}.  This is meant to hold an instance of a companion 
> Java object.  It seems as though it was made static accidentally because it 
> is never used, and all assignments done in derived classes are done to a 
> member variable with {{self._java_obj}}.  This does not cause any problems 
> with the current functionality, but it should be changed so as not to cause 
> any confusion and misuse in the future.
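
A small, generic Python illustration of the pitfall (not the actual pyspark.ml.wrapper 
code): a class-level attribute is shared by all instances until an instance assignment 
shadows it.

{code}
class JavaWrapperLike(object):
    _java_obj = None              # class attribute (the accidental "static")

    def __init__(self, obj):
        self._java_obj = obj      # creates an instance attribute that shadows it

a, b = JavaWrapperLike("A"), JavaWrapperLike("B")
print(a._java_obj, b._java_obj, JavaWrapperLike._java_obj)  # A B None
{code}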



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-13821) TPC-DS Query 20 fails to compile

2016-03-19 Thread Roy Cecil (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Roy Cecil closed SPARK-13821.
-
Resolution: Not A Problem

> TPC-DS Query 20 fails to compile
> 
>
> Key: SPARK-13821
> URL: https://issues.apache.org/jira/browse/SPARK-13821
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
> Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo)
> Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 
> 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Roy Cecil
>
> TPC-DS Query 20 Fails to compile with the follwing Error Message
> {noformat}
> Parsing error: NoViableAltException(10@[127:1: selectItem : ( ( 
> tableAllColumns )=> tableAllColumns -> ^( TOK_SELEXPR tableAllColumns ) | ( 
> expression ( ( ( KW_AS )? identifier ) | ( KW_AS LPAREN identifier ( COMMA 
> identifier )* RPAREN ) )? ) -> ^( TOK_SELEXPR expression ( identifier )* ) 
> );])
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser$DFA17.specialStateTransition(HiveParser_SelectClauseParser.java:11835)
> at org.antlr.runtime.DFA.predict(DFA.java:80)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectItem(HiveParser_SelectClauseParser.java:2853)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectList(HiveParser_SelectClauseParser.java:1401)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:1128)
> Parsing error: NoViableAltException(10@[127:1: selectItem : ( ( 
> tableAllColumns )=> tableAllColumns -> ^( TOK_SELEXPR tableAllColumns ) | ( 
> expression ( ( ( KW_AS )? identifier ) | ( KW_AS LPAREN identifier ( COMMA 
> identifier )* RPAREN ) )? ) -> ^( TOK_SELEXPR expression ( identifier )* ) 
> );])
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser$DFA17.specialStateTransition(HiveParser_SelectClauseParser.java:11835)
> at org.antlr.runtime.DFA.predict(DFA.java:80)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectItem(HiveParser_SelectClauseParser.java:2853)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectList(HiveParser_SelectClauseParser.java:1401)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:1128)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13761) Deprecate validateParams

2016-03-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15200097#comment-15200097
 ] 

Apache Spark commented on SPARK-13761:
--

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/11790

> Deprecate validateParams
> 
>
> Key: SPARK-13761
> URL: https://issues.apache.org/jira/browse/SPARK-13761
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: yuhao yang
>Priority: Minor
> Fix For: 2.0.0
>
>
> Deprecate validateParams() method here: 
> [https://github.com/apache/spark/blob/035d3acdf3c1be5b309a861d5c5beb803b946b5e/mllib/src/main/scala/org/apache/spark/ml/param/params.scala#L553]
> Move all functionality in overridden methods to transformSchema().
> Check docs to make sure they indicate complex Param interaction checks should 
> be done in transformSchema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13935) Other clients' connection hang up when someone do huge load

2016-03-19 Thread Tao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15197554#comment-15197554
 ] 

Tao Wang edited comment on SPARK-13935 at 3/16/16 3:51 PM:
---

[~marmbrus] [~liancheng] [~chenghao] 


was (Author: wangtao):
[~marmbrus][~liancheng][~chenghao] 

> Other clients' connection hang up when someone do huge load
> ---
>
> Key: SPARK-13935
> URL: https://issues.apache.org/jira/browse/SPARK-13935
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.2, 1.6.0, 1.6.1
>Reporter: Tao Wang
>Priority: Critical
>
> We run a SQL statement like "insert overwrite table store_returns partition 
> (sr_returned_date) select xx" using beeline; it then blocks other 
> beeline connections while invoking the Hive method via 
> "ClientWrapper.loadDynamicPartitions".
> The reason is that "withHiveState" locks "clientLoader". Sadly, when a new 
> client comes in, it invokes the "setConf" methods, which are also synchronized on 
> "clientLoader".
> So the problem is that if the first SQL statement takes a very long time to run, 
> no other client can connect to the thrift server successfully.
> We tested on release 1.5.1; not sure if the latest release has the same issue.
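
For readers unfamiliar with the locking pattern being described, here is a tiny 
self-contained sketch of the contention (plain Scala, not the actual ClientWrapper 
code; the names simply mirror the description above):

{code}
// Sketch: a long-running "load" holds the shared lock (clientLoader), and every new
// connection's setConf needs the same lock, so new clients block until the load ends.
object LockContentionSketch extends App {
  private val clientLoader = new Object

  def withHiveState[A](body: => A): A = clientLoader.synchronized { body }
  def setConf(key: String, value: String): Unit = clientLoader.synchronized {
    println(s"set $key=$value")
  }

  val hugeLoad = new Thread(new Runnable {
    def run(): Unit = withHiveState { Thread.sleep(60000) } // simulates loadDynamicPartitions
  })
  hugeLoad.start()
  Thread.sleep(100)
  setConf("hive.exec.dynamic.partition", "true") // new client: blocks ~60s behind the load
}
{code}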



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13983) HiveThriftServer2 can not get "--hiveconf" or ''--hivevar" variables since 1.6 version (both multi-session and single session)

2016-03-19 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-13983:
---
Assignee: Cheng Lian

> HiveThriftServer2 can not get "--hiveconf" or ''--hivevar" variables since 
> 1.6 version (both multi-session and single session)
> --
>
> Key: SPARK-13983
> URL: https://issues.apache.org/jira/browse/SPARK-13983
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1
> Environment: ubuntu, spark 1.6.0 standalone, spark 1.6.1 standalone
> (tried spark branch-1.6 snapshot as well)
> compiled with scala 2.10.5 and hadoop 2.6
> (-Phadoop-2.6 -Psparkr -Phive -Phive-thriftserver)
>Reporter: Teng Qiu
>Assignee: Cheng Lian
>
> HiveThriftServer2 should be able to get "\--hiveconf" or "\-\-hivevar" 
> variables from the JDBC client, either from a command-line parameter of beeline, 
> such as
> {{beeline --hiveconf spark.sql.shuffle.partitions=3 --hivevar 
> db_name=default}}
> or from the JDBC connection string, like
> {{jdbc:hive2://localhost:1?spark.sql.shuffle.partitions=3#db_name=default}}
> This worked in Spark version 1.5.x, but after upgrading to 1.6, it doesn't 
> work.
> to reproduce this issue, try to connect to HiveThriftServer2 with beeline:
> {code}
> bin/beeline -u jdbc:hive2://localhost:1 \
> --hiveconf spark.sql.shuffle.partitions=3 \
> --hivevar db_name=default
> {code}
> or
> {code}
> bin/beeline -u 
> jdbc:hive2://localhost:1?spark.sql.shuffle.partitions=3#db_name=default
> {code}
> will get the following results:
> {code}
> 0: jdbc:hive2://localhost:1> set spark.sql.shuffle.partitions;
> +-------------------------------+--------+
> |              key              | value  |
> +-------------------------------+--------+
> | spark.sql.shuffle.partitions  | 200    |
> +-------------------------------+--------+
> 1 row selected (0.192 seconds)
> 0: jdbc:hive2://localhost:1> use ${db_name};
> Error: org.apache.spark.sql.AnalysisException: cannot recognize input near 
> '$' '{' 'db_name' in switch database statement; line 1 pos 4 (state=,code=0)
> {code}
> -
> but this bug does not affect current versions of the spark-sql CLI; the following 
> commands work:
> {code}
> bin/spark-sql --master local[2] \
>   --hiveconf spark.sql.shuffle.partitions=3 \
>   --hivevar db_name=default
> spark-sql> set spark.sql.shuffle.partitions
> spark.sql.shuffle.partitions   3
> Time taken: 1.037 seconds, Fetched 1 row(s)
> spark-sql> use ${db_name};
> OK
> Time taken: 1.697 seconds
> {code}
> so I think it may be caused by this change: 
> https://github.com/apache/spark/pull/8909 ( [SPARK-10810] [SPARK-10902] [SQL] 
> Improve session management in SQL )
> perhaps by calling {{hiveContext.newSession}}, the variables from 
> {{sessionConf}} were not loaded into the new session? 
> (https://github.com/apache/spark/pull/8909/files#diff-8f8b7f4172e8a07ff20a4dbbbcc57b1dR69)
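
If that hypothesis is right, the fix would amount to replaying the client-supplied 
settings into the newly created session. A rough sketch (hypothetical helper and 
parameter name, not the actual patch):

{code}
// Rough sketch only: copy JDBC-client settings (--hiveconf / --hivevar values,
// here assumed to arrive as a plain Map) into the per-connection session context.
import org.apache.spark.sql.hive.HiveContext

def newSessionWithConf(parent: HiveContext,
                       confOverlay: Map[String, String]): HiveContext = {
  val session = parent.newSession()
  confOverlay.foreach { case (k, v) => session.setConf(k, v) } // e.g. spark.sql.shuffle.partitions -> 3
  session
}
{code}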



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12789) Support order by position in SQL

2016-03-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-12789:

Summary: Support order by position in SQL  (was: Support order by position)

> Support order by position in SQL
> 
>
> Key: SPARK-12789
> URL: https://issues.apache.org/jira/browse/SPARK-12789
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: zhichao-li
>Priority: Minor
>
>  A number in ORDER BY is treated as a constant expression at the moment. I guess it 
> would be good to enable users to specify a column by index, which has been 
> supported in Hive 0.11.0 and later. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13960) JAR/File HTTP Server doesn't respect "spark.driver.host" and there is no "spark.fileserver.host" option

2016-03-19 Thread Ilya Ostrovskiy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15200690#comment-15200690
 ] 

Ilya Ostrovskiy commented on SPARK-13960:
-

Exporting the SPARK_LOCAL_IP environment variable appears to work around this 
issue; however, setting SPARK_PUBLIC_DNS does not work, despite the 
documentation stating that the latter is "[the h]ostname your Spark program 
will advertise to other machines."

> JAR/File HTTP Server doesn't respect "spark.driver.host" and there is no 
> "spark.fileserver.host" option
> ---
>
> Key: SPARK-13960
> URL: https://issues.apache.org/jira/browse/SPARK-13960
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Submit
>Affects Versions: 1.6.1
> Environment: Any system with more than one IP address
>Reporter: Ilya Ostrovskiy
>
> There is no option to specify which hostname/IP address the jar/file server 
> listens on, and rather than using "spark.driver.host" if specified, the 
> jar/file server will listen on the system's primary IP address. This is an 
> issue when submitting an application in client mode on a machine with two 
> NICs connected to two different networks. 
> Steps to reproduce:
> 1) Have a cluster in a remote network, whose master is on 192.168.255.10
> 2) Have a machine at another location, with a "primary" IP address of 
> 192.168.1.2, connected to the "remote network" as well, with the IP address 
> 192.168.255.250. Let's call this the "client machine".
> 3) Ensure every machine in the spark cluster at the remote location can ping 
> 192.168.255.250 and reach the client machine via that address.
> 4) On the client: 
> {noformat}
> spark-submit --deploy-mode client --conf "spark.driver.host=192.168.255.250" 
> --master spark://192.168.255.10:7077 --class  
>  
> {noformat}
> 5) Navigate to http://192.168.255.250:4040/ and ensure that executors from 
> the remote cluster have found the driver on the client machine
> 6) Navigate to http://192.168.255.250:4040/environment/, and scroll to the 
> bottom
> 7) Observe that the JAR you specified in Step 4 will be listed under 
> http://192.168.1.2:/jars/.jar
> 8) Enjoy this stack trace periodically appearing on the client machine when 
> the nodes in the remote cluster can't connect to 192.168.1.2 to get your JAR
> {noformat}
> 16/03/17 03:25:55 WARN TaskSetManager: Lost task 1.2 in stage 0.0 (TID 5, 
> 192.168.255.11): java.net.SocketTimeoutException: connect timed out
> at java.net.PlainSocketImpl.socketConnect(Native Method)
> at 
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
> at 
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
> at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> at java.net.Socket.connect(Socket.java:589)
> at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
> at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
> at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
> at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
> at sun.net.www.http.HttpClient.New(HttpClient.java:308)
> at sun.net.www.http.HttpClient.New(HttpClient.java:326)
> at 
> sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169)
> at 
> sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105)
> at 
> sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999)
> at 
> sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933)
> at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:588)
> at org.apache.spark.util.Utils$.fetchFile(Utils.scala:381)
> at 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:405)
> at 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:397)
> at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
> at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
> at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
> at 
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
> at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
> at 
> 

[jira] [Updated] (SPARK-12789) Support order by position in SQL

2016-03-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-12789:

Description: 
This is to support order by position in SQL, e.g.

{noformat}
select c1, c2, c3 from tbl order by 1 desc, 3
{noformat}

should be equivalent to

{noformat}
select c1, c2, c3 from tbl order by c1 desc, c3
{noformat}

We only convert integer literals (not foldable expressions).

We should make sure this also works with select *.


  was:
This is to support order by position in SQL, e.g.

{noformat}
select c1, c2, c3 from tbl order by 1 desc, 3
{noformat}

should be equivalent to

{noformat}
select c1, c2, c3 from tbl order by c1 desc, c3
{noformat}

We should make sure this also works with select *.



> Support order by position in SQL
> 
>
> Key: SPARK-12789
> URL: https://issues.apache.org/jira/browse/SPARK-12789
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: zhichao-li
>Priority: Minor
>
> This is to support order by position in SQL, e.g.
> {noformat}
> select c1, c2, c3 from tbl order by 1 desc, 3
> {noformat}
> should be equivalent to
> {noformat}
> select c1, c2, c3 from tbl order by c1 desc, c3
> {noformat}
> We only convert integer literals (not foldable expressions).
> We should make sure this also works with select *.
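
A rough sketch of what such a conversion could look like as an analyzer-style 
rewrite (illustrative only, not Spark's actual rule; it only touches bare integer 
literals, so foldable expressions like {{1 + 0}} are left alone):

{code}
// Sketch: rewrite ORDER BY <integer literal> into ORDER BY <output column>.
import org.apache.spark.sql.catalyst.expressions.{Literal, SortOrder}
import org.apache.spark.sql.catalyst.plans.logical.Sort
import org.apache.spark.sql.types.IntegerType

def resolveOrderByOrdinals(sort: Sort): Sort = {
  val rewritten = sort.order.map {
    // only a bare, in-range, positive integer literal is treated as a position
    case SortOrder(Literal(pos: Int, IntegerType), direction)
        if pos >= 1 && pos <= sort.child.output.size =>
      SortOrder(sort.child.output(pos - 1), direction)
    case other => other // anything else, including foldable expressions, is untouched
  }
  sort.copy(order = rewritten)
}
{code}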



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13961) spark.ml ChiSqSelector should support other numeric types for label

2016-03-19 Thread Nick Pentreath (JIRA)
Nick Pentreath created SPARK-13961:
--

 Summary: spark.ml ChiSqSelector should support other numeric types 
for label
 Key: SPARK-13961
 URL: https://issues.apache.org/jira/browse/SPARK-13961
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Nick Pentreath
Assignee: Benjamin Fradet
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13928) Move org.apache.spark.Logging into org.apache.spark.internal.Logging

2016-03-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15197584#comment-15197584
 ] 

Apache Spark commented on SPARK-13928:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/11764

> Move org.apache.spark.Logging into org.apache.spark.internal.Logging
> 
>
> Key: SPARK-13928
> URL: https://issues.apache.org/jira/browse/SPARK-13928
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Reynold Xin
>
> Logging was made private in Spark 2.0. If we move it, then users would be 
> able to create a Logging trait themselves to avoid changing their own code. 
> Alternatively, we can also provide a compatibility package that adds 
> logging.
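
One way users could create such a trait themselves, as the description suggests, is 
a small shim backed directly by SLF4J (a sketch; they would still need to switch 
their import to the new trait):

{code}
// Sketch of a user-side replacement for the removed public Logging trait.
package com.example.compat

import org.slf4j.{Logger, LoggerFactory}

trait Logging {
  @transient private lazy val log: Logger = LoggerFactory.getLogger(getClass.getName)
  protected def logInfo(msg: => String): Unit = if (log.isInfoEnabled) log.info(msg)
  protected def logWarning(msg: => String): Unit = if (log.isWarnEnabled) log.warn(msg)
  protected def logError(msg: => String): Unit = if (log.isErrorEnabled) log.error(msg)
}
{code}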



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13821) TPC-DS Query 20 fails to compile

2016-03-19 Thread Roy Cecil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15201506#comment-15201506
 ] 

Roy Cecil edited comment on SPARK-13821 at 3/18/16 2:09 PM:


Dilip, I removed the extra comma from the query and it compiles. Since I am 
really comparing standard SQL, I just want to ensure that this is not a 
violation of the ANSI standard. Let me explore a little bit more.
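
For context, a minimal reproduction of the kind of trailing-comma select list that 
triggers the NoViableAltException (hypothetical table and column names, run from a 
spark-shell where {{sqlContext}} is the provided HiveContext):

{code}
// Sketch: the first statement has a dangling comma before FROM and fails to parse;
// removing the comma makes it parse (the table itself must exist for it to run).
val bad  = "SELECT i_item_id, i_item_desc, FROM item"  // extra trailing comma
val good = "SELECT i_item_id, i_item_desc FROM item"

scala.util.Try(sqlContext.sql(bad)).isFailure   // true: NoViableAltException-style parse error
scala.util.Try(sqlContext.sql(good)).isSuccess  // true, given the table exists
{code}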


was (Author: roycecil):
Dilip, Removed the query and it compiles. Since I am really comparing standard 
SQL  , I just want to ensure that this is not a violation of ANSI standard. Let 
me explore a little bit more.

> TPC-DS Query 20 fails to compile
> 
>
> Key: SPARK-13821
> URL: https://issues.apache.org/jira/browse/SPARK-13821
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
> Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo)
> Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 
> 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Roy Cecil
>
> TPC-DS Query 20 Fails to compile with the follwing Error Message
> {noformat}
> Parsing error: NoViableAltException(10@[127:1: selectItem : ( ( 
> tableAllColumns )=> tableAllColumns -> ^( TOK_SELEXPR tableAllColumns ) | ( 
> expression ( ( ( KW_AS )? identifier ) | ( KW_AS LPAREN identifier ( COMMA 
> identifier )* RPAREN ) )? ) -> ^( TOK_SELEXPR expression ( identifier )* ) 
> );])
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser$DFA17.specialStateTransition(HiveParser_SelectClauseParser.java:11835)
> at org.antlr.runtime.DFA.predict(DFA.java:80)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectItem(HiveParser_SelectClauseParser.java:2853)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectList(HiveParser_SelectClauseParser.java:1401)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:1128)
> Parsing error: NoViableAltException(10@[127:1: selectItem : ( ( 
> tableAllColumns )=> tableAllColumns -> ^( TOK_SELEXPR tableAllColumns ) | ( 
> expression ( ( ( KW_AS )? identifier ) | ( KW_AS LPAREN identifier ( COMMA 
> identifier )* RPAREN ) )? ) -> ^( TOK_SELEXPR expression ( identifier )* ) 
> );])
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser$DFA17.specialStateTransition(HiveParser_SelectClauseParser.java:11835)
> at org.antlr.runtime.DFA.predict(DFA.java:80)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectItem(HiveParser_SelectClauseParser.java:2853)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectList(HiveParser_SelectClauseParser.java:1401)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:1128)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13993) PySpark ml.feature.RFormula/RFormulaModel support export/import

2016-03-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13993:


Assignee: Apache Spark

> PySpark ml.feature.RFormula/RFormulaModel support export/import
> ---
>
> Key: SPARK-13993
> URL: https://issues.apache.org/jira/browse/SPARK-13993
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Xusen Yin
>Assignee: Apache Spark
>Priority: Minor
>
> Add save/load for RFormula and its model.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13733) Support initial weight distribution in personalized PageRank

2016-03-19 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15198327#comment-15198327
 ] 

Gayathri Murali commented on SPARK-13733:
-

[~mengxr] Should the rest of the vertices also be set to resetProb(which is 
0.25 initial weight) ?

> Support initial weight distribution in personalized PageRank
> 
>
> Key: SPARK-13733
> URL: https://issues.apache.org/jira/browse/SPARK-13733
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX
>Reporter: Xiangrui Meng
>
> It would be nice to support personalized PageRank with an initial weight 
> distribution besides a single vertex. It should be easy to modify the current 
> implementation to add this support.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13034) PySpark ml.classification support export/import

2016-03-19 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-13034.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11707
[https://github.com/apache/spark/pull/11707]

> PySpark ml.classification support export/import
> ---
>
> Key: SPARK-13034
> URL: https://issues.apache.org/jira/browse/SPARK-13034
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Priority: Minor
> Fix For: 2.0.0
>
>
> Add export/import for all estimators and transformers (which have a Scala 
> implementation) under pyspark/ml/classification.py. Please refer to the 
> implementation in SPARK-13032.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14005) Make RDD more compatible with Scala's collection

2016-03-19 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15203056#comment-15203056
 ] 

zhengruifeng commented on SPARK-14005:
--

OK, please close this JIRA.

> Make RDD more compatible with Scala's collection 
> -
>
> Key: SPARK-14005
> URL: https://issues.apache.org/jira/browse/SPARK-14005
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Reporter: zhengruifeng
>Priority: Trivial
>
> How about implementing some more methods for RDD to make it more compatible 
> with Scala's collection?
> Such as:
> nonEmpty, slice, takeRight, contains, last, reverse
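
Most of these can already be expressed with existing RDD operations; a small sketch 
of the kind of enrichment being asked about (illustrative only, pasteable into 
spark-shell):

{code}
// Sketch: collection-style helpers layered on top of existing RDD operations.
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

implicit class CollectionLikeRDD[T: ClassTag](rdd: RDD[T]) {
  def nonEmpty: Boolean = !rdd.isEmpty()
  def containsElem(elem: T): Boolean = !rdd.filter(_ == elem).isEmpty()
  // naive `last`: take the last element of each non-empty partition, then the last overall;
  // throws on an empty RDD, which is fine for a sketch
  def lastElem(): T =
    rdd.mapPartitions(it => if (it.hasNext) Iterator(it.toSeq.last) else Iterator.empty)
       .collect().last
}
{code}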



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13968) Use MurmurHash3 for hashing String features

2016-03-19 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15202003#comment-15202003
 ] 

Joseph K. Bradley commented on SPARK-13968:
---

I'm going to close this in favor of the older ticket.  I'll make the old ticket 
a subtask.  But I agree it'd be good to switch.

> Use MurmurHash3 for hashing String features
> ---
>
> Key: SPARK-13968
> URL: https://issues.apache.org/jira/browse/SPARK-13968
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Nick Pentreath
>Assignee: Yanbo Liang
>Priority: Minor
>
> Typically feature hashing is done on strings, i.e. feature names (or in the 
> case of raw feature indexes, either the string representation of the 
> numerical index can be used, or the index used "as-is" and not hashed).
> It is common to use a well-distributed hash function such as MurmurHash3. 
> This is the case in e.g. 
> [Scikit-learn|http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html#sklearn.feature_extraction.FeatureHasher].
> Currently Spark's {{HashingTF}} uses the object's hash code. Look at using 
> MurmurHash3 (at least for {{String}} which is the common case).
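
For a sense of the change being proposed, a small sketch comparing the two hash 
choices for turning a term into a bucket index (not the actual HashingTF patch; the 
seed value is arbitrary):

{code}
// Sketch: today's behaviour hashes with the object's hash code (##); the proposal
// is to use MurmurHash3 with a fixed seed so indices are stable and well distributed.
import scala.util.hashing.MurmurHash3

val numFeatures = 1 << 18
def nonNegativeMod(x: Int, mod: Int): Int = { val r = x % mod; r + (if (r < 0) mod else 0) }

val term = "spark"
val oldIndex = nonNegativeMod(term.##, numFeatures)                           // current approach
val newIndex = nonNegativeMod(MurmurHash3.stringHash(term, 42), numFeatures)  // proposed approach
{code}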



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13629) Add binary toggle Param to CountVectorizer

2016-03-19 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15201993#comment-15201993
 ] 

Joseph K. Bradley commented on SPARK-13629:
---

[~mlnick] Thanks for handling these count/hashing improvements!  This PR + the 
other JIRAs sound good to me.

> Add binary toggle Param to CountVectorizer
> --
>
> Key: SPARK-13629
> URL: https://issues.apache.org/jira/browse/SPARK-13629
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: yuhao yang
>Priority: Minor
> Fix For: 2.0.0
>
>
> It would be handy to add a binary toggle Param to CountVectorizer, as in the 
> scikit-learn one: 
> [http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html]
> If set, then all non-zero counts will be set to 1.
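
The intended semantics of the toggle are simple; a sketch of what it does to a 
term-count vector (illustrative, not the merged implementation):

{code}
// Sketch: with binary = true, every non-zero term count is clamped to 1.0.
import org.apache.spark.mllib.linalg.SparseVector

val counts = new SparseVector(5, Array(0, 2, 4), Array(3.0, 1.0, 7.0))
val binary = new SparseVector(5, counts.indices, counts.values.map(v => if (v > 0) 1.0 else 0.0))
// binary == (5,[0,2,4],[1.0,1.0,1.0])
{code}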



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11319) PySpark silently accepts null values in non-nullable DataFrame fields.

2016-03-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11319:


Assignee: (was: Apache Spark)

> PySpark silently accepts null values in non-nullable DataFrame fields.
> --
>
> Key: SPARK-11319
> URL: https://issues.apache.org/jira/browse/SPARK-11319
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Kevin Cox
>
> Running the following code with a null value in a non-nullable column 
> silently works. This makes the code incredibly hard to trust.
> {code}
> In [2]: from pyspark.sql.types import *
> In [3]: sqlContext.createDataFrame([(None,)], StructType([StructField("a", 
> TimestampType(), False)])).collect()
> Out[3]: [Row(a=None)]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13948) MiMa Check should catch if the visibility change to `private`

2016-03-19 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-13948:
---
Component/s: Project Infra

>   MiMa Check should catch if the visibility change to `private`
> --
>
> Key: SPARK-13948
> URL: https://issues.apache.org/jira/browse/SPARK-13948
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Project Infra
>Reporter: Dongjoon Hyun
>Assignee: Josh Rosen
>Priority: Critical
> Fix For: 2.0.0
>
>
> `GenerateMIMAIgnore.scala` generates `.generated-mima-class-excludes` from the 
> `private` classes in the current code. As a result, it ignores the case where 
> visibility goes from public to private. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13865) TPCDS query 87 returns wrong results compared to TPC official result set

2016-03-19 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15200516#comment-15200516
 ] 

Xiao Li commented on SPARK-13865:
-

This is the same as https://issues.apache.org/jira/browse/SPARK-13859. The 
query used in this JIRA is different from the original one in TPCDS, as posted 
below. Therefore, the results are different from the official result. 

[~jfc...@us.ibm.com] You need to rerun it using the standard query. Thanks! 
Spark does support EXCEPT. You do not need to change it. 

{code}
select count(*) 
from ((select distinct c_last_name, c_first_name, d_date
   from store_sales, date_dim, customer
   where store_sales.ss_sold_date_sk = date_dim.d_date_sk
 and store_sales.ss_customer_sk = customer.c_customer_sk
 and d_month_seq between [DMS] and [DMS]+11)
   except
  (select distinct c_last_name, c_first_name, d_date
   from catalog_sales, date_dim, customer
   where catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
 and catalog_sales.cs_bill_customer_sk = customer.c_customer_sk
 and d_month_seq between [DMS] and [DMS]+11)
   except
  (select distinct c_last_name, c_first_name, d_date
   from web_sales, date_dim, customer
   where web_sales.ws_sold_date_sk = date_dim.d_date_sk
 and web_sales.ws_bill_customer_sk = customer.c_customer_sk
 and d_month_seq between [DMS] and [DMS]+11)
) cool_cust
;
{code}


> TPCDS query 87 returns wrong results compared to TPC official result set 
> -
>
> Key: SPARK-13865
> URL: https://issues.apache.org/jira/browse/SPARK-13865
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: JESSE CHEN
>  Labels: tpcds-result-mismatch
>
> Testing Spark SQL using TPC queries. Query 87 returns wrong results compared 
> to official result set. This is at 1GB SF (validation run).
> SparkSQL returns count of 47555, answer set expects 47298.
> Actual results:
> {noformat}
> [47555]
> {noformat}
> {noformat}
> Expected:
> +-------+
> |   1   |
> +-------+
> | 47298 |
> +-------+
> {noformat}
> Query used:
> {noformat}
> -- start query 87 in stream 0 using template query87.tpl and seed 
> QUALIFICATION
> select count(*) 
> from 
>  (select distinct c_last_name as cln1, c_first_name as cfn1, d_date as 
> ddate1, 1 as notnull1
>from store_sales
> JOIN date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk
> JOIN customer ON store_sales.ss_customer_sk = customer.c_customer_sk
>where
>  d_month_seq between 1200 and 1200+11
>) tmp1
>left outer join
>   (select distinct c_last_name as cln2, c_first_name as cfn2, d_date as 
> ddate2, 1 as notnull2
>from catalog_sales
> JOIN date_dim ON catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
> JOIN customer ON catalog_sales.cs_bill_customer_sk = 
> customer.c_customer_sk
>where 
>  d_month_seq between 1200 and 1200+11
>) tmp2 
>   on (tmp1.cln1 = tmp2.cln2)
>   and (tmp1.cfn1 = tmp2.cfn2)
>   and (tmp1.ddate1= tmp2.ddate2)
>left outer join
>   (select distinct c_last_name as cln3, c_first_name as cfn3 , d_date as 
> ddate3, 1 as notnull3
>from web_sales
> JOIN date_dim ON web_sales.ws_sold_date_sk = date_dim.d_date_sk
> JOIN customer ON web_sales.ws_bill_customer_sk = 
> customer.c_customer_sk
>where 
>  d_month_seq between 1200 and 1200+11
>) tmp3 
>   on (tmp1.cln1 = tmp3.cln3)
>   and (tmp1.cfn1 = tmp3.cfn3)
>   and (tmp1.ddate1= tmp3.ddate3)
> where  
> notnull2 is null and notnull3 is null  
> ;
> -- end query 87 in stream 0 using template query87.tpl
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13972) hive tests should fail if SQL generation failed

2016-03-19 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-13972:
---
Assignee: Wenchen Fan

> hive tests should fail if SQL generation failed
> ---
>
> Key: SPARK-13972
> URL: https://issues.apache.org/jira/browse/SPARK-13972
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13776) Web UI is not available after ./sbin/start-master.sh

2016-03-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13776:
--
Assignee: Shixiong Zhu

> Web UI is not available after ./sbin/start-master.sh
> 
>
> Key: SPARK-13776
> URL: https://issues.apache.org/jira/browse/SPARK-13776
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.6.0
> Environment: Solaris 11.3, Oracle SPARC T-5 8 with 1024 hardware 
> threads
>Reporter: Erik O'Shaughnessy
>Assignee: Shixiong Zhu
>Priority: Minor
> Fix For: 2.0.0
>
>
> The Apache Spark Web UI fails to become available after starting a Spark 
> master in stand-alone mode:
> $ ./sbin/start-master.sh
> The log file contains the following:
> {quote}
> cat spark-hadoop-org.apache.spark.deploy.master.Master-1-t5-8-002.out
> Spark Command: /usr/java/bin/java -cp 
> /usr/local/spark-1.6.0_nohadoop/conf/:/usr/local/spark-1.6.0_nohadoop/assembly/target/scala-2.10/spark-assembly-1.6.0-hadoop2.2.0.jar:/usr/local/spark-1.6.0_nohadoop/lib_managed/jars/datanucleus-api-jdo-3.2.6.jar:/usr/local/spark-1.6.0_nohadoop/lib_managed/jars/datanucleus-rdbms-3.2.9.jar:/usr/local/spark-1.6.0_nohadoop/lib_managed/jars/datanucleus-core-3.2.10.jar
>  -Xms1g -Xmx1g org.apache.spark.deploy.master.Master --ip t5-8-002 --port 
> 7077 --webui-port 8080
> 
> 16/01/27 12:00:42 WARN AbstractConnector: insufficient threads configured for 
> SelectChannelConnector@0.0.0.0:8080
> 16/01/27 12:00:42 WARN AbstractConnector: insufficient threads configured for 
> SelectChannelConnector@t5-8-002:6066
> {quote}
> I did some poking around and it seems that message is coming from Jetty and 
> indicates a mismatch between Jetty's default maxThreads configuration and the 
> actual number of CPUs available on the hardware (1024). I was not able to 
> find a way to successfully change Jetty's configuration at run-time. 
> Our workaround was to disable CPUs until the WARN messages no longer occurred in 
> the log file, which was when NCPUs = 504. 
> I don't know for certain whether this is a known problem in Jetty from 
> looking at their bug reports, but I wasn't able to locate a Jetty issue that 
> described this problem.
> While not specifically an Apache Spark problem, I thought documenting it 
> would at least be helpful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10788) Decision Tree duplicates bins for unordered categorical features

2016-03-19 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-10788.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 9474
[https://github.com/apache/spark/pull/9474]

> Decision Tree duplicates bins for unordered categorical features
> 
>
> Key: SPARK-10788
> URL: https://issues.apache.org/jira/browse/SPARK-10788
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Seth Hendrickson
>Priority: Minor
> Fix For: 2.0.0
>
>
> Decision trees in spark.ml (RandomForest.scala) communicate twice as much 
> data as needed for unordered categorical features.  Here's an example.
> Say there are 3 categories A, B, C.  We consider 3 splits:
> * A vs. B, C
> * A, B vs. C
> * A, C vs. B
> Currently, we collect statistics for each of the 6 subsets of categories (3 * 
> 2 = 6).  However, we could instead collect statistics for the 3 subsets on 
> the left-hand side of the 3 possible splits: A and A,B and A,C.  If we also 
> have stats for the entire node, then we can compute the stats for the 3 
> subsets on the right-hand side of the splits. In pseudomath: {{stats(B,C) = 
> stats(A,B,C) - stats(A)}}.
> We should eliminate these extra bins within the spark.ml implementation since 
> the spark.mllib implementation will be removed before long (and will instead 
> call into spark.ml).
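
The pseudomath above is easy to mirror in code; a toy sketch with plain arrays 
standing in for the per-category stat vectors (purely illustrative):

{code}
// Toy sketch of the proposed saving: keep stats for the whole node and for the
// left-hand subsets only, and derive the right-hand stats by subtraction.
def subtract(total: Array[Double], left: Array[Double]): Array[Double] =
  total.zip(left).map { case (t, l) => t - l }

val statsABC = Array(10.0, 4.0)  // e.g. (count, label sum) for the whole node {A,B,C}
val statsA   = Array(3.0, 1.0)   // stats collected for the left-hand subset {A}
val statsBC  = subtract(statsABC, statsA)  // Array(7.0, 3.0): stats for {B,C}, no extra bin needed
{code}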



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12719) SQL generation support for generators (including UDTF)

2016-03-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15200419#comment-15200419
 ] 

Apache Spark commented on SPARK-12719:
--

User 'yy2016' has created a pull request for this issue:
https://github.com/apache/spark/pull/11795

> SQL generation support for generators (including UDTF)
> --
>
> Key: SPARK-12719
> URL: https://issues.apache.org/jira/browse/SPARK-12719
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>
> {{HiveCompatibilitySuite}} can be useful for bootstrapping test coverage. 
> Please refer to SPARK-11012 for more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13461) Duplicated example code merge and cleanup

2016-03-19 Thread Gabor Liptak (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15203041#comment-15203041
 ] 

Gabor Liptak commented on SPARK-13461:
--

[~yinxusen] 
{{examples/src/main/scala/org/apache/spark/examples/ml/TrainValidationSplitExample.scala}}
 doesn't seem to be referenced. Should it simply be deleted? Thanks

> Duplicated example code merge and cleanup
> -
>
> Key: SPARK-13461
> URL: https://issues.apache.org/jira/browse/SPARK-13461
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>Priority: Minor
>  Labels: starter
>
> Merge duplicated code after we finishing the example code substitution.
> Duplications include:
> * JavaTrainValidationSplitExample 
> * TrainValidationSplitExample
> * Random data generation in mllib-statistics.md need to remove "-" in each 
> line.
> * Others can be added here ...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13969) Extend input format that feature hashing can handle

2016-03-19 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15202007#comment-15202007
 ] 

Joseph K. Bradley commented on SPARK-13969:
---

I think HashingTF could be extended to handle this in two steps:
* Handle more input types [SPARK-11107]
* Accept multiple input columns [SPARK-8418]

> Extend input format that feature hashing can handle
> ---
>
> Key: SPARK-13969
> URL: https://issues.apache.org/jira/browse/SPARK-13969
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Nick Pentreath
>Priority: Minor
>
> Currently {{HashingTF}} works like {{CountVectorizer}} (the equivalent in 
> scikit-learn is {{HashingVectorizer}}). That is, it works on a sequence of 
> strings and computes term frequencies.
> The use cases for feature hashing extend to arbitrary feature values (binary, 
> count or real-valued). For example, scikit-learn's {{FeatureHasher}} can 
> accept a sequence of (feature_name, value) pairs (e.g. a map, list). In this 
> way, feature hashing can operate as both "one-hot encoder" and "vector 
> assembler" at the same time.
> Investigate adding a more generic feature hasher (that in turn can be used by 
> {{HashingTF}}).
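
A rough sketch of the kind of generic hasher described above, hashing 
(featureName, value) pairs into a single sparse vector (assumptions: MurmurHash3 
for the index, values summed on collision; not an actual Spark API):

{code}
// Sketch of a generic feature hasher: each (name, value) pair is hashed to an index
// and its value accumulated there, acting as one-hot encoder and vector assembler at once.
import scala.collection.mutable
import scala.util.hashing.MurmurHash3
import org.apache.spark.mllib.linalg.{Vector, Vectors}

def hashFeatures(features: Iterable[(String, Double)], numFeatures: Int = 1 << 18): Vector = {
  val acc = mutable.Map.empty[Int, Double].withDefaultValue(0.0)
  features.foreach { case (name, value) =>
    val raw = MurmurHash3.stringHash(name, 42) % numFeatures
    val idx = if (raw < 0) raw + numFeatures else raw
    acc(idx) += value
  }
  Vectors.sparse(numFeatures, acc.toSeq)
}

// Categorical values are encoded via their names; numeric features keep their value.
hashFeatures(Seq("country=DE" -> 1.0, "age" -> 42.0, "clicks" -> 3.0))
{code}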



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13960) HTTP-based JAR Server doesn't respect spark.driver.host and there is no "spark.fileserver.host" option

2016-03-19 Thread Ilya Ostrovskiy (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ilya Ostrovskiy updated SPARK-13960:

Description: 
There is no option to specify which hostname/IP address the jar/file server 
listens on, and rather than using "spark.driver.host" if specified, the 
jar/file server will listen on the system's primary IP address. This is an 
issue when submitting an application in client mode on a machine with two NICs 
connected to two different networks. 

Steps to reproduce:

1) Have a cluster in a remote network, whose master is on 192.168.255.10
2) Have a machine at another location, with a "primary" IP address of 
192.168.1.2, connected to the "remote network" as well, with the IP address 
192.168.255.250. Let's call this the "client machine".
3) Ensure every machine in the spark cluster at the remote location can ping 
192.168.255.250 and reach the client machine via that address.
4) On the client: 
{noformat}
spark-submit --deploy-mode client --conf "spark.driver.host=192.168.255.250" 
--master spark://192.168.255.10:7077 --class  
 
{noformat}
5) Navigate to http://192.168.255.250:4040/ and ensure that executors from the 
remote cluster have found the driver on the client machine
6) Navigate to http://192.168.255.250:4040/environment/, and scroll to the 
bottom
7) Observe that the JAR you specified in Step 4 will be listed under 
http://192.168.1.2:/jars/.jar
8) Enjoy this stack trace periodically appearing on the client machine when the 
nodes in the remote cluster can't connect to 192.168.1.2 to get your JAR
{noformat}
16/03/17 03:25:55 WARN TaskSetManager: Lost task 1.2 in stage 0.0 (TID 5, 
192.168.255.11): java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at 
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at 
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at 
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
at sun.net.www.http.HttpClient.New(HttpClient.java:308)
at sun.net.www.http.HttpClient.New(HttpClient.java:326)
at 
sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169)
at 
sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105)
at 
sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999)
at 
sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933)
at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:588)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:381)
at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:405)
at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:397)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at 
scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at 
org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:397)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{noformat}

  was:
There is no option to specify which hostname/IP address the jar/file server 
listens on, and rather than using "spark.driver.host" if specified, the 
jar/file server will listen on the system's primary IP address. This is an 
issue when submitting an application in client mode on a machine with two NICs 
connected to two different networks. 

Steps to reproduce:

1) Have a cluster in a remote network, whose master is on 192.168.255.10
2) Have a machine at another 

[jira] [Commented] (SPARK-13968) Use MurmurHash3 for hashing String features

2016-03-19 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15200254#comment-15200254
 ] 

Nick Pentreath commented on SPARK-13968:


Sure, I will assign to you. But I'd like to get some thoughts from [~mengxr] 
and [~josephkb] about this and the umbrella for feature hashing improvements 
(especially around the API / transformer behaviour) before starting work on 
these tickets.

> Use MurmurHash3 for hashing String features
> ---
>
> Key: SPARK-13968
> URL: https://issues.apache.org/jira/browse/SPARK-13968
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Nick Pentreath
>Priority: Minor
>
> Typically feature hashing is done on strings, i.e. feature names (or in the 
> case of raw feature indexes, either the string representation of the 
> numerical index can be used, or the index used "as-is" and not hashed).
> It is common to use a well-distributed hash function such as MurmurHash3. 
> This is the case in e.g. 
> [Scikit-learn|http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html#sklearn.feature_extraction.FeatureHasher].
> Currently Spark's {{HashingTF}} uses the object's hash code. Look at using 
> MurmurHash3 (at least for {{String}} which is the common case).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13938) word2phrase feature created in ML

2016-03-19 Thread Steve Weng (JIRA)
Steve Weng created SPARK-13938:
--

 Summary: word2phrase feature created in ML
 Key: SPARK-13938
 URL: https://issues.apache.org/jira/browse/SPARK-13938
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: Steve Weng
Priority: Critical


I implemented word2phrase (see http://arxiv.org/pdf/1310.4546.pdf), which 
transforms a sentence of words into one where certain consecutive 
words are concatenated, using a training model/estimator (e.g. "I went to New 
York" becomes "I went to new_york").



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13958) Executor OOM due to unbounded growth of pointer array in Sorter

2016-03-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13958:


Assignee: (was: Apache Spark)

> Executor OOM due to unbounded growth of pointer array in Sorter
> ---
>
> Key: SPARK-13958
> URL: https://issues.apache.org/jira/browse/SPARK-13958
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 1.6.1
>Reporter: Sital Kedia
>
> While running a job we saw that the executors are OOMing because in 
> UnsafeExternalSorter's growPointerArrayIfNecessary function, we are just 
> growing the pointer array indefinitely. 
> https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java#L292
> This is a regression introduced in PR- 
> https://github.com/apache/spark/pull/11095



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10574) HashingTF should use MurmurHash3

2016-03-19 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10574:
--
Assignee: Yanbo Liang

> HashingTF should use MurmurHash3
> 
>
> Key: SPARK-10574
> URL: https://issues.apache.org/jira/browse/SPARK-10574
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 1.5.0
>Reporter: Simeon Simeonov
>Assignee: Yanbo Liang
>Priority: Critical
>  Labels: HashingTF, hashing, mllib
>
> {{HashingTF}} uses the Scala native hashing {{##}} implementation. There are 
> two significant problems with this.
> First, per the [Scala 
> documentation|http://www.scala-lang.org/api/2.10.4/index.html#scala.Any] for 
> {{hashCode}}, the implementation is platform specific. This means that 
> feature vectors created on one platform may be different than vectors created 
> on another platform. This can create significant problems when a model 
> trained offline is used in another environment for online prediction. The 
> problem is made harder by the fact that following a hashing transform 
> features lose human-tractable meaning and a problem such as this may be 
> extremely difficult to track down.
> Second, the native Scala hashing function performs badly on longer strings, 
> exhibiting [200-500% higher collision 
> rates|https://gist.github.com/ssimeonov/eb12fcda75615e4a8d46] than, for 
> example, 
> [MurmurHash3|http://www.scala-lang.org/api/2.10.4/#scala.util.hashing.MurmurHash3$]
>  which is also included in the standard Scala libraries and is the hashing 
> choice of fast learners such as Vowpal Wabbit, scikit-learn and others. If 
> Spark users apply {{HashingTF}} only to very short, dictionary-like strings 
> the hashing function choice will not be a big problem but why have an 
> implementation in MLlib with this limitation when there is a better 
> implementation readily available in the standard Scala library?
> Switching to MurmurHash3 solves both problems. If there is agreement that 
> this is a good change, I can prepare a PR. 
> Note that changing the hash function would mean that models saved with a 
> previous version would have to be re-trained. This introduces a problem 
> that's orthogonal to breaking changes in APIs: breaking changes related to 
> artifacts, e.g., a saved model, produced by a previous version. Is there a 
> policy or best practice currently in effect about this? If not, perhaps we 
> should come up with a few simple rules about how we communicate these in 
> release notes, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13951) PySpark ml.pipeline support export/import - nested Pipelines

2016-03-19 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-13951:
-

 Summary: PySpark ml.pipeline support export/import - nested 
Pipelines
 Key: SPARK-13951
 URL: https://issues.apache.org/jira/browse/SPARK-13951
 Project: Spark
  Issue Type: Sub-task
  Components: ML, PySpark
Reporter: Joseph K. Bradley






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13988) Large history files block new applications from showing up in History UI.

2016-03-19 Thread Parth Brahmbhatt (JIRA)
Parth Brahmbhatt created SPARK-13988:


 Summary: Large history files block new applications from showing 
up in History UI.
 Key: SPARK-13988
 URL: https://issues.apache.org/jira/browse/SPARK-13988
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.6.1
Reporter: Parth Brahmbhatt


Some of our Spark users complain that their applications were not showing up in 
the history server UI. Our analysis suggests that this is a side effect of some 
application’s event log being too big. This is especially true for Spark ML 
applications that may have a lot of iterations, but it applies to other kinds of 
Spark jobs too. For example, on my local machine just running the following 
generates an event log of size 80MB.

{code}
./spark-shell --master yarn --deploy-mode client --conf 
spark.eventLog.enabled=true --conf 
spark.eventLog.dir=hdfs://localhost:9000/tmp/spark-events
val words = sc.textFile("test.txt")
for(i <- 1 to 1) words.count
sc.close 
{code}

For one of our users this file was as big as 12GB. He was running logistic 
regression using Spark ML. Given that each application generates its own 
event log and event logs are processed serially in a single thread, one huge 
application can result in a lot of users not being able to view their applications 
on the main UI. To overcome this issue I propose to make the replay execution 
multi-threaded so a single large event log won’t block other applications from 
being rendered into the UI. This still cannot solve the issue completely if there 
are too many large event logs, but the alternatives I have considered (read 
chunks from the beginning and end to get the Application Start and End events, or modify the 
event log format so it has this info in a header or footer) are all more 
intrusive. 

In addition there are several other things we can do to improve History Server 
implementation. 
* During the log checker phase to identify application start and end times, the 
replaying thread processes the whole event log and throws away all the info 
apart from the application start and end events. This is a pretty huge waste given that as 
soon as a user clicks on the application we reprocess the same event log to get 
job/task details. We should either optimize the first level of parsing so it 
reads some chunks from the beginning and end to identify the application-level 
details or, better yet, cache the job/task-level details when we process the file 
for the first time.
* On the job details page there is no pagination and we only show the last 1000 
job events when there are more than 1000 job events. Granted, when users have more than 
1K jobs they probably won't page through them, but not even having that option 
is a bad experience. Also, if that page were paginated we could probably get away 
with partial processing of the event log until the user wants to view the next 
page. This can help in cases where processing really large files causes OOM 
issues, as we will only be processing a subset of the file.
* On startup, the history server reprocesses the whole event log. For the top 
level application details, we could persist the processing results from the 
last run in a more compact and searchable format to improve the bootstrap time. 
This is briefly mentioned in SPARK-6951.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10574) HashingTF should use MurmurHash3

2016-03-19 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-10574:
--
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-13964

> HashingTF should use MurmurHash3
> 
>
> Key: SPARK-10574
> URL: https://issues.apache.org/jira/browse/SPARK-10574
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 1.5.0
>Reporter: Simeon Simeonov
>Priority: Critical
>  Labels: HashingTF, hashing, mllib
>
> {{HashingTF}} uses the Scala native hashing {{##}} implementation. There are 
> two significant problems with this.
> First, per the [Scala 
> documentation|http://www.scala-lang.org/api/2.10.4/index.html#scala.Any] for 
> {{hashCode}}, the implementation is platform specific. This means that 
> feature vectors created on one platform may be different than vectors created 
> on another platform. This can create significant problems when a model 
> trained offline is used in another environment for online prediction. The 
> problem is made harder by the fact that following a hashing transform 
> features lose human-tractable meaning and a problem such as this may be 
> extremely difficult to track down.
> Second, the native Scala hashing function performs badly on longer strings, 
> exhibiting [200-500% higher collision 
> rates|https://gist.github.com/ssimeonov/eb12fcda75615e4a8d46] than, for 
> example, 
> [MurmurHash3|http://www.scala-lang.org/api/2.10.4/#scala.util.hashing.MurmurHash3$]
>  which is also included in the standard Scala libraries and is the hashing 
> choice of fast learners such as Vowpal Wabbit, scikit-learn and others. If 
> Spark users apply {{HashingTF}} only to very short, dictionary-like strings 
> the hashing function choice will not be a big problem but why have an 
> implementation in MLlib with this limitation when there is a better 
> implementation readily available in the standard Scala library?
> Switching to MurmurHash3 solves both problems. If there is agreement that 
> this is a good change, I can prepare a PR. 
> Note that changing the hash function would mean that models saved with a 
> previous version would have to be re-trained. This introduces a problem 
> that's orthogonal to breaking changes in APIs: breaking changes related to 
> artifacts, e.g., a saved model, produced by a previous version. Is there a 
> policy or best practice currently in effect about this? If not, perhaps we 
> should come up with a few simple rules about how we communicate these in 
> release notes, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13886) ArrayType of BinaryType not supported in Row.equals method

2016-03-19 Thread MahmoudHanafy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15198763#comment-15198763
 ] 

MahmoudHanafy commented on SPARK-13886:
---

I think List extends Seq !!
In this case, How can you differentiate between:

1- ArrayType(ByteType) => Seq[Byte]
2- BinaryType => Array[Byte]
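
One way to sidestep the Seq-vs-Array ambiguity is to compare the runtime values 
structurally rather than relying on the schema; a sketch of such a deep comparison 
helper (not the actual Row.equals fix, and it only special-cases Array[Byte]):

{code}
// Sketch: compare two row values structurally, handling Array[Byte] (BinaryType)
// by content and recursing into Seq (ArrayType) and Map (MapType) values.
def deepEquals(a: Any, b: Any): Boolean = (a, b) match {
  case (x: Array[Byte], y: Array[Byte]) => java.util.Arrays.equals(x, y)
  case (x: Seq[_], y: Seq[_]) =>
    x.length == y.length && x.zip(y).forall { case (l, r) => deepEquals(l, r) }
  case (x: Map[_, _], y: Map[_, _]) =>
    x.keySet == y.keySet &&
      x.forall { case (k, v) => deepEquals(v, y.asInstanceOf[Map[Any, Any]](k)) }
  case _ => a == b
}

deepEquals(Seq(Array(1.toByte)), Seq(Array(1.toByte)))           // true
deepEquals(Map(1 -> Array(1.toByte)), Map(1 -> Array(1.toByte))) // true
{code}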

> ArrayType of BinaryType not supported in Row.equals method 
> ---
>
> Key: SPARK-13886
> URL: https://issues.apache.org/jira/browse/SPARK-13886
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: MahmoudHanafy
>Priority: Minor
>
> There are multiple types that are supported by Spark SQL. One of them is 
> ArrayType (Seq), which can have any element type, 
> so the element type can be BinaryType (Array\[Byte\]).
> In the equals method of the Row class, there is no handling for ArrayType of 
> BinaryType.
> So for example:
> {code:xml}
> val a = Row( Seq( Array(1.toByte) ) )
> val b = Row( Seq( Array(1.toByte) ) )
> a.equals(b) // this will return false
> {code}
> Also, this doesn't work for MapType of BinaryType.
> {code:scala}
> val a = Row( Map(1 -> Array(1.toByte) ) )
> val b = Row( Map(1 -> Array(1.toByte) ) )
> a.equals(b) // this will return false
> {code}
> Question 1: Can the key in a MapType be of BinaryType?
> Question 2: Isn't there another way to handle BinaryType, using a Scala type 
> instead of Array?
> I want to contribute by fixing this issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API

2016-03-19 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15203027#comment-15203027
 ] 

Cody Koeninger commented on SPARK-12177:


Unless I'm misunderstanding your point, those changes are all in my fork
already. Keeping a message handler for messageandmetadata doesn't make
sense. Backwards compatibility with the existing direct stream isn't really
workable.



> Update KafkaDStreams to new Kafka 0.9 Consumer API
> --
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 has already been released, and it introduces a new consumer API that is 
> not compatible with the old one. So I added the new consumer API as separate 
> classes in the package org.apache.spark.streaming.kafka.v09. I 
> didn't remove the old classes, for backward compatibility: users will not need 
> to change their existing Spark applications when they upgrade to the new Spark version.
> Please review my changes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13997) Use Hadoop 2.0 default value for compression in data sources

2016-03-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13997:


Assignee: (was: Apache Spark)

> Use Hadoop 2.0 default value for compression in data sources
> 
>
> Key: SPARK-13997
> URL: https://issues.apache.org/jira/browse/SPARK-13997
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Trivial
>
> Currently, the JSON, TEXT and CSV data sources use the {{CompressionCodecs}} class to 
> set compression configurations via {{option("compress", "codec")}}.
> I made this use the Hadoop 1.x default value (block-level compression). However, 
> the default value in Hadoop 2.x is record-level compression, as described in 
> [mapred-default.xml|https://hadoop.apache.org/docs/r2.7.1/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml].
> Since Spark 2.x drops Hadoop 1.x support, it makes sense to use the Hadoop 2.x default 
> values.
> According to [Hadoop: The Definitive Guide, 3rd 
> edition|https://www.safaribooksonline.com/library/view/hadoop-the-definitive/9781449328917/ch04.html],
>  these configurations control the unit of compression (record or block).
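
For reference, a hedged sketch of the Hadoop 2.x knobs being referred to (key names 
as listed in mapred-default.xml); how they would be exposed through a Spark data 
source option is not shown here and would depend on the eventual patch:

{code}
import org.apache.hadoop.conf.Configuration

// Hadoop 2.x output-compression settings. "RECORD" is the mapred-default.xml
// default; "BLOCK" is the Hadoop 1.x-era choice discussed above.
val hadoopConf = new Configuration()
hadoopConf.setBoolean("mapreduce.output.fileoutputformat.compress", true)
hadoopConf.set("mapreduce.output.fileoutputformat.compress.codec",
  "org.apache.hadoop.io.compress.GzipCodec")
hadoopConf.set("mapreduce.output.fileoutputformat.compress.type", "RECORD")  // or "BLOCK"
{code}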



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13865) TPCDS query 87 returns wrong results compared to TPC official result set

2016-03-19 Thread JESSE CHEN (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15200637#comment-15200637
 ] 

JESSE CHEN commented on SPARK-13865:


This may be a TPC toolkit issue. I will be looking into this with John on my team, 
who is one of the TPC board members. 

> TPCDS query 87 returns wrong results compared to TPC official result set 
> -
>
> Key: SPARK-13865
> URL: https://issues.apache.org/jira/browse/SPARK-13865
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: JESSE CHEN
>  Labels: tpcds-result-mismatch
>
> Testing Spark SQL using TPC queries. Query 87 returns wrong results compared 
> to official result set. This is at 1GB SF (validation run).
> SparkSQL returns count of 47555, answer set expects 47298.
> Actual results:
> {noformat}
> [47555]
> {noformat}
> {noformat}
> Expected:
> +---+
> | 1 |
> +---+
> | 47298 |
> +---+
> {noformat}
> Query used:
> {noformat}
> -- start query 87 in stream 0 using template query87.tpl and seed 
> QUALIFICATION
> select count(*) 
> from 
>  (select distinct c_last_name as cln1, c_first_name as cfn1, d_date as 
> ddate1, 1 as notnull1
>from store_sales
> JOIN date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk
> JOIN customer ON store_sales.ss_customer_sk = customer.c_customer_sk
>where
>  d_month_seq between 1200 and 1200+11
>) tmp1
>left outer join
>   (select distinct c_last_name as cln2, c_first_name as cfn2, d_date as 
> ddate2, 1 as notnull2
>from catalog_sales
> JOIN date_dim ON catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
> JOIN customer ON catalog_sales.cs_bill_customer_sk = 
> customer.c_customer_sk
>where 
>  d_month_seq between 1200 and 1200+11
>) tmp2 
>   on (tmp1.cln1 = tmp2.cln2)
>   and (tmp1.cfn1 = tmp2.cfn2)
>   and (tmp1.ddate1= tmp2.ddate2)
>left outer join
>   (select distinct c_last_name as cln3, c_first_name as cfn3 , d_date as 
> ddate3, 1 as notnull3
>from web_sales
> JOIN date_dim ON web_sales.ws_sold_date_sk = date_dim.d_date_sk
> JOIN customer ON web_sales.ws_bill_customer_sk = 
> customer.c_customer_sk
>where 
>  d_month_seq between 1200 and 1200+11
>) tmp3 
>   on (tmp1.cln1 = tmp3.cln3)
>   and (tmp1.cfn1 = tmp3.cfn3)
>   and (tmp1.ddate1= tmp3.ddate3)
> where  
> notnull2 is null and notnull3 is null  
> ;
> -- end query 87 in stream 0 using template query87.tpl
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13995) Constraints should take care of Cast

2016-03-19 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-13995:
---

 Summary: Constraints should take care of Cast
 Key: SPARK-13995
 URL: https://issues.apache.org/jira/browse/SPARK-13995
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Liang-Chi Hsieh


We infer constraints from a logical plan's expressions. However, we don't take 
Cast expressions into account yet. A small illustration follows.
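
For illustration, a small example of the kind of plan involved (standard DataFrame 
API, assuming a {{sqlContext}} is in scope as in spark-shell); it only shows where 
the Cast appears, not the fix:

{code}
val df = sqlContext.range(10).selectExpr("CAST(id AS INT) AS a")

// The predicate is on CAST(a AS BIGINT) rather than on a directly; constraint
// inference should ideally still conclude that a cannot be null here.
df.filter("CAST(a AS BIGINT) = 5").explain(true)
{code}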



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13967) Add binary toggle Param to PySpark CountVectorizer

2016-03-19 Thread Nick Pentreath (JIRA)
Nick Pentreath created SPARK-13967:
--

 Summary: Add binary toggle Param to PySpark CountVectorizer
 Key: SPARK-13967
 URL: https://issues.apache.org/jira/browse/SPARK-13967
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: Nick Pentreath
Priority: Minor


See SPARK-13629



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12719) SQL generation support for generators (including UDTF)

2016-03-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15199852#comment-15199852
 ] 

Apache Spark commented on SPARK-12719:
--

User 'yy2016' has created a pull request for this issue:
https://github.com/apache/spark/pull/11787

> SQL generation support for generators (including UDTF)
> --
>
> Key: SPARK-12719
> URL: https://issues.apache.org/jira/browse/SPARK-12719
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>
> {{HiveCompatibilitySuite}} can be useful for bootstrapping test coverage. 
> Please refer to SPARK-11012 for more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13960) HTTP-based JAR Server doesn't respect spark.driver.host and there is no "spark.fileserver.host" option

2016-03-19 Thread Ilya Ostrovskiy (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ilya Ostrovskiy updated SPARK-13960:

Description: 
There is no option to specify which hostname/IP address the jar/file server 
listens on, and rather than using "spark.driver.host" if specified, the 
jar/file server will listen on the system's primary IP address. This is an 
issue when submitting an application in client mode on a machine with two NICs 
connected to two different networks. 

Steps to reproduce:

1) Have a cluster in a remote network, whose master is on 192.168.255.10
2) Have a machine at another location, with a "primary" IP address of 
192.168.1.2, connected to the "remote network" as well, with the IP address 
192.168.255.250. Let's call this the "client machine".
3) Ensure every machine in the spark cluster at the remote location can ping 
192.168.255.250 and reach the client machine via that address.
4) On the client: 
{noformat}
spark-submit --deploy-mode client --conf "spark.driver.host=192.168.255.250" 
--master spark://192.168.255.10:7077 --class  
 
{noformat}
5) Navigate to http://192.168.255.250:4040/ and ensure that executors from the 
remote cluster have found the driver on the client machine
6) Navigate to http://192.168.255.250:4040/environment/, and scroll to the 
bottom
7) Observe that the JAR you specified in Step 4 will be listed under 
http://192.168.1.2:/jars/.jar

  was:
There is no option to specify which hostname/IP address the jar/file server 
listens on, and rather than using "spark.driver.host" if specified, the 
jar/file server will listen on the system's primary IP address. This is an 
issue when submitting an application in client mode on a machine with two NICs 
connected to two different networks. 

Steps to reproduce:

1) Have a cluster in a remote network, whose master is on 192.168.255.10
2) Have a machine at another location, with a "primary" IP address of 
"192.168.1.2", connected to the "remote network" as well, with the IP address 
"192.168.255.250". Let's call this the "client machine".
3) Ensure every machine in the spark cluster at the remote location can ping 
"192.168.255.250" and reach the client machine via that address.
4) On the client: 
{noformat}
spark-submit --deploy-mode client --conf "spark.driver.host=192.168.255.250" 
--master spark://192.168.255.10:7077 --class  
 
{noformat}
5) Navigate to "http://192.168.255.250:4040/" and ensure that executors from 
the remote cluster have found the driver on the client machine
6) Navigate to "http://192.168.255.250:4040/environment/", and scroll to the 
bottom
7) Observe that the JAR you specified in Step 4 will be listed under 
"http://192.168.1.2:/jars/.jar"
8) Grok source and documentation to see if there's any way to change that
9) Submit this issue


> HTTP-based JAR Server doesn't respect spark.driver.host and there is no 
> "spark.fileserver.host" option
> --
>
> Key: SPARK-13960
> URL: https://issues.apache.org/jira/browse/SPARK-13960
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Submit
>Affects Versions: 1.6.1
> Environment: Any system with more than one IP address
>Reporter: Ilya Ostrovskiy
>
> There is no option to specify which hostname/IP address the jar/file server 
> listens on, and rather than using "spark.driver.host" if specified, the 
> jar/file server will listen on the system's primary IP address. This is an 
> issue when submitting an application in client mode on a machine with two 
> NICs connected to two different networks. 
> Steps to reproduce:
> 1) Have a cluster in a remote network, whose master is on 192.168.255.10
> 2) Have a machine at another location, with a "primary" IP address of 
> 192.168.1.2, connected to the "remote network" as well, with the IP address 
> 192.168.255.250. Let's call this the "client machine".
> 3) Ensure every machine in the spark cluster at the remote location can ping 
> 192.168.255.250 and reach the client machine via that address.
> 4) On the client: 
> {noformat}
> spark-submit --deploy-mode client --conf "spark.driver.host=192.168.255.250" 
> --master spark://192.168.255.10:7077 --class  
>  
> {noformat}
> 5) Navigate to http://192.168.255.250:4040/ and ensure that executors from 
> the remote cluster have found the driver on the client machine
> 6) Navigate to http://192.168.255.250:4040/environment/, and scroll to the 
> bottom
> 7) Observe that the JAR you specified in Step 4 will be listed under 
> http://192.168.1.2:/jars/.jar



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: 

[jira] [Assigned] (SPARK-13992) Add support for off-heap caching

2016-03-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13992:


Assignee: Josh Rosen  (was: Apache Spark)

> Add support for off-heap caching
> 
>
> Key: SPARK-13992
> URL: https://issues.apache.org/jira/browse/SPARK-13992
> Project: Spark
>  Issue Type: New Feature
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> We should add support for caching serialized data off-heap within the same 
> process (i.e. using direct buffers or sun.misc.unsafe).
> I'll expand this JIRA later with more detail (filing now as a placeholder).
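
For context, a minimal sketch of the general idea (serialized bytes held in direct 
buffers outside the garbage-collected heap); this is only an illustration, not the 
planned design:

{code}
import java.nio.ByteBuffer

// Store a serialized block off-heap and read it back later.
def cacheOffHeap(serialized: Array[Byte]): ByteBuffer = {
  val buf = ByteBuffer.allocateDirect(serialized.length)  // off-heap allocation
  buf.put(serialized)
  buf.flip()
  buf
}

def readBack(buf: ByteBuffer): Array[Byte] = {
  val out = new Array[Byte](buf.remaining())
  buf.duplicate().get(out)  // duplicate so the cached buffer's position is untouched
  out
}
{code}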



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13967) Add binary toggle Param to PySpark CountVectorizer

2016-03-19 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15201339#comment-15201339
 ] 

Nick Pentreath commented on SPARK-13967:


[~yuhaoyan] or [~bryanc] would you like to take this?

> Add binary toggle Param to PySpark CountVectorizer
> --
>
> Key: SPARK-13967
> URL: https://issues.apache.org/jira/browse/SPARK-13967
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Nick Pentreath
>Priority: Minor
>
> See SPARK-13629



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13989) Remove non-vectorized/unsafe-row parquet record reader

2016-03-19 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-13989.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11799
[https://github.com/apache/spark/pull/11799]

> Remove non-vectorized/unsafe-row parquet record reader
> --
>
> Key: SPARK-13989
> URL: https://issues.apache.org/jira/browse/SPARK-13989
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Sameer Agarwal
>Priority: Minor
> Fix For: 2.0.0
>
>
> Clean up the new parquet record reader by removing the non-vectorized parquet 
> reader code from `UnsafeRowParquetRecordReader`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14016) Support high-precision decimals in vectorized parquet reader

2016-03-19 Thread Sameer Agarwal (JIRA)
Sameer Agarwal created SPARK-14016:
--

 Summary: Support high-precision decimals in vectorized parquet 
reader
 Key: SPARK-14016
 URL: https://issues.apache.org/jira/browse/SPARK-14016
 Project: Spark
  Issue Type: Sub-task
Reporter: Sameer Agarwal






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13986) Make `DeveloperApi`-annotated things public

2016-03-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15200489#comment-15200489
 ] 

Apache Spark commented on SPARK-13986:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/11797

> Make `DeveloperApi`-annotated things public
> ---
>
> Key: SPARK-13986
> URL: https://issues.apache.org/jira/browse/SPARK-13986
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, Spark Core
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Spark uses `@DeveloperApi` annotation, but sometimes it seems to conflict 
> with its visibility. This issue proposes to fix those conflict. The following 
> is the example.
> {code:title=JobResult.scala|borderStyle=solid}
> @DeveloperApi
> sealed trait JobResult
> @DeveloperApi
> case object JobSucceeded extends JobResult
> @DeveloperApi
> -private[spark] case class JobFailed(exception: Exception) extends JobResult
> +case class JobFailed(exception: Exception) extends JobResult
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API

2016-03-19 Thread Eugene Miretsky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15203018#comment-15203018
 ] 

Eugene Miretsky commented on SPARK-12177:
-

The new Kafka Java Consumer is using Deserializer instead of Decoder. The 
difference is not too big (extra type safety, and Deserializer::deserialize  
accepts a topic and a byte payload, while Decoder::fromBytes accepts only a 
byte payload), but still it would be nice to align with the new Kafka consumer. 
Would it make sense to replace Decoder with Deserializer in the new 
DirectStream? This would require getting rid of MessageAndMetadata, and hence 
breaking backwards compatibility with the existing  DirectStream, but I guess 
it will have to be done at some point. 
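
For reference, a rough sketch of the API difference, with a hypothetical adapter from 
the old {{Decoder}} to the new {{Deserializer}} (illustrative only, not part of any 
proposal here):

{code}
import java.util.{Map => JMap}
import kafka.serializer.Decoder
import org.apache.kafka.common.serialization.Deserializer

// Old API: fromBytes(bytes). New API: deserialize(topic, bytes) plus lifecycle hooks.
class DecoderDeserializer[T](decoder: Decoder[T]) extends Deserializer[T] {
  override def configure(configs: JMap[String, _], isKey: Boolean): Unit = ()
  override def deserialize(topic: String, data: Array[Byte]): T = decoder.fromBytes(data)
  override def close(): Unit = ()
}
{code}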


> Update KafkaDStreams to new Kafka 0.9 Consumer API
> --
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 has already been released, and it introduces a new consumer API that is 
> not compatible with the old one. So I added the new consumer API as separate 
> classes in the package org.apache.spark.streaming.kafka.v09. I 
> didn't remove the old classes, for backward compatibility: users will not need 
> to change their existing Spark applications when they upgrade to the new Spark version.
> Please review my changes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13976) do not remove sub-queries added by user when generate SQL

2016-03-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13976:


Assignee: Apache Spark

> do not remove sub-queries added by user when generate SQL
> -
>
> Key: SPARK-13976
> URL: https://issues.apache.org/jira/browse/SPARK-13976
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13940) Predicate Transitive Closure Transformation

2016-03-19 Thread Alex Antonov (JIRA)
Alex Antonov created SPARK-13940:


 Summary: Predicate Transitive Closure Transformation
 Key: SPARK-13940
 URL: https://issues.apache.org/jira/browse/SPARK-13940
 Project: Spark
  Issue Type: Improvement
  Components: Optimizer
Affects Versions: 1.6.0
Reporter: Alex Antonov


A relatively simple transformation is missing from Catalyst's arsenal - 
generation of transitive predicates. For instance, if you have got the 
following query:
{code}
select * 
from   table1 t1
join   table2 t2
on t1.a = t2.b
where  t1.a = 42
{code}
then it is a fair assumption that t2.b also equals 42, hence an additional 
predicate could be generated. The additional predicate could in turn be pushed 
down through the join and improve performance of the whole query by filtering 
out the data before joining it.
Such a transformation exists in Oracle DB and is called transitive closure, which 
hopefully explains the title of this Jira. (A tiny sketch of the inference follows.)
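
A tiny, Catalyst-independent sketch of the inference itself (attribute names as plain 
strings, equality-only predicates, purely illustrative):

{code}
// Given join equalities (a = b) and literal filters (a = 42), derive the
// transitive filters (b = 42) that could be pushed below the join.
def inferTransitive(equalities: Seq[(String, String)],
                    literalFilters: Map[String, Any]): Map[String, Any] =
  equalities.flatMap { case (l, r) =>
    literalFilters.get(l).map(r -> _).toSeq ++ literalFilters.get(r).map(l -> _).toSeq
  }.toMap -- literalFilters.keys

// inferTransitive(Seq("t1.a" -> "t2.b"), Map("t1.a" -> 42))  ==>  Map("t2.b" -> 42)
{code}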



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13937) PySpark ML JavaWrapper, variable _java_obj should not be static

2016-03-19 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-13937:
--
Priority: Trivial  (was: Minor)

> PySpark ML JavaWrapper, variable _java_obj should not be static
> ---
>
> Key: SPARK-13937
> URL: https://issues.apache.org/jira/browse/SPARK-13937
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Reporter: Bryan Cutler
>Priority: Trivial
> Fix For: 2.0.0
>
>
> In PySpark ML wrapper.py, the abstract class {{JavaWrapper}} has a static 
> variable {{_java_obj}}.  This is meant to hold an instance of a companion 
> Java object.  It seems as though it was made static accidentally because it 
> is never used, and all assignments done in derived classes are done to a 
> member variable with {{self._java_obj}}.  This does not cause any problems 
> with the current functionality, but it should be changed so as not to cause 
> any confusion and misuse in the future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13938) word2phrase feature created in ML

2016-03-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13938:
--

[~s4weng] "Critical" is inappropriate here. Please read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark first.

It's great you're implementing things on Spark, but they generally don't belong 
in Spark itself. I'm going to close this but you can start by providing your 
package via spark-packages.org

> word2phrase feature created in ML
> -
>
> Key: SPARK-13938
> URL: https://issues.apache.org/jira/browse/SPARK-13938
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Steve Weng
>Priority: Critical
>   Original Estimate: 840h
>  Remaining Estimate: 840h
>
> I implemented word2phrase (see http://arxiv.org/pdf/1310.4546.pdf), which 
> transforms a sentence into one where certain consecutive 
> words are concatenated, using a trained model/estimator (e.g. "I went to 
> New York" becomes "I went to new_york").



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-913) log the size of each shuffle block in block manager

2016-03-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-913:
--

Assignee: Apache Spark

> log the size of each shuffle block in block manager
> ---
>
> Key: SPARK-913
> URL: https://issues.apache.org/jira/browse/SPARK-913
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Reporter: Reynold Xin
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13973) `ipython notebook` is going away...

2016-03-19 Thread Bogdan Pirvu (JIRA)
Bogdan Pirvu created SPARK-13973:


 Summary: `ipython notebook` is going away...
 Key: SPARK-13973
 URL: https://issues.apache.org/jira/browse/SPARK-13973
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
 Environment: spark-1.6.1-bin-hadoop2.6
Anaconda2-2.5.0-Linux-x86_64
Reporter: Bogdan Pirvu
Priority: Trivial


Starting {{pyspark}} with following environment variables:

{code:none}
export IPYTHON=1
export IPYTHON_OPTS="notebook --no-browser"
{code}

yields this warning

{code:none}
[TerminalIPythonApp] WARNING | Subcommand `ipython notebook` is deprecated and 
will be removed in future versions.
[TerminalIPythonApp] WARNING | You likely want to use `jupyter notebook`... 
continue in 5 sec. Press Ctrl-C to quit now.
{code}


Changing line 52 from
{code:none}
PYSPARK_DRIVER_PYTHON="ipython"
{code}
to
{code:none}
PYSPARK_DRIVER_PYTHON="jupyter"
{code}
in https://github.com/apache/spark/blob/master/bin/pyspark works for me to 
solve this issue, but I'm not sure if it's sustainable as I'm not familiar with 
the rest of the code...

This is the relevant part of my Python environment:

{code:none}
ipython   4.1.2py27_0  
ipython-genutils  0.1.0 
ipython_genutils  0.1.0py27_0  
ipywidgets4.1.1py27_0  
...
jupyter   1.0.0py27_1  
jupyter-client4.2.1 
jupyter-console   4.1.1 
jupyter-core  4.1.0 
jupyter_client4.2.1py27_0  
jupyter_console   4.1.1py27_0  
jupyter_core  4.1.0py27_0
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13974) sub-query names do not need to be globally unique while generate SQL

2016-03-19 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-13974:
---

 Summary: sub-query names do not need to be globally unique while 
generate SQL
 Key: SPARK-13974
 URL: https://issues.apache.org/jira/browse/SPARK-13974
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13360) pyspark related enviroment variable is not propagated to driver in yarn-cluster mode

2016-03-19 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-13360.

   Resolution: Fixed
 Assignee: Jeff Zhang
Fix Version/s: 2.0.0

> pyspark related enviroment variable is not propagated to driver in 
> yarn-cluster mode
> 
>
> Key: SPARK-13360
> URL: https://issues.apache.org/jira/browse/SPARK-13360
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, YARN
>Affects Versions: 1.6.0
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
> Fix For: 2.0.0
>
>
> Such as PYSPARK_DRIVER_PYTHON, PYSPARK_PYTHON, PYTHONHASHSEED.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14001) support multi-children Union in SQLBuilder

2016-03-19 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-14001:
---
Assignee: Wenchen Fan

> support multi-children Union in SQLBuilder
> --
>
> Key: SPARK-14001
> URL: https://issues.apache.org/jira/browse/SPARK-14001
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13877) Consider removing Kafka modules from Spark / Spark Streaming

2016-03-19 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15200306#comment-15200306
 ] 

Hari Shreedharan commented on SPARK-13877:
--

You could have separate repos and separate releases, and keep the same package 
names simply by doing sub-projects. Can you explain what the overhead is and 
what tools you are concerned about? 
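
For what it's worth, a minimal build.sbt sketch of such a sub-project layout (module 
names and the organization coordinate are made up for illustration):

{code}
// build.sbt of a hypothetical standalone Kafka-connector repository with two
// sub-modules, each releasable on its own but keeping the same package names.
lazy val commonSettings = Seq(
  organization := "org.example.spark",  // placeholder coordinate
  scalaVersion := "2.11.8"
)

lazy val kafka08 = (project in file("kafka-0-8"))
  .settings(commonSettings: _*)
  .settings(name := "streaming-kafka-0-8")

lazy val kafka09 = (project in file("kafka-0-9"))
  .settings(commonSettings: _*)
  .settings(name := "streaming-kafka-0-9")
{code}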

> Consider removing Kafka modules from Spark / Spark Streaming
> 
>
> Key: SPARK-13877
> URL: https://issues.apache.org/jira/browse/SPARK-13877
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, Streaming
>Affects Versions: 1.6.1
>Reporter: Hari Shreedharan
>
> Based on the discussion the PR for SPARK-13843 
> ([here|https://github.com/apache/spark/pull/11672#issuecomment-196553283]), 
> we should consider moving the Kafka modules out of Spark as well. 
> Providing newer functionality (like security) has become painful while 
> maintaining compatibility with older versions of Kafka. Moving this out 
> allows more flexibility, allowing users to mix and match Kafka and Spark 
> versions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13993) PySpark ml.feature.RFormula/RFormulaModel support export/import

2016-03-19 Thread Xusen Yin (JIRA)
Xusen Yin created SPARK-13993:
-

 Summary: PySpark ml.feature.RFormula/RFormulaModel support 
export/import
 Key: SPARK-13993
 URL: https://issues.apache.org/jira/browse/SPARK-13993
 Project: Spark
  Issue Type: Sub-task
  Components: ML, PySpark
Reporter: Xusen Yin
Priority: Minor


Add save/load for RFormula and its model.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13937) PySpark ML JavaWrapper, variable _java_obj should not be static

2016-03-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13937:


Assignee: (was: Apache Spark)

> PySpark ML JavaWrapper, variable _java_obj should not be static
> ---
>
> Key: SPARK-13937
> URL: https://issues.apache.org/jira/browse/SPARK-13937
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Reporter: Bryan Cutler
>Priority: Minor
>
> In PySpark ML wrapper.py, the abstract class {{JavaWrapper}} has a static 
> variable {{_java_obj}}.  This is meant to hold an instance of a companion 
> Java object.  It seems as though it was made static accidentally because it 
> is never used, and all assignments done in derived classes are done to a 
> member variable with {{self._java_obj}}.  This does not cause any problems 
> with the current functionality, but it should be changed so as not to cause 
> any confusion and misuse in the future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13955) Spark in yarn mode fails

2016-03-19 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15199198#comment-15199198
 ] 

Sean Owen commented on SPARK-13955:
---

Is this likely? The YARN tests succeed. There isn't much detail here, like what you 
are running.

> Spark in yarn mode fails
> 
>
> Key: SPARK-13955
> URL: https://issues.apache.org/jira/browse/SPARK-13955
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Jeff Zhang
>
> It seems the Spark assembly jar is not uploaded to the AM. This may be a known 
> issue from the work on SPARK-11157; creating this ticket to track it. 
> [~vanzin]
> {noformat}
> 16/03/17 11:58:59 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive 
> is set, falling back to uploading libraries under SPARK_HOME.
> 16/03/17 11:58:59 INFO Client: Uploading resource 
> file:/Users/jzhang/github/spark/lib/apache-rat-0.10.jar -> 
> hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0004/apache-rat-0.10.jar
> 16/03/17 11:58:59 INFO Client: Uploading resource 
> file:/Users/jzhang/github/spark/lib/apache-rat-0.11.jar -> 
> hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0004/apache-rat-0.11.jar
> 16/03/17 11:59:00 INFO Client: Uploading resource 
> file:/private/var/folders/dp/hmchg5dd3vbcvds26q91spdwgp/T/spark-36cacbad-ca5b-482b-8ca8-607499acaaba/__spark_conf__4427292248554277597.zip
>  -> 
> hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0004/__spark_conf__4427292248554277597.zip
> {noformat}
> message in AM container
> {noformat}
> Error: Could not find or load main class 
> org.apache.spark.deploy.yarn.ExecutorLauncher
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13719) Bad JSON record raises java.lang.ClassCastException

2016-03-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13719:


Assignee: (was: Apache Spark)

> Bad JSON record raises java.lang.ClassCastException
> 
>
> Key: SPARK-13719
> URL: https://issues.apache.org/jira/browse/SPARK-13719
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.0
> Environment: OS X, Linux
>Reporter: dmtran
>Priority: Minor
>
> I have defined a JSON schema, using org.apache.spark.sql.types.StructType, 
> that expects this kind of record :
> {noformat}
> {
>   "request": {
> "user": {
>   "id": 123
> }
>   }
> }
> {noformat}
> There's a bad record in my dataset, that defines field "user" as an array, 
> instead of a JSON object :
> {noformat}
> {
>   "request": {
> "user": []
>   }
> }
> {noformat}
> The following exception is raised because of that bad record :
> {noformat}
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 7 in stage 0.0 failed 4 times, most recent failure: 
> Lost task 7.3 in stage 0.0 (TID 10, 192.168.1.170): 
> java.lang.ClassCastException: org.apache.spark.sql.types.GenericArrayData 
> cannot be cast to org.apache.spark.sql.catalyst.InternalRow
>   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getStruct(rows.scala:50)
>   at 
> org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getStruct(rows.scala:247)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:67)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:67)
>   at 
> org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:117)
>   at 
> org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:115)
>   at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:97)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
>   at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Here's a code snippet that reproduces the exception :
> {noformat}
> import org.apache.spark.SparkContext
> import org.apache.spark.rdd.RDD
> import org.apache.spark.sql.{SQLContext, DataFrame}
> import org.apache.spark.sql.hive.HiveContext
> import org.apache.spark.sql.types.{StringType, StructField, StructType}
> object Snippet {
>   def main(args : Array[String]): Unit = {
> val sc = new SparkContext()
> implicit val sqlContext = new HiveContext(sc)
> val rdd: RDD[String] = sc.parallelize(Seq(badRecord))
> val df: DataFrame = sqlContext.read.schema(schema).json(rdd)
> import sqlContext.implicits._
> df.select("request.user.id")
>   .filter($"id".isNotNull)
>   .count()
>   }
>   val badRecord =
> s"""{
> |  "request": {
> |"user": []
> |  }
> |}""".stripMargin.replaceAll("\n", " ") // Convert the multiline 
> string to a single-line string
>   val schema =
> StructType(
>   StructField("request", 

[jira] [Commented] (SPARK-13864) TPCDS query 74 returns wrong results compared to TPC official result set

2016-03-19 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15198602#comment-15198602
 ] 

Xiao Li commented on SPARK-13864:
-

This is the same issue as SPARK-13862. I think we can close this. Thanks!

> TPCDS query 74 returns wrong results compared to TPC official result set 
> -
>
> Key: SPARK-13864
> URL: https://issues.apache.org/jira/browse/SPARK-13864
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: JESSE CHEN
>  Labels: tpcds-result-mismatch
>
> Testing Spark SQL using TPC queries. Query 74 returns wrong results compared 
> to official result set. This is at 1GB SF (validation run).
> Spark SQL has right answer but in wrong order (and there is an 'order by' in 
> the query).
> Actual results:
> {noformat}
> [BLEIBAAA,Paula,Wakefield]
> [DFIEBAAA,John,Gray]
> [OCLBBAAA,null,null]
> [PKBCBAAA,Andrea,White]
> [EJDL,Alice,Wright]
> [FACE,Priscilla,Miller]
> [LFKK,Ignacio,Miller]
> [LJNCBAAA,George,Gamez]
> [LIOP,Derek,Allen]
> [EADJ,Ruth,Carroll]
> [JGMM,Richard,Larson]
> [PKIK,Wendy,Horvath]
> [FJHF,Larissa,Roy]
> [EPOG,Felisha,Mendes]
> [EKJL,Aisha,Carlson]
> [HNFH,Rebecca,Wilson]
> [IBFCBAAA,Ruth,Grantham]
> [OPDL,Ann,Pence]
> [NIPL,Eric,Lawrence]
> [OCIC,Zachary,Pennington]
> [OFLC,James,Taylor]
> [GEHI,Tyler,Miller]
> [CADP,Cristobal,Thomas]
> [JIAL,Santos,Gutierrez]
> [PMMBBAAA,Paul,Jordan]
> [DIIO,David,Carroll]
> [DFKABAAA,Latoya,Craft]
> [HMOI,Grace,Henderson]
> [PPIBBAAA,Candice,Lee]
> [JONHBAAA,Warren,Orozco]
> [GNDA,Terry,Mcdowell]
> [CIJM,Elizabeth,Thomas]
> [DIJGBAAA,Ruth,Sanders]
> [NFBDBAAA,Vernice,Fernandez]
> [IDKF,Michael,Mack]
> [IMHB,Kathy,Knowles]
> [LHMC,Brooke,Nelson]
> [CFCGBAAA,Marcus,Sanders]
> [NJHCBAAA,Christopher,Schreiber]
> [PDFB,Terrance,Banks]
> [ANFA,Philip,Banks]
> [IADEBAAA,Diane,Aldridge]
> [ICHF,Linda,Mccoy]
> [CFEN,Christopher,Dawson]
> [KOJJ,Gracie,Mendoza]
> [FOJA,Don,Castillo]
> [FGPG,Albert,Wadsworth]
> [KJBK,Georgia,Scott]
> [EKFP,Annika,Chin]
> [IBAEBAAA,Sandra,Wilson]
> [MFFL,Margret,Gray]
> [KNAK,Gladys,Banks]
> [CJDI,James,Kerr]
> [OBADBAAA,Elizabeth,Burnham]
> [AMGD,Kenneth,Harlan]
> [HJLA,Audrey,Beltran]
> [AOPFBAAA,Jerry,Fields]
> [CNAGBAAA,Virginia,May]
> [HGOABAAA,Sonia,White]
> [KBCABAAA,Debra,Bell]
> [NJAG,Allen,Hood]
> [MMOBBAAA,Margaret,Smith]
> [NGDBBAAA,Carlos,Jewell]
> [FOGI,Michelle,Greene]
> [JEKFBAAA,Norma,Burkholder]
> [OCAJ,Jenna,Staton]
> [PFCL,Felicia,Neville]
> [DLHBBAAA,Henry,Bertrand]
> [DBEFBAAA,Bennie,Bowers]
> [DCKO,Robert,Gonzalez]
> [KKGE,Katie,Dunbar]
> [GFMDBAAA,Kathleen,Gibson]
> [IJEM,Charlie,Cummings]
> [KJBL,Kerry,Davis]
> [JKBN,Julie,Kern]
> [MDCA,Louann,Hamel]
> [EOAK,Molly,Benjamin]
> [IBHH,Jennifer,Ballard]
> [PJEN,Ashley,Norton]
> [KLHHBAAA,Manuel,Castaneda]
> [IMHHBAAA,Lillian,Davidson]
> [GHPBBAAA,Nick,Mendez]
> [BNBB,Irma,Smith]
> [FBAH,Michael,Williams]
> [PEHEBAAA,Edith,Molina]
> [FMHI,Emilio,Darling]
> [KAEC,Milton,Mackey]
> [OCDJ,Nina,Sanchez]
> [FGIG,Eduardo,Miller]
> [FHACBAAA,null,null]
> [HMJN,Ryan,Baptiste]
> [HHCABAAA,William,Stewart]
> {noformat}
> Expected results:
> {noformat}
> +--+-++
> | CUSTOMER_ID  | CUSTOMER_FIRST_NAME | CUSTOMER_LAST_NAME |
> +--+-++
> | AMGD | Kenneth | Harlan |
> | ANFA | Philip  | Banks  |
> | AOPFBAAA | Jerry   | Fields |
> | BLEIBAAA | Paula   | Wakefield  |
> | BNBB | Irma| Smith  |
> | CADP | Cristobal   | Thomas |
> | CFCGBAAA | Marcus  | Sanders|
> | CFEN | 

[jira] [Updated] (SPARK-12719) SQL generation support for generators (including UDTF)

2016-03-19 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-12719:
---
Assignee: Wenchen Fan

> SQL generation support for generators (including UDTF)
> --
>
> Key: SPARK-12719
> URL: https://issues.apache.org/jira/browse/SPARK-12719
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Wenchen Fan
>
> {{HiveCompatibilitySuite}} can be useful for bootstrapping test coverage. 
> Please refer to SPARK-11012 for more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13865) TPCDS query 87 returns wrong results compared to TPC official result set

2016-03-19 Thread JESSE CHEN (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15200886#comment-15200886
 ] 

JESSE CHEN commented on SPARK-13865:


You rock!

> TPCDS query 87 returns wrong results compared to TPC official result set 
> -
>
> Key: SPARK-13865
> URL: https://issues.apache.org/jira/browse/SPARK-13865
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: JESSE CHEN
>  Labels: tpcds-result-mismatch
>
> Testing Spark SQL using TPC queries. Query 87 returns wrong results compared 
> to official result set. This is at 1GB SF (validation run).
> SparkSQL returns count of 47555, answer set expects 47298.
> Actual results:
> {noformat}
> [47555]
> {noformat}
> {noformat}
> Expected:
> +---+
> | 1 |
> +---+
> | 47298 |
> +---+
> {noformat}
> Query used:
> {noformat}
> -- start query 87 in stream 0 using template query87.tpl and seed 
> QUALIFICATION
> select count(*) 
> from 
>  (select distinct c_last_name as cln1, c_first_name as cfn1, d_date as 
> ddate1, 1 as notnull1
>from store_sales
> JOIN date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk
> JOIN customer ON store_sales.ss_customer_sk = customer.c_customer_sk
>where
>  d_month_seq between 1200 and 1200+11
>) tmp1
>left outer join
>   (select distinct c_last_name as cln2, c_first_name as cfn2, d_date as 
> ddate2, 1 as notnull2
>from catalog_sales
> JOIN date_dim ON catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
> JOIN customer ON catalog_sales.cs_bill_customer_sk = 
> customer.c_customer_sk
>where 
>  d_month_seq between 1200 and 1200+11
>) tmp2 
>   on (tmp1.cln1 = tmp2.cln2)
>   and (tmp1.cfn1 = tmp2.cfn2)
>   and (tmp1.ddate1= tmp2.ddate2)
>left outer join
>   (select distinct c_last_name as cln3, c_first_name as cfn3 , d_date as 
> ddate3, 1 as notnull3
>from web_sales
> JOIN date_dim ON web_sales.ws_sold_date_sk = date_dim.d_date_sk
> JOIN customer ON web_sales.ws_bill_customer_sk = 
> customer.c_customer_sk
>where 
>  d_month_seq between 1200 and 1200+11
>) tmp3 
>   on (tmp1.cln1 = tmp3.cln3)
>   and (tmp1.cfn1 = tmp3.cfn3)
>   and (tmp1.ddate1= tmp3.ddate3)
> where  
> notnull2 is null and notnull3 is null  
> ;
> -- end query 87 in stream 0 using template query87.tpl
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13886) ArrayType of BinaryType not supported in Row.equals method

2016-03-19 Thread Rishabh Bhardwaj (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15198785#comment-15198785
 ] 

Rishabh Bhardwaj commented on SPARK-13886:
--

If we go through the implementation of `a.equals(b)` in Row, the comparison boils 
down to comparing `Array(1.toByte) == Array(1.toByte)`. Since Scala uses the JVM's 
native Array, whose == is reference equality, this comparison returns false. This 
is not the case if you use List. This is explained in detail here: 
http://goo.gl/1zVjnx
Correct me if I am going off track here. (A small demonstration is below.)
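
A quick plain-Scala demonstration of the point:

{code}
val x = Array(1.toByte)
val y = Array(1.toByte)

x == y                         // false: JVM arrays compare by reference
x.sameElements(y)              // true:  element-wise comparison
java.util.Arrays.equals(x, y)  // true:  what a BinaryType-aware equals would use

List(1.toByte) == List(1.toByte)  // true: Scala collections compare structurally
{code}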

> ArrayType of BinaryType not supported in Row.equals method 
> ---
>
> Key: SPARK-13886
> URL: https://issues.apache.org/jira/browse/SPARK-13886
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: MahmoudHanafy
>Priority: Minor
>
> There are multiple types supported by Spark SQL. One of them is 
> ArrayType(Seq), whose elements can be of any type, 
> so the element type can be BinaryType(Array\[Byte\]).
> The equals method of the Row class has no special handling for ArrayType of 
> BinaryType.
> So for example:
> {code:scala}
> val a = Row( Seq( Array(1.toByte) ) )
> val b = Row( Seq( Array(1.toByte) ) )
> a.equals(b) // this will return false
> {code}
> Also, this doesn't work for MapType of BinaryType.
> {code:scala}
> val a = Row( Map(1 -> Array(1.toByte) ) )
> val b = Row( Map(1 -> Array(1.toByte) ) )
> a.equals(b) // this will return false
> {code}
> Question 1: Can the key in a MapType be of BinaryType?
> Question 2: Isn't there another way to handle BinaryType, using a Scala type 
> instead of Array?
> I want to contribute by fixing this issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13942) Remove Shark-related docs and visibility for 2.x

2016-03-19 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-13942:
-

 Summary: Remove Shark-related docs and visibility for 2.x
 Key: SPARK-13942
 URL: https://issues.apache.org/jira/browse/SPARK-13942
 Project: Spark
  Issue Type: Task
  Components: Documentation, Spark Core
Reporter: Dongjoon Hyun
Priority: Minor


`Shark` was merged into `Spark SQL` in [July 
2014|https://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html].

The following seems to be the only remaining legacy.

*Migration Guide*
{code:title=sql-programming-guide.md|borderStyle=solid}
- ## Migration Guide for Shark Users
- ...
- ### Scheduling
- ...
- ### Reducer number
- ...
- ### Caching
{code}

*SparkEnv visibility and comments*
{code:title=SparkEnv.scala|borderStyle=solid}
- *
- * NOTE: This is not intended for external use. This is exposed for Shark and 
may be made private
- *   in a future release.
  */
 @DeveloperApi
-class SparkEnv (
+private[spark] class SparkEnv (
{code}

For Spark 2.x, we should clean up those docs and comments in any case. 
However, changing the visibility of the `SparkEnv` class might be controversial. 

As a first attempt, this issue proposes to change both according to the 
note (`This is exposed for Shark`). During the review process, the change in 
visibility might be dropped.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14001) support multi-children Union in SQLBuilder

2016-03-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15201191#comment-15201191
 ] 

Apache Spark commented on SPARK-14001:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/11818

> support multi-children Union in SQLBuilder
> --
>
> Key: SPARK-14001
> URL: https://issues.apache.org/jira/browse/SPARK-14001
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13664) Simplify and Speedup HadoopFSRelation

2016-03-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13664:


Assignee: Michael Armbrust  (was: Apache Spark)

> Simplify and Speedup HadoopFSRelation
> -
>
> Key: SPARK-13664
> URL: https://issues.apache.org/jira/browse/SPARK-13664
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Blocker
> Fix For: 2.0.0
>
>
> A majority of Spark SQL queries likely run through {{HadoopFSRelation}}; 
> however, there are currently several complexity and performance problems with 
> this code path:
>  - The class mixes the concerns of file management, schema reconciliation, 
> scan building, bucketing, partitioning, and writing data.
>  - For very large tables, we are broadcasting the entire list of files to 
> every executor. [SPARK-11441]
>  - For partitioned tables, we always do an extra projection. This results 
> not only in a copy, but also undoes much of the performance gains that we 
> expect from vectorized reads.
> This is an umbrella ticket to track a set of improvements to this codepath.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


