[jira] [Assigned] (SPARK-8169) Add StopWordsRemover as a transformer
[ https://issues.apache.org/jira/browse/SPARK-8169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8169: --- Assignee: Apache Spark

Add StopWordsRemover as a transformer
-
Key: SPARK-8169
URL: https://issues.apache.org/jira/browse/SPARK-8169
Project: Spark
Issue Type: New Feature
Components: ML
Affects Versions: 1.5.0
Reporter: Xiangrui Meng
Assignee: Apache Spark

StopWordsRemover takes a string array column and outputs a string array column with all defined stop words removed. The transformer should also come with a standard set of stop words as the default.

{code}
val stopWords = new StopWordsRemover()
  .setInputCol("words")
  .setOutputCol("cleanWords")
  .setStopWords(Array(...)) // optional
val output = stopWords.transform(df)
{code}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
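A minimal sketch of the operation such a transformer would perform (illustrative only, not the actual implementation; the function name is an assumption): drop every token in the input array that appears in the stop-word set.

{code}
// Hypothetical core of a stop-word remover: filter a token sequence
// against a set of stop words.
def removeStopWords(words: Seq[String], stopWords: Set[String]): Seq[String] =
  words.filterNot(stopWords.contains)

// removeStopWords(Seq("the", "quick", "fox"), Set("the", "a"))
//   == Seq("quick", "fox")
{code}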
[jira] [Commented] (SPARK-8287) Filter not push down through Subquery or View
[ https://issues.apache.org/jira/browse/SPARK-8287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580423#comment-14580423 ] Li Sheng commented on SPARK-8287:

I'm working on this in a hacky way, and I don't think it is the proper way to solve this. First, I collect the filter condition and the attribute references of the Filter sitting on top of the Subquery or View. Then I match the attribute references of each MetastoreRelation against the collected ones; once a match is found, I remove the matched expression from the Filter on top of the Subquery and generate a new Filter on top of the MetastoreRelation inside the Subquery.

E.g.: `ds` is the attribute reference in the Filter on top of the Subquery or View. First I check whether the outputs of `src_partitioned1` and `src_partitioned2` contain the field `ds`; if they do, I push ds = '2' down on top of src_partitioned1 and src_partitioned2.

But I found that if the outer query's filter on top of the Subquery contains an Alias, things become more complex, because we don't know whether the alias is a real field in a MetastoreRelation or just an alias. E.g., if the Subquery names sum(1) as ds, then ds is an alias, and applying the process above is wrong, because `ds` conflicts: it exists both in src_partitioned1 and in the Subquery result, so it cannot simply be pushed down. BTW: Hive knows that `ds` is a partition key and does push it down. I'd be very pleased if you have good suggestions on how to solve this.

Filter not push down through Subquery or View
-
Key: SPARK-8287
URL: https://issues.apache.org/jira/browse/SPARK-8287
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.3.1
Reporter: Li Sheng
Fix For: 1.4.0
Original Estimate: 40h
Remaining Estimate: 40h

Filters are not pushed down through a Subquery or a View. If two big partitioned tables are joined inside a Subquery or a View and the filter is not pushed down, this causes a full-partition join and performance problems.
Let me give an example that reproduces the problem:

{code:sql}
create table src(key int, value string);

-- Creates partitioned table and imports data
CREATE TABLE src_partitioned1 (key int, value STRING) PARTITIONED BY (ds STRING);
insert overwrite table src_partitioned1 PARTITION (ds='1') select key, value from src;
insert overwrite table src_partitioned1 PARTITION (ds='2') select key, value from src;
CREATE TABLE src_partitioned2 (key int, value STRING) PARTITIONED BY (ds STRING);
insert overwrite table src_partitioned2 PARTITION (ds='1') select key, value from src;
insert overwrite table src_partitioned2 PARTITION (ds='2') select key, value from src;

-- Creates views
create view src_view as select sum(a.key) s1, sum(b.key) s2, b.ds ds from src_partitioned1 a join src_partitioned2 b on a.ds = b.ds group by b.ds;
create view src_view_1 as select sum(a.key) s1, sum(b.key) s2, b.ds my_ds from src_partitioned1 a join src_partitioned2 b on a.ds = b.ds group by b.ds;

-- QueryExecution
select * from dw.src_view where ds='2';
{code}

{noformat}
sql("select * from dw.src_view where ds='2'").queryExecution

== Parsed Logical Plan ==
'Project [*]
 'Filter ('ds = 2)
  'UnresolvedRelation [dw,src_view], None

== Analyzed Logical Plan ==
Project [s1#60L,s2#61L,ds#62]
 Filter (ds#62 = 2)
  Subquery src_view
   Aggregate [ds#66], [SUM(CAST(key#64, LongType)) AS s1#60L,SUM(CAST(key#67, LongType)) AS s2#61L,ds#66 AS ds#62]
    Join Inner, Some((ds#63 = ds#66))
     MetastoreRelation dw, src_partitioned1, Some(a)
     MetastoreRelation dw, src_partitioned2, Some(b)

== Optimized Logical Plan ==
Filter (ds#62 = 2)
 Aggregate [ds#66], [SUM(CAST(key#64, LongType)) AS s1#60L,SUM(CAST(key#67, LongType)) AS s2#61L,ds#66 AS ds#62]
  Project [ds#66,key#64,key#67]
   Join Inner, Some((ds#63 = ds#66))
    Project [key#64,ds#63]
     MetastoreRelation dw, s...
{noformat}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
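To illustrate the pushdown being discussed, here is a minimal Catalyst-style sketch, assuming the Spark 1.3-era {{Subquery(alias, child)}} logical operator: it pushes a Filter below a Subquery only when every attribute the condition references is produced unchanged by the child, and it deliberately does not handle the aliased-aggregate case described in the comment above.

{code}
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, Subquery}

// Sketch only: push a Filter inside a Subquery when the condition refers
// solely to attributes the child already outputs (no aliases involved).
object PushFilterThroughSubquery extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case Filter(condition, Subquery(alias, child))
        if condition.references.subsetOf(child.outputSet) =>
      Subquery(alias, Filter(condition, child))
  }
}
{code}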
[jira] [Updated] (SPARK-8287) Filter not push down through Subquery or View
[ https://issues.apache.org/jira/browse/SPARK-8287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-8287:
--
Description:
Filters are not pushed down through a Subquery or a View. If two big partitioned tables are joined inside a Subquery or a View and the filter is not pushed down, this causes a full-partition join and performance problems.

Let me give an example that reproduces the problem:

{code:sql}
create table src(key int, value string);

-- Creates partitioned table and imports data
CREATE TABLE src_partitioned1 (key int, value STRING) PARTITIONED BY (ds STRING);
insert overwrite table src_partitioned1 PARTITION (ds='1') select key, value from src;
insert overwrite table src_partitioned1 PARTITION (ds='2') select key, value from src;
CREATE TABLE src_partitioned2 (key int, value STRING) PARTITIONED BY (ds STRING);
insert overwrite table src_partitioned2 PARTITION (ds='1') select key, value from src;
insert overwrite table src_partitioned2 PARTITION (ds='2') select key, value from src;

-- Creates views
create view src_view as select sum(a.key) s1, sum(b.key) s2, b.ds ds from src_partitioned1 a join src_partitioned2 b on a.ds = b.ds group by b.ds;
create view src_view_1 as select sum(a.key) s1, sum(b.key) s2, b.ds my_ds from src_partitioned1 a join src_partitioned2 b on a.ds = b.ds group by b.ds;

-- QueryExecution
select * from dw.src_view where ds='2';
{code}

{noformat}
sql("select * from dw.src_view where ds='2'").queryExecution

== Parsed Logical Plan ==
'Project [*]
 'Filter ('ds = 2)
  'UnresolvedRelation [dw,src_view], None

== Analyzed Logical Plan ==
Project [s1#60L,s2#61L,ds#62]
 Filter (ds#62 = 2)
  Subquery src_view
   Aggregate [ds#66], [SUM(CAST(key#64, LongType)) AS s1#60L,SUM(CAST(key#67, LongType)) AS s2#61L,ds#66 AS ds#62]
    Join Inner, Some((ds#63 = ds#66))
     MetastoreRelation dw, src_partitioned1, Some(a)
     MetastoreRelation dw, src_partitioned2, Some(b)

== Optimized Logical Plan ==
Filter (ds#62 = 2)
 Aggregate [ds#66], [SUM(CAST(key#64, LongType)) AS s1#60L,SUM(CAST(key#67, LongType)) AS s2#61L,ds#66 AS ds#62]
  Project [ds#66,key#64,key#67]
   Join Inner, Some((ds#63 = ds#66))
    Project [key#64,ds#63]
     MetastoreRelation dw, s...
{noformat}

was:
Filters are not pushed down through a Subquery or a View. If two big partitioned tables are joined inside a Subquery or a View and the filter is not pushed down, this causes a full-partition join and performance problems.

Let me give an example that reproduces the problem:

create table src(key int, value string);
-- Creates partitioned table and imports data
CREATE TABLE src_partitioned1 (key int, value STRING) PARTITIONED BY (ds STRING);
insert overwrite table src_partitioned1 PARTITION (ds='1') select key, value from src;
insert overwrite table src_partitioned1 PARTITION (ds='2') select key, value from src;
CREATE TABLE src_partitioned2 (key int, value STRING) PARTITIONED BY (ds STRING);
insert overwrite table src_partitioned2 PARTITION (ds='1') select key, value from src;
insert overwrite table src_partitioned2 PARTITION (ds='2') select key, value from src;
-- Creates views
create view src_view as select sum(a.key) s1, sum(b.key) s2, b.ds ds from src_partitioned1 a join src_partitioned2 b on a.ds = b.ds group by b.ds
create view src_view_1 as select sum(a.key) s1, sum(b.key) s2, b.ds my_ds from src_partitioned1 a join src_partitioned2 b on a.ds = b.ds group by b.ds
-- QueryExecution
select * from dw.src_view where ds='2'

sql("select * from dw.src_view where ds='2'").queryExecution

== Parsed Logical Plan ==
'Project [*]
 'Filter ('ds = 2)
  'UnresolvedRelation [dw,src_view], None

== Analyzed Logical Plan ==
Project [s1#60L,s2#61L,ds#62]
 Filter (ds#62 = 2)
  Subquery src_view
   Aggregate [ds#66], [SUM(CAST(key#64, LongType)) AS s1#60L,SUM(CAST(key#67, LongType)) AS s2#61L,ds#66 AS ds#62]
    Join Inner, Some((ds#63 = ds#66))
     MetastoreRelation dw, src_partitioned1, Some(a)
     MetastoreRelation dw, src_partitioned2, Some(b)

== Optimized Logical Plan ==
Filter (ds#62 = 2)
 Aggregate [ds#66], [SUM(CAST(key#64, LongType)) AS s1#60L,SUM(CAST(key#67, LongType)) AS s2#61L,ds#66 AS ds#62]
  Project [ds#66,key#64,key#67]
   Join Inner, Some((ds#63 = ds#66))
    Project [key#64,ds#63]
     MetastoreRelation dw, s...

Filter not push down through Subquery or View
-
Key: SPARK-8287
URL: https://issues.apache.org/jira/browse/SPARK-8287
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.3.1
Reporter: Li Sheng
Fix For: 1.4.0
Original Estimate: 40h
Remaining Estimate: 40h

Filters are not pushed down through a Subquery or a View. If two big partitioned tables are joined inside a Subquery or a View and the filter is not pushed down, this will cause a full partition join and will cause
[jira] [Commented] (SPARK-8048) Explicit partitioning of an RDD with 0 partitions will yield empty outer join
[ https://issues.apache.org/jira/browse/SPARK-8048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580395#comment-14580395 ] Josiah Samuel Sathiadass commented on SPARK-8048:

Often, constructing a HashPartitioner with 0 is not intentional. So instead of falling through, consider making the call fail with an exception.

Explicit partitioning of an RDD with 0 partitions will yield empty outer join
-
Key: SPARK-8048
URL: https://issues.apache.org/jira/browse/SPARK-8048
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.3.1
Reporter: Olivier Toupin
Priority: Minor

Check this code: https://gist.github.com/anonymous/0f935915f2bc182841f0

Because of this: {{.partitionBy(new HashPartitioner(0))}}, the join returns an empty result. The expected behaviour here would be for the join to crash, raise an error, or return unjoined results, but instead it yields an empty RDD. This is a trivial example, but imagine {{.partitionBy(new HashPartitioner(previous.partitions.length))}}: you join on an empty previous RDD, the lookup table is empty, and Spark loses all your results instead of returning unjoined results, and this without warnings or errors.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
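For reference, a self-contained reconstruction of the reported behaviour (my sketch following the description above; the linked gist remains the authoritative repro):

{code}
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Partitioning by a HashPartitioner with 0 partitions yields a
// zero-partition RDD, so the subsequent outer join silently returns no rows.
object Spark8048Repro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("SPARK-8048").setMaster("local[2]"))
    val left  = sc.parallelize(Seq(1 -> "a", 2 -> "b"))
    val right = sc.parallelize(Seq(1 -> "x"))
    val joined = left.partitionBy(new HashPartitioner(0)).leftOuterJoin(right)
    // Prints List(): the unjoined left rows are lost without any warning.
    println(joined.collect().toList)
    sc.stop()
  }
}
{code}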
[jira] [Comment Edited] (SPARK-8287) Filter not push down through Subquery or View
[ https://issues.apache.org/jira/browse/SPARK-8287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580403#comment-14580403 ] Li Sheng edited comment on SPARK-8287 at 6/10/15 11:16 AM: --- Sorry [~lian cheng] , see this: scala sql(select * from dw.src_view where ds='2' ).queryExecution.optimizedPlan 15/06/10 19:15:39 INFO ParseDriver: Parsing command: select * from dw.src_view where ds='2' 15/06/10 19:15:39 INFO ParseDriver: Parse Completed 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_view 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_view 15/06/10 19:15:39 INFO ParseDriver: Parsing command: select sum(`a`.`key`) `s1`, sum(`b`.`key`) `s2`, `b`.`ds` `ds` from `dw`.`src_partitioned1` `a` join `dw`.`src_partitioned2` `b` on `a`.`ds` = `b`.`ds` group by `b`.`ds` 15/06/10 19:15:39 INFO ParseDriver: Parse Completed 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_partitioned1 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_partitioned1 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_partitions : db=dw tbl=src_partitioned1 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_partitions : db=dw tbl=src_partitioned1 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_partitioned2 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_partitioned2 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_partitions : db=dw tbl=src_partitioned2 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_partitions : db=dw tbl=src_partitioned2 res257: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = Filter (ds#936 = 2) Aggregate [ds#940], [SUM(CAST(key#938, LongType)) AS s1#934L,SUM(CAST(key#941, LongType)) AS s2#935L,ds#940 AS ds#936] Project [ds#940,key#938,key#941] Join Inner, Some((ds#937 = ds#940)) Project [key#938,ds#937] MetastoreRelation dw, src_partitioned1, Some(a) Project [ds#940,key#941] MetastoreRelation dw, src_partitioned2, Some(b) Another one is for subquery , also can not push down the filter: scala sql(select * from dw.src_view_1 where my_ds='2' ).queryExecution.optimizedPlan 15/06/10 19:08:59 INFO ParseDriver: Parsing command: select * from dw.src_view_1 where my_ds='2' 15/06/10 19:08:59 INFO ParseDriver: Parse Completed 15/06/10 19:08:59 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_view_1 15/06/10 19:08:59 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_view_1 15/06/10 19:08:59 INFO ParseDriver: Parsing command: select sum(`a`.`key`) `s1`, sum(`b`.`key`) `s2`, `b`.`ds` `my_ds` from `dw`.`src_partitioned1` `a` join `dw`.`src_partitioned2` `b` on `a`.`ds` = `b`.`ds` group by `b`.`ds` 15/06/10 19:08:59 INFO ParseDriver: Parse Completed 15/06/10 19:08:59 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_partitioned1 15/06/10 19:08:59 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_partitioned1 15/06/10 19:08:59 INFO HiveMetaStore: 0: get_partitions : db=dw tbl=src_partitioned1 15/06/10 19:08:59 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_partitions : db=dw tbl=src_partitioned1 15/06/10 19:08:59 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_partitioned2 15/06/10 19:08:59 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_partitioned2 15/06/10 19:08:59 INFO HiveMetaStore: 0: get_partitions : db=dw tbl=src_partitioned2 15/06/10 19:08:59 INFO audit: ugi=shengli 
ip=unknown-ip-addr cmd=get_partitions : db=dw tbl=src_partitioned2 res254: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = Filter (my_ds#918 = 2) Aggregate [ds#922], [SUM(CAST(key#920, LongType)) AS s1#916L,SUM(CAST(key#923, LongType)) AS s2#917L,ds#922 AS my_ds#918] Project [ds#922,key#920,key#923] Join Inner, Some((ds#919 = ds#922)) Project [key#920,ds#919] MetastoreRelation dw, src_partitioned1, Some(a) Project [ds#922,key#923] MetastoreRelation dw, src_partitioned2, Some(b) was (Author: oopsoutofmemory): Sorry [~lian cheng] , see this: scala sql(select * from dw.src_view_1 where my_ds='2' ).queryExecution.optimizedPlan 15/06/10 19:08:59 INFO ParseDriver: Parsing command: select * from dw.src_view_1 where my_ds='2' 15/06/10 19:08:59 INFO ParseDriver: Parse Completed 15/06/10 19:08:59 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_view_1 15/06/10 19:08:59 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_view_1 15/06/10 19:08:59 INFO ParseDriver: Parsing command: select sum(`a`.`key`) `s1`, sum(`b`.`key`) `s2`, `b`.`ds` `my_ds` from `dw`.`src_partitioned1` `a` join `dw`.`src_partitioned2`
[jira] [Commented] (SPARK-6797) Add support for YARN cluster mode
[ https://issues.apache.org/jira/browse/SPARK-6797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580417#comment-14580417 ] Apache Spark commented on SPARK-6797: - User 'sun-rui' has created a pull request for this issue: https://github.com/apache/spark/pull/6743 Add support for YARN cluster mode - Key: SPARK-6797 URL: https://issues.apache.org/jira/browse/SPARK-6797 Project: Spark Issue Type: Improvement Components: SparkR Reporter: Shivaram Venkataraman Assignee: Sun Rui Priority: Critical SparkR currently does not work in YARN cluster mode as the R package is not shipped along with the assembly jar to the YARN AM. We could try to use the support for archives in YARN to send out the R package as a zip file. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6797) Add support for YARN cluster mode
[ https://issues.apache.org/jira/browse/SPARK-6797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6797: --- Assignee: Sun Rui (was: Apache Spark) Add support for YARN cluster mode - Key: SPARK-6797 URL: https://issues.apache.org/jira/browse/SPARK-6797 Project: Spark Issue Type: Improvement Components: SparkR Reporter: Shivaram Venkataraman Assignee: Sun Rui Priority: Critical SparkR currently does not work in YARN cluster mode as the R package is not shipped along with the assembly jar to the YARN AM. We could try to use the support for archives in YARN to send out the R package as a zip file. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6797) Add support for YARN cluster mode
[ https://issues.apache.org/jira/browse/SPARK-6797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6797: --- Assignee: Apache Spark (was: Sun Rui) Add support for YARN cluster mode - Key: SPARK-6797 URL: https://issues.apache.org/jira/browse/SPARK-6797 Project: Spark Issue Type: Improvement Components: SparkR Reporter: Shivaram Venkataraman Assignee: Apache Spark Priority: Critical SparkR currently does not work in YARN cluster mode as the R package is not shipped along with the assembly jar to the YARN AM. We could try to use the support for archives in YARN to send out the R package as a zip file. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8287) Filter not push down through Subquery or View
[ https://issues.apache.org/jira/browse/SPARK-8287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580390#comment-14580390 ] Cheng Lian commented on SPARK-8287:

[~OopsOutOfMemory] Could you please provide the output of

{code}
println(sql("select * from dw.src_view where ds='2'").queryExecution)
{code}

instead of

{code}
sql("select * from dw.src_view where ds='2'").queryExecution
{code}

The latter gets stripped and doesn't show the whole contents.

Filter not push down through Subquery or View
-
Key: SPARK-8287
URL: https://issues.apache.org/jira/browse/SPARK-8287
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.3.1
Reporter: Li Sheng
Fix For: 1.4.0
Original Estimate: 40h
Remaining Estimate: 40h

Filters are not pushed down through a Subquery or a View. If two big partitioned tables are joined inside a Subquery or a View and the filter is not pushed down, this causes a full-partition join and performance problems.

Let me give an example that reproduces the problem:

{code:sql}
create table src(key int, value string);

-- Creates partitioned table and imports data
CREATE TABLE src_partitioned1 (key int, value STRING) PARTITIONED BY (ds STRING);
insert overwrite table src_partitioned1 PARTITION (ds='1') select key, value from src;
insert overwrite table src_partitioned1 PARTITION (ds='2') select key, value from src;
CREATE TABLE src_partitioned2 (key int, value STRING) PARTITIONED BY (ds STRING);
insert overwrite table src_partitioned2 PARTITION (ds='1') select key, value from src;
insert overwrite table src_partitioned2 PARTITION (ds='2') select key, value from src;

-- Creates views
create view src_view as select sum(a.key) s1, sum(b.key) s2, b.ds ds from src_partitioned1 a join src_partitioned2 b on a.ds = b.ds group by b.ds;
create view src_view_1 as select sum(a.key) s1, sum(b.key) s2, b.ds my_ds from src_partitioned1 a join src_partitioned2 b on a.ds = b.ds group by b.ds;

-- QueryExecution
select * from dw.src_view where ds='2';
{code}

{noformat}
sql("select * from dw.src_view where ds='2'").queryExecution

== Parsed Logical Plan ==
'Project [*]
 'Filter ('ds = 2)
  'UnresolvedRelation [dw,src_view], None

== Analyzed Logical Plan ==
Project [s1#60L,s2#61L,ds#62]
 Filter (ds#62 = 2)
  Subquery src_view
   Aggregate [ds#66], [SUM(CAST(key#64, LongType)) AS s1#60L,SUM(CAST(key#67, LongType)) AS s2#61L,ds#66 AS ds#62]
    Join Inner, Some((ds#63 = ds#66))
     MetastoreRelation dw, src_partitioned1, Some(a)
     MetastoreRelation dw, src_partitioned2, Some(b)

== Optimized Logical Plan ==
Filter (ds#62 = 2)
 Aggregate [ds#66], [SUM(CAST(key#64, LongType)) AS s1#60L,SUM(CAST(key#67, LongType)) AS s2#61L,ds#66 AS ds#62]
  Project [ds#66,key#64,key#67]
   Join Inner, Some((ds#63 = ds#66))
    Project [key#64,ds#63]
     MetastoreRelation dw, s...
{noformat}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-8287) Filter not push down through Subquery or View
[ https://issues.apache.org/jira/browse/SPARK-8287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580403#comment-14580403 ] Li Sheng edited comment on SPARK-8287 at 6/10/15 11:11 AM: --- Sorry [~lian cheng] , see this: scala sql(select * from dw.src_view_1 where my_ds='2' ).queryExecution.optimizedPlan 15/06/10 19:08:59 INFO ParseDriver: Parsing command: select * from dw.src_view_1 where my_ds='2' 15/06/10 19:08:59 INFO ParseDriver: Parse Completed 15/06/10 19:08:59 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_view_1 15/06/10 19:08:59 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_view_1 15/06/10 19:08:59 INFO ParseDriver: Parsing command: select sum(`a`.`key`) `s1`, sum(`b`.`key`) `s2`, `b`.`ds` `my_ds` from `dw`.`src_partitioned1` `a` join `dw`.`src_partitioned2` `b` on `a`.`ds` = `b`.`ds` group by `b`.`ds` 15/06/10 19:08:59 INFO ParseDriver: Parse Completed 15/06/10 19:08:59 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_partitioned1 15/06/10 19:08:59 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_partitioned1 15/06/10 19:08:59 INFO HiveMetaStore: 0: get_partitions : db=dw tbl=src_partitioned1 15/06/10 19:08:59 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_partitions : db=dw tbl=src_partitioned1 15/06/10 19:08:59 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_partitioned2 15/06/10 19:08:59 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_partitioned2 15/06/10 19:08:59 INFO HiveMetaStore: 0: get_partitions : db=dw tbl=src_partitioned2 15/06/10 19:08:59 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_partitions : db=dw tbl=src_partitioned2 res254: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = Filter (my_ds#918 = 2) Aggregate [ds#922], [SUM(CAST(key#920, LongType)) AS s1#916L,SUM(CAST(key#923, LongType)) AS s2#917L,ds#922 AS my_ds#918] Project [ds#922,key#920,key#923] Join Inner, Some((ds#919 = ds#922)) Project [key#920,ds#919] MetastoreRelation dw, src_partitioned1, Some(a) Project [ds#922,key#923] MetastoreRelation dw, src_partitioned2, Some(b) was (Author: oopsoutofmemory): Sorry , see this: scala sql(select * from dw.src_view_1 where my_ds='2' ).queryExecution.optimizedPlan 15/06/10 19:08:59 INFO ParseDriver: Parsing command: select * from dw.src_view_1 where my_ds='2' 15/06/10 19:08:59 INFO ParseDriver: Parse Completed 15/06/10 19:08:59 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_view_1 15/06/10 19:08:59 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_view_1 15/06/10 19:08:59 INFO ParseDriver: Parsing command: select sum(`a`.`key`) `s1`, sum(`b`.`key`) `s2`, `b`.`ds` `my_ds` from `dw`.`src_partitioned1` `a` join `dw`.`src_partitioned2` `b` on `a`.`ds` = `b`.`ds` group by `b`.`ds` 15/06/10 19:08:59 INFO ParseDriver: Parse Completed 15/06/10 19:08:59 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_partitioned1 15/06/10 19:08:59 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_partitioned1 15/06/10 19:08:59 INFO HiveMetaStore: 0: get_partitions : db=dw tbl=src_partitioned1 15/06/10 19:08:59 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_partitions : db=dw tbl=src_partitioned1 15/06/10 19:08:59 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_partitioned2 15/06/10 19:08:59 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_partitioned2 15/06/10 19:08:59 INFO HiveMetaStore: 0: get_partitions : db=dw tbl=src_partitioned2 15/06/10 19:08:59 INFO audit: ugi=shengli 
ip=unknown-ip-addr cmd=get_partitions : db=dw tbl=src_partitioned2 res254: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = Filter (my_ds#918 = 2) Aggregate [ds#922], [SUM(CAST(key#920, LongType)) AS s1#916L,SUM(CAST(key#923, LongType)) AS s2#917L,ds#922 AS my_ds#918] Project [ds#922,key#920,key#923] Join Inner, Some((ds#919 = ds#922)) Project [key#920,ds#919] MetastoreRelation dw, src_partitioned1, Some(a) Project [ds#922,key#923] MetastoreRelation dw, src_partitioned2, Some(b) Filter not push down through Subquery or View - Key: SPARK-8287 URL: https://issues.apache.org/jira/browse/SPARK-8287 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1 Reporter: Li Sheng Fix For: 1.4.0 Original Estimate: 40h Remaining Estimate: 40h Filter not push down through Subquery or View. Assume we have two big partitioned table join inner a Subquery or a View and filter not push down, this will cause a full partition join and will cause performance issues. Let me give and
[jira] [Comment Edited] (SPARK-8287) Filter not push down through Subquery or View
[ https://issues.apache.org/jira/browse/SPARK-8287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580403#comment-14580403 ] Li Sheng edited comment on SPARK-8287 at 6/10/15 11:20 AM: --- Sorry [~lian cheng] , see this: scala sql(select * from dw.src_view where ds='2' ).queryExecution.analyzed res258: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = Project [s1#943L,s2#944L,ds#945] Filter (ds#945 = 2) Subquery src_view Aggregate [ds#949], [SUM(CAST(key#947, LongType)) AS s1#943L,SUM(CAST(key#950, LongType)) AS s2#944L,ds#949 AS ds#945] Join Inner, Some((ds#946 = ds#949)) MetastoreRelation dw, src_partitioned1, Some(a) MetastoreRelation dw, src_partitioned2, Some(b) scala sql(select * from dw.src_view where ds='2' ).queryExecution.optimizedPlan 15/06/10 19:15:39 INFO ParseDriver: Parsing command: select * from dw.src_view where ds='2' 15/06/10 19:15:39 INFO ParseDriver: Parse Completed 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_view 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_view 15/06/10 19:15:39 INFO ParseDriver: Parsing command: select sum(`a`.`key`) `s1`, sum(`b`.`key`) `s2`, `b`.`ds` `ds` from `dw`.`src_partitioned1` `a` join `dw`.`src_partitioned2` `b` on `a`.`ds` = `b`.`ds` group by `b`.`ds` 15/06/10 19:15:39 INFO ParseDriver: Parse Completed 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_partitioned1 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_partitioned1 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_partitions : db=dw tbl=src_partitioned1 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_partitions : db=dw tbl=src_partitioned1 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_partitioned2 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_partitioned2 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_partitions : db=dw tbl=src_partitioned2 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_partitions : db=dw tbl=src_partitioned2 res257: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = Filter (ds#936 = 2) Aggregate [ds#940], [SUM(CAST(key#938, LongType)) AS s1#934L,SUM(CAST(key#941, LongType)) AS s2#935L,ds#940 AS ds#936] Project [ds#940,key#938,key#941] Join Inner, Some((ds#937 = ds#940)) Project [key#938,ds#937] MetastoreRelation dw, src_partitioned1, Some(a) Project [ds#940,key#941] MetastoreRelation dw, src_partitioned2, Some(b) scala sql(select * from dw.src_view where ds='2' ).queryExecution.executedPlan res259: org.apache.spark.sql.execution.SparkPlan = Filter (ds#954 = 2) Aggregate false, [ds#958], [SUM(PartialSum#963L) AS s1#952L,SUM(PartialSum#964L) AS s2#953L,ds#958 AS ds#954] Exchange (HashPartitioning [ds#958], 200) Aggregate true, [ds#958], [ds#958,SUM(CAST(key#956, LongType)) AS PartialSum#963L,SUM(CAST(key#959, LongType)) AS PartialSum#964L] Project [ds#958,key#956,key#959] ShuffledHashJoin [ds#955], [ds#958], BuildRight Exchange (HashPartitioning [ds#955], 200) HiveTableScan [key#956,ds#955], (MetastoreRelation dw, src_partitioned1, Some(a)), None Exchange (HashPartitioning [ds#958], 200) HiveTableScan [ds#958,key#959], (MetastoreRelation dw, src_partitioned2, Some(b)), None was (Author: oopsoutofmemory): Sorry [~lian cheng] , see this: scala sql(select * from dw.src_view where ds='2' ).queryExecution.analyzed res258: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = Project 
[s1#943L,s2#944L,ds#945] Filter (ds#945 = 2) Subquery src_view Aggregate [ds#949], [SUM(CAST(key#947, LongType)) AS s1#943L,SUM(CAST(key#950, LongType)) AS s2#944L,ds#949 AS ds#945] Join Inner, Some((ds#946 = ds#949)) MetastoreRelation dw, src_partitioned1, Some(a) MetastoreRelation dw, src_partitioned2, Some(b) scala sql(select * from dw.src_view where ds='2' ).queryExecution.optimizedPlan 15/06/10 19:15:39 INFO ParseDriver: Parsing command: select * from dw.src_view where ds='2' 15/06/10 19:15:39 INFO ParseDriver: Parse Completed 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_view 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_view 15/06/10 19:15:39 INFO ParseDriver: Parsing command: select sum(`a`.`key`) `s1`, sum(`b`.`key`) `s2`, `b`.`ds` `ds` from `dw`.`src_partitioned1` `a` join `dw`.`src_partitioned2` `b` on `a`.`ds` = `b`.`ds` group by `b`.`ds` 15/06/10 19:15:39 INFO ParseDriver: Parse Completed 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_partitioned1 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table :
[jira] [Comment Edited] (SPARK-8287) Filter not push down through Subquery or View
[ https://issues.apache.org/jira/browse/SPARK-8287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580403#comment-14580403 ] Li Sheng edited comment on SPARK-8287 at 6/10/15 11:19 AM: --- Sorry [~lian cheng] , see this: scala sql(select * from dw.src_view where ds='2' ).queryExecution.analyzed res258: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = Project [s1#943L,s2#944L,ds#945] Filter (ds#945 = 2) Subquery src_view Aggregate [ds#949], [SUM(CAST(key#947, LongType)) AS s1#943L,SUM(CAST(key#950, LongType)) AS s2#944L,ds#949 AS ds#945] Join Inner, Some((ds#946 = ds#949)) MetastoreRelation dw, src_partitioned1, Some(a) MetastoreRelation dw, src_partitioned2, Some(b) scala sql(select * from dw.src_view where ds='2' ).queryExecution.optimizedPlan 15/06/10 19:15:39 INFO ParseDriver: Parsing command: select * from dw.src_view where ds='2' 15/06/10 19:15:39 INFO ParseDriver: Parse Completed 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_view 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_view 15/06/10 19:15:39 INFO ParseDriver: Parsing command: select sum(`a`.`key`) `s1`, sum(`b`.`key`) `s2`, `b`.`ds` `ds` from `dw`.`src_partitioned1` `a` join `dw`.`src_partitioned2` `b` on `a`.`ds` = `b`.`ds` group by `b`.`ds` 15/06/10 19:15:39 INFO ParseDriver: Parse Completed 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_partitioned1 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_partitioned1 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_partitions : db=dw tbl=src_partitioned1 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_partitions : db=dw tbl=src_partitioned1 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_partitioned2 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_partitioned2 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_partitions : db=dw tbl=src_partitioned2 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_partitions : db=dw tbl=src_partitioned2 res257: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = Filter (ds#936 = 2) Aggregate [ds#940], [SUM(CAST(key#938, LongType)) AS s1#934L,SUM(CAST(key#941, LongType)) AS s2#935L,ds#940 AS ds#936] Project [ds#940,key#938,key#941] Join Inner, Some((ds#937 = ds#940)) Project [key#938,ds#937] MetastoreRelation dw, src_partitioned1, Some(a) Project [ds#940,key#941] MetastoreRelation dw, src_partitioned2, Some(b) was (Author: oopsoutofmemory): Sorry [~lian cheng] , see this: scala sql(select * from dw.src_view where ds='2' ).queryExecution.optimizedPlan 15/06/10 19:15:39 INFO ParseDriver: Parsing command: select * from dw.src_view where ds='2' 15/06/10 19:15:39 INFO ParseDriver: Parse Completed 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_view 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_view 15/06/10 19:15:39 INFO ParseDriver: Parsing command: select sum(`a`.`key`) `s1`, sum(`b`.`key`) `s2`, `b`.`ds` `ds` from `dw`.`src_partitioned1` `a` join `dw`.`src_partitioned2` `b` on `a`.`ds` = `b`.`ds` group by `b`.`ds` 15/06/10 19:15:39 INFO ParseDriver: Parse Completed 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_partitioned1 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_partitioned1 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_partitions : db=dw 
tbl=src_partitioned1 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_partitions : db=dw tbl=src_partitioned1 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_partitioned2 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_partitioned2 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_partitions : db=dw tbl=src_partitioned2 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_partitions : db=dw tbl=src_partitioned2 res257: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = Filter (ds#936 = 2) Aggregate [ds#940], [SUM(CAST(key#938, LongType)) AS s1#934L,SUM(CAST(key#941, LongType)) AS s2#935L,ds#940 AS ds#936] Project [ds#940,key#938,key#941] Join Inner, Some((ds#937 = ds#940)) Project [key#938,ds#937] MetastoreRelation dw, src_partitioned1, Some(a) Project [ds#940,key#941] MetastoreRelation dw, src_partitioned2, Some(b) Filter not push down through Subquery or View - Key: SPARK-8287 URL: https://issues.apache.org/jira/browse/SPARK-8287
[jira] [Assigned] (SPARK-8169) Add StopWordsRemover as a transformer
[ https://issues.apache.org/jira/browse/SPARK-8169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8169: --- Assignee: (was: Apache Spark)

Add StopWordsRemover as a transformer
-
Key: SPARK-8169
URL: https://issues.apache.org/jira/browse/SPARK-8169
Project: Spark
Issue Type: New Feature
Components: ML
Affects Versions: 1.5.0
Reporter: Xiangrui Meng

StopWordsRemover takes a string array column and outputs a string array column with all defined stop words removed. The transformer should also come with a standard set of stop words as the default.

{code}
val stopWords = new StopWordsRemover()
  .setInputCol("words")
  .setOutputCol("cleanWords")
  .setStopWords(Array(...)) // optional
val output = stopWords.transform(df)
{code}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8169) Add StopWordsRemover as a transformer
[ https://issues.apache.org/jira/browse/SPARK-8169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580399#comment-14580399 ] Apache Spark commented on SPARK-8169:

User 'hhbyyh' has created a pull request for this issue: https://github.com/apache/spark/pull/6742

Add StopWordsRemover as a transformer
-
Key: SPARK-8169
URL: https://issues.apache.org/jira/browse/SPARK-8169
Project: Spark
Issue Type: New Feature
Components: ML
Affects Versions: 1.5.0
Reporter: Xiangrui Meng

StopWordsRemover takes a string array column and outputs a string array column with all defined stop words removed. The transformer should also come with a standard set of stop words as the default.

{code}
val stopWords = new StopWordsRemover()
  .setInputCol("words")
  .setOutputCol("cleanWords")
  .setStopWords(Array(...)) // optional
val output = stopWords.transform(df)
{code}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8287) Filter not push down through Subquery or View
[ https://issues.apache.org/jira/browse/SPARK-8287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580403#comment-14580403 ] Li Sheng commented on SPARK-8287: - Sorry , see this: scala sql(select * from dw.src_view_1 where my_ds='2' ).queryExecution.optimizedPlan 15/06/10 19:08:59 INFO ParseDriver: Parsing command: select * from dw.src_view_1 where my_ds='2' 15/06/10 19:08:59 INFO ParseDriver: Parse Completed 15/06/10 19:08:59 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_view_1 15/06/10 19:08:59 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_view_1 15/06/10 19:08:59 INFO ParseDriver: Parsing command: select sum(`a`.`key`) `s1`, sum(`b`.`key`) `s2`, `b`.`ds` `my_ds` from `dw`.`src_partitioned1` `a` join `dw`.`src_partitioned2` `b` on `a`.`ds` = `b`.`ds` group by `b`.`ds` 15/06/10 19:08:59 INFO ParseDriver: Parse Completed 15/06/10 19:08:59 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_partitioned1 15/06/10 19:08:59 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_partitioned1 15/06/10 19:08:59 INFO HiveMetaStore: 0: get_partitions : db=dw tbl=src_partitioned1 15/06/10 19:08:59 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_partitions : db=dw tbl=src_partitioned1 15/06/10 19:08:59 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_partitioned2 15/06/10 19:08:59 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_partitioned2 15/06/10 19:08:59 INFO HiveMetaStore: 0: get_partitions : db=dw tbl=src_partitioned2 15/06/10 19:08:59 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_partitions : db=dw tbl=src_partitioned2 res254: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = Filter (my_ds#918 = 2) Aggregate [ds#922], [SUM(CAST(key#920, LongType)) AS s1#916L,SUM(CAST(key#923, LongType)) AS s2#917L,ds#922 AS my_ds#918] Project [ds#922,key#920,key#923] Join Inner, Some((ds#919 = ds#922)) Project [key#920,ds#919] MetastoreRelation dw, src_partitioned1, Some(a) Project [ds#922,key#923] MetastoreRelation dw, src_partitioned2, Some(b) Filter not push down through Subquery or View - Key: SPARK-8287 URL: https://issues.apache.org/jira/browse/SPARK-8287 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1 Reporter: Li Sheng Fix For: 1.4.0 Original Estimate: 40h Remaining Estimate: 40h Filter not push down through Subquery or View. Assume we have two big partitioned table join inner a Subquery or a View and filter not push down, this will cause a full partition join and will cause performance issues. 
Let me give and example that can reproduce the problem: {code:sql} create table src(key int, value string); -- Creates partitioned table and imports data CREATE TABLE src_partitioned1 (key int, value STRING) PARTITIONED BY (ds STRING); insert overwrite table src_partitioned1 PARTITION (ds='1') select key, value from src; insert overwrite table src_partitioned1 PARTITION (ds='2') select key, value from src; CREATE TABLE src_partitioned2 (key int, value STRING) PARTITIONED BY (ds STRING); insert overwrite table src_partitioned2 PARTITION (ds='1') select key, value from src; insert overwrite table src_partitioned2 PARTITION (ds='2') select key, value from src; -- Creates views create view src_view as select sum(a.key) s1, sum(b.key) s2, b.ds ds from src_partitioned1 a join src_partitioned2 b on a.ds = b.ds group by b.ds create view src_view_1 as select sum(a.key) s1, sum(b.key) s2, b.ds my_ds from src_partitioned1 a join src_partitioned2 b on a.ds = b.ds group by b.ds -- QueryExecution select * from dw.src_view where ds='2' {code} {noformat} sql(select * from dw.src_view where ds='2' ).queryExecution == Parsed Logical Plan == 'Project [*] 'Filter ('ds = 2) 'UnresolvedRelation [dw,src_view], None == Analyzed Logical Plan == Project [s1#60L,s2#61L,ds#62] Filter (ds#62 = 2) Subquery src_view Aggregate [ds#66], [SUM(CAST(key#64, LongType)) AS s1#60L,SUM(CAST(key#67, LongType)) AS s2#61L,ds#66 AS ds#62] Join Inner, Some((ds#63 = ds#66)) MetastoreRelation dw, src_partitioned1, Some(a) MetastoreRelation dw, src_partitioned2, Some(b) == Optimized Logical Plan == Filter (ds#62 = 2) Aggregate [ds#66], [SUM(CAST(key#64, LongType)) AS s1#60L,SUM(CAST(key#67, LongType)) AS s2#61L,ds#66 AS ds#62] Project [ds#66,key#64,key#67] Join Inner, Some((ds#63 = ds#66)) Project [key#64,ds#63] MetastoreRelation dw, s... {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail:
[jira] [Comment Edited] (SPARK-8287) Filter not push down through Subquery or View
[ https://issues.apache.org/jira/browse/SPARK-8287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580403#comment-14580403 ] Li Sheng edited comment on SPARK-8287 at 6/10/15 11:17 AM: --- Sorry [~lian cheng] , see this: scala sql(select * from dw.src_view where ds='2' ).queryExecution.optimizedPlan 15/06/10 19:15:39 INFO ParseDriver: Parsing command: select * from dw.src_view where ds='2' 15/06/10 19:15:39 INFO ParseDriver: Parse Completed 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_view 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_view 15/06/10 19:15:39 INFO ParseDriver: Parsing command: select sum(`a`.`key`) `s1`, sum(`b`.`key`) `s2`, `b`.`ds` `ds` from `dw`.`src_partitioned1` `a` join `dw`.`src_partitioned2` `b` on `a`.`ds` = `b`.`ds` group by `b`.`ds` 15/06/10 19:15:39 INFO ParseDriver: Parse Completed 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_partitioned1 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_partitioned1 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_partitions : db=dw tbl=src_partitioned1 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_partitions : db=dw tbl=src_partitioned1 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_partitioned2 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_partitioned2 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_partitions : db=dw tbl=src_partitioned2 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_partitions : db=dw tbl=src_partitioned2 res257: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = Filter (ds#936 = 2) Aggregate [ds#940], [SUM(CAST(key#938, LongType)) AS s1#934L,SUM(CAST(key#941, LongType)) AS s2#935L,ds#940 AS ds#936] Project [ds#940,key#938,key#941] Join Inner, Some((ds#937 = ds#940)) Project [key#938,ds#937] MetastoreRelation dw, src_partitioned1, Some(a) Project [ds#940,key#941] MetastoreRelation dw, src_partitioned2, Some(b) was (Author: oopsoutofmemory): Sorry [~lian cheng] , see this: scala sql(select * from dw.src_view where ds='2' ).queryExecution.optimizedPlan 15/06/10 19:15:39 INFO ParseDriver: Parsing command: select * from dw.src_view where ds='2' 15/06/10 19:15:39 INFO ParseDriver: Parse Completed 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_view 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_view 15/06/10 19:15:39 INFO ParseDriver: Parsing command: select sum(`a`.`key`) `s1`, sum(`b`.`key`) `s2`, `b`.`ds` `ds` from `dw`.`src_partitioned1` `a` join `dw`.`src_partitioned2` `b` on `a`.`ds` = `b`.`ds` group by `b`.`ds` 15/06/10 19:15:39 INFO ParseDriver: Parse Completed 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_partitioned1 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_partitioned1 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_partitions : db=dw tbl=src_partitioned1 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_partitions : db=dw tbl=src_partitioned1 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_partitioned2 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_partitioned2 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_partitions : db=dw tbl=src_partitioned2 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr 
cmd=get_partitions : db=dw tbl=src_partitioned2 res257: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = Filter (ds#936 = 2) Aggregate [ds#940], [SUM(CAST(key#938, LongType)) AS s1#934L,SUM(CAST(key#941, LongType)) AS s2#935L,ds#940 AS ds#936] Project [ds#940,key#938,key#941] Join Inner, Some((ds#937 = ds#940)) Project [key#938,ds#937] MetastoreRelation dw, src_partitioned1, Some(a) Project [ds#940,key#941] MetastoreRelation dw, src_partitioned2, Some(b) Another one is for subquery , also can not push down the filter: scala sql(select * from dw.src_view_1 where my_ds='2' ).queryExecution.optimizedPlan 15/06/10 19:08:59 INFO ParseDriver: Parsing command: select * from dw.src_view_1 where my_ds='2' 15/06/10 19:08:59 INFO ParseDriver: Parse Completed 15/06/10 19:08:59 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_view_1 15/06/10 19:08:59 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_view_1 15/06/10 19:08:59 INFO ParseDriver: Parsing command: select sum(`a`.`key`) `s1`, sum(`b`.`key`) `s2`, `b`.`ds` `my_ds` from `dw`.`src_partitioned1` `a` join `dw`.`src_partitioned2` `b` on `a`.`ds` =
[jira] [Commented] (SPARK-8287) Filter not push down through Subquery or View
[ https://issues.apache.org/jira/browse/SPARK-8287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580413#comment-14580413 ] Li Sheng commented on SPARK-8287:

I think we cannot simply run `EliminateSubQueries` in the Optimizer; the Subquery scope is useful when doing optimizations that involve Subquery-related logic, and it is useful for this scenario.

Filter not push down through Subquery or View
-
Key: SPARK-8287
URL: https://issues.apache.org/jira/browse/SPARK-8287
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.3.1
Reporter: Li Sheng
Fix For: 1.4.0
Original Estimate: 40h
Remaining Estimate: 40h

Filters are not pushed down through a Subquery or a View. If two big partitioned tables are joined inside a Subquery or a View and the filter is not pushed down, this causes a full-partition join and performance problems.

Let me give an example that reproduces the problem:

{code:sql}
create table src(key int, value string);

-- Creates partitioned table and imports data
CREATE TABLE src_partitioned1 (key int, value STRING) PARTITIONED BY (ds STRING);
insert overwrite table src_partitioned1 PARTITION (ds='1') select key, value from src;
insert overwrite table src_partitioned1 PARTITION (ds='2') select key, value from src;
CREATE TABLE src_partitioned2 (key int, value STRING) PARTITIONED BY (ds STRING);
insert overwrite table src_partitioned2 PARTITION (ds='1') select key, value from src;
insert overwrite table src_partitioned2 PARTITION (ds='2') select key, value from src;

-- Creates views
create view src_view as select sum(a.key) s1, sum(b.key) s2, b.ds ds from src_partitioned1 a join src_partitioned2 b on a.ds = b.ds group by b.ds;
create view src_view_1 as select sum(a.key) s1, sum(b.key) s2, b.ds my_ds from src_partitioned1 a join src_partitioned2 b on a.ds = b.ds group by b.ds;

-- QueryExecution
select * from dw.src_view where ds='2';
{code}

{noformat}
sql("select * from dw.src_view where ds='2'").queryExecution

== Parsed Logical Plan ==
'Project [*]
 'Filter ('ds = 2)
  'UnresolvedRelation [dw,src_view], None

== Analyzed Logical Plan ==
Project [s1#60L,s2#61L,ds#62]
 Filter (ds#62 = 2)
  Subquery src_view
   Aggregate [ds#66], [SUM(CAST(key#64, LongType)) AS s1#60L,SUM(CAST(key#67, LongType)) AS s2#61L,ds#66 AS ds#62]
    Join Inner, Some((ds#63 = ds#66))
     MetastoreRelation dw, src_partitioned1, Some(a)
     MetastoreRelation dw, src_partitioned2, Some(b)

== Optimized Logical Plan ==
Filter (ds#62 = 2)
 Aggregate [ds#66], [SUM(CAST(key#64, LongType)) AS s1#60L,SUM(CAST(key#67, LongType)) AS s2#61L,ds#66 AS ds#62]
  Project [ds#66,key#64,key#67]
   Join Inner, Some((ds#63 = ds#66))
    Project [key#64,ds#63]
     MetastoreRelation dw, s...
{noformat}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
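For context, the rule under discussion is essentially the following (paraphrasing the 1.3/1.4-era Catalyst code): it strips every Subquery wrapper wholesale, which discards exactly the scope information the comment above wants to preserve.

{code}
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Subquery}

// Roughly what EliminateSubQueries does: replace each Subquery node with
// its child, erasing the subquery scope from the plan.
object EliminateSubQueries extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case Subquery(_, child) => child
  }
}
{code}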
[jira] [Comment Edited] (SPARK-8275) HistoryServer caches incomplete App UIs
[ https://issues.apache.org/jira/browse/SPARK-8275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14579273#comment-14579273 ] Steve Loughran edited comment on SPARK-8275 at 6/10/15 11:43 AM:

The cache code is from Google; the history server provides a method to get the data for an entry, but there's no logic in the cache itself to apply a refresh time to entries. One solution would be:
# cache entries to include a timestamp and completed flag alongside the SparkUI instances
# direct all cache.get operations through a single method in HistoryServer
# have that method do something like

{code}
def getUI(id: String): SparkUI = {
  var cacheEntry = cache.get(id)
  if (!cacheEntry.completed && (cacheEntry.timestamp + expiryTime) < now()) {
    cache.release(id)
    cacheEntry = cache.get(id)
  }
  cacheEntry
}
{code}

This will leave out-of-date entries in the cache, but any retrieval will trigger the rebuild.

was (Author: ste...@apache.org):
The cache code is from Google; the history server provides a method to get the data for an entry, but there's no logic in the cache itself to apply a refresh time to entries. One solution would be:
# cache entries to include a timestamp and completed flag alongside the SparkUI instances
# direct all cache.get operations through a single method in HistoryServer
# have that method do something like

{code}
def getUI(id: String): SparkUI = {
  var cacheEntry = cache.get(id)
  if (!cacheEntry.completed && (cacheEntry.timestamp + expiryTime) < now()) {
    cache.release(id)
    cacheEntry = cache.get(id)
  }
  cacheEntry
}
{code}

This will leave out-of-date entries in the cache, but any retrieval will trigger the rebuild.

HistoryServer caches incomplete App UIs
-
Key: SPARK-8275
URL: https://issues.apache.org/jira/browse/SPARK-8275
Project: Spark
Issue Type: Bug
Components: Web UI
Affects Versions: 1.3.1
Reporter: Steve Loughran

The history server caches applications retrieved from the {{ApplicationHistoryProvider.getAppUI()}} call for performance: it's expensive to rebuild. However, this cache also includes incomplete applications as well as completed ones, and it never attempts to refresh an incomplete application. As a result, if you do a GET of the history of a running application, even after the application is finished you'll still get the web UI/history as it was when that first GET was issued.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
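One possible shape for the enriched cache entry from step 1 above (names are illustrative, not from an actual patch):

{code}
// Hypothetical cache value pairing the UI with the metadata needed to
// decide whether a cached, still-incomplete application is stale.
case class CacheEntry(ui: SparkUI, completed: Boolean, timestamp: Long)
{code}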
[jira] [Issue Comment Deleted] (SPARK-1537) Add integration with Yarn's Application Timeline Server
[ https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated SPARK-1537: -- Comment: was deleted (was: Full application log. Application hasn't actually stopped, which is interesting. {code} $ dist/bin/spark-submit \ --class org.apache.spark.examples.SparkPi \ --properties-file ../clusterconfigs/clusters/devix/spark/spark-defaults.conf \ --master yarn-client \ --executor-memory 128m \ --num-executors 1 \ --executor-cores 1 \ --driver-memory 128m \ dist/lib/spark-examples-1.5.0-SNAPSHOT-hadoop2.6.0.jar 12 2015-06-09 17:01:59,596 [main] INFO spark.SparkContext (Logging.scala:logInfo(59)) - Running Spark version 1.5.0-SNAPSHOT 2015-06-09 17:02:01,309 [sparkDriver-akka.actor.default-dispatcher-2] INFO slf4j.Slf4jLogger (Slf4jLogger.scala:applyOrElse(80)) - Slf4jLogger started 2015-06-09 17:02:01,359 [sparkDriver-akka.actor.default-dispatcher-2] INFO Remoting (Slf4jLogger.scala:apply$mcV$sp(74)) - Starting remoting 2015-06-09 17:02:01,542 [sparkDriver-akka.actor.default-dispatcher-2] INFO Remoting (Slf4jLogger.scala:apply$mcV$sp(74)) - Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.1.86:51476] 2015-06-09 17:02:01,549 [main] INFO util.Utils (Logging.scala:logInfo(59)) - Successfully started service 'sparkDriver' on port 51476. 2015-06-09 17:02:01,568 [main] INFO spark.SparkEnv (Logging.scala:logInfo(59)) - Registering MapOutputTracker 2015-06-09 17:02:01,587 [main] INFO spark.SparkEnv (Logging.scala:logInfo(59)) - Registering BlockManagerMaster 2015-06-09 17:02:01,831 [main] INFO spark.HttpServer (Logging.scala:logInfo(59)) - Starting HTTP Server 2015-06-09 17:02:01,891 [main] INFO util.Utils (Logging.scala:logInfo(59)) - Successfully started service 'HTTP file server' on port 51477. 2015-06-09 17:02:01,905 [main] INFO spark.SparkEnv (Logging.scala:logInfo(59)) - Registering OutputCommitCoordinator 2015-06-09 17:02:02,038 [main] INFO util.Utils (Logging.scala:logInfo(59)) - Successfully started service 'SparkUI' on port 4040. 
2015-06-09 17:02:02,039 [main] INFO ui.SparkUI (Logging.scala:logInfo(59)) - Started SparkUI at http://192.168.1.86:4040 2015-06-09 17:02:03,071 [main] INFO spark.SparkContext (Logging.scala:logInfo(59)) - Added JAR file:/Users/stevel/Projects/Hortonworks/Projects/sparkwork/spark/dist/lib/spark-examples-1.5.0-SNAPSHOT-hadoop2.6.0.jar at http://192.168.1.86:51477/jars/spark-examples-1.5.0-SNAPSHOT-hadoop2.6.0.jar with timestamp 1433865723062 2015-06-09 17:02:03,691 [main] INFO impl.TimelineClientImpl (TimelineClientImpl.java:serviceInit(285)) - Timeline service address: http://devix.cotham.uk:8188/ws/v1/timeline/ 2015-06-09 17:02:03,808 [main] INFO client.RMProxy (RMProxy.java:createRMProxy(98)) - Connecting to ResourceManager at devix.cotham.uk/192.168.1.134:8050 2015-06-09 17:02:04,577 [main] INFO yarn.Client (Logging.scala:logInfo(59)) - Requesting a new application from cluster with 1 NodeManagers 2015-06-09 17:02:04,637 [main] INFO yarn.Client (Logging.scala:logInfo(59)) - Verifying our application has not requested more than the maximum memory capability of the cluster (2048 MB per container) 2015-06-09 17:02:04,637 [main] INFO yarn.Client (Logging.scala:logInfo(59)) - Will allocate AM container, with 896 MB memory including 384 MB overhead 2015-06-09 17:02:04,638 [main] INFO yarn.Client (Logging.scala:logInfo(59)) - Setting up container launch context for our AM 2015-06-09 17:02:04,643 [main] INFO yarn.Client (Logging.scala:logInfo(59)) - Preparing resources for our AM container 2015-06-09 17:02:05,096 [main] WARN shortcircuit.DomainSocketFactory (DomainSocketFactory.java:init(116)) - The short-circuit local reads feature cannot be used because libhadoop cannot be loaded. 2015-06-09 17:02:05,106 [main] DEBUG yarn.YarnSparkHadoopUtil (Logging.scala:logDebug(63)) - delegation token renewer is: rm/devix.cotham.uk@COTHAM 2015-06-09 17:02:05,107 [main] INFO yarn.YarnSparkHadoopUtil (Logging.scala:logInfo(59)) - getting token for namenode: hdfs://devix.cotham.uk:8020/user/stevel/.sparkStaging/application_1433777033372_0005 2015-06-09 17:02:06,129 [main] DEBUG yarn.Client (Logging.scala:logDebug(63)) - HiveMetaStore configured in localmode 2015-06-09 17:02:06,130 [main] DEBUG yarn.Client (Logging.scala:logDebug(63)) - HBase Class not found: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.HBaseConfiguration 2015-06-09 17:02:06,225 [main] INFO yarn.Client
[jira] [Commented] (SPARK-7756) Ensure Spark runs clean on IBM Java implementation
[ https://issues.apache.org/jira/browse/SPARK-7756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580237#comment-14580237 ] Apache Spark commented on SPARK-7756: - User 'a-roberts' has created a pull request for this issue: https://github.com/apache/spark/pull/6740 Ensure Spark runs clean on IBM Java implementation -- Key: SPARK-7756 URL: https://issues.apache.org/jira/browse/SPARK-7756 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Tim Ellison Assignee: Tim Ellison Priority: Minor Fix For: 1.4.0 Spark should run successfully on the IBM Java implementation. This issue is to gather any minor issues seen running the tests and examples that are attributable to differences in Java vendor. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8289) Provide a specific stack size with all Java implementations to prevent stack overflows with certain tests
Adam Roberts created SPARK-8289: --- Summary: Provide a specific stack size with all Java implementations to prevent stack overflows with certain tests Key: SPARK-8289 URL: https://issues.apache.org/jira/browse/SPARK-8289 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 1.5.0 Environment: Anywhere whereby the Java vendor is not OpenJDK Reporter: Adam Roberts Fix For: 1.5.0 Default stack sizes differ per Java implementation - so tests can pass for those with higher stack sizes (OpenJDK) but will fail with Oracle or IBM Java owing to lower default sizes. In particular we can see this happening with the JavaALSSuite - with 15 iterations, we get stackoverflow errors with Oracle and IBM Java. We don't with OpenJDK. This JIRA aims to address such an issue by providing a default specified stack size to be used for all Java distributions: 4096k specified for both SBT test args and for Maven test args (changing project/ScalaBuild.scala and pom.xml respectively). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
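For illustration, a minimal sketch of how the fixed stack size could be wired into the SBT build (the exact setting names in Spark's build files may differ; {{-Xss4096k}} is the JVM flag in question, and the Maven side would pass the same flag through the test runner's argLine):
{code}
// Hypothetical SBT (Scala) build fragment: fork test JVMs and pin their
// stack size so results do not depend on the Java vendor's default.
fork in Test := true
javaOptions in Test += "-Xss4096k"
{code}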
[jira] [Commented] (SPARK-8288) ScalaReflection should also try apply methods defined in companion objects when inferring schema from a Product type
[ https://issues.apache.org/jira/browse/SPARK-8288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580244#comment-14580244 ] Cheng Lian commented on SPARK-8288: --- cc [~zzztimbo] ScalaReflection should also try apply methods defined in companion objects when inferring schema from a Product type Key: SPARK-8288 URL: https://issues.apache.org/jira/browse/SPARK-8288 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Cheng Lian This ticket is derived from PARQUET-293 (which actually describes a Spark SQL issue). My comment on that issue is quoted below: {quote} ... The reason for this exception is that the Scala code Scrooge generates is actually a trait extending {{Product}}: {code} trait Junk extends ThriftStruct with scala.Product2[Long, String] with java.io.Serializable {code} while Spark expects a case class, something like: {code} case class Junk(junkID: Long, junkString: String) {code} The key difference here is that the latter case class version has a constructor whose arguments can be transformed into fields of the DataFrame schema. The exception was thrown because Spark can't find such a constructor for trait {{Junk}}. {quote} We can make {{ScalaReflection}} try {{apply}} methods in companion objects, so that trait types generated by Scrooge can also be used for Spark SQL schema inference. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
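To make the proposal concrete, here is a minimal sketch (not Scrooge's actual generated code) of a {{Product}} trait plus a companion-object {{apply}} whose parameter list carries the same information as a case-class constructor:
{code}
// Hypothetical sketch: schema inference could read the apply parameters
// (junkID: Long, junkString: String) just as it reads a case-class constructor.
trait Junk extends Product2[Long, String] with Serializable {
  def junkID: Long
  def junkString: String
  def _1: Long = junkID
  def _2: String = junkString
  override def canEqual(that: Any): Boolean = that.isInstanceOf[Junk]
}

object Junk {
  def apply(junkID: Long, junkString: String): Junk = {
    val (id, str) = (junkID, junkString)
    new Junk {
      def junkID: Long = id
      def junkString: String = str
    }
  }
}
{code}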
[jira] [Commented] (SPARK-8286) Rewrite UTF8String in Java and move it into unsafe package.
[ https://issues.apache.org/jira/browse/SPARK-8286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580160#comment-14580160 ] Apache Spark commented on SPARK-8286: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/6738 Rewrite UTF8String in Java and move it into unsafe package. --- Key: SPARK-8286 URL: https://issues.apache.org/jira/browse/SPARK-8286 Project: Spark Issue Type: Bug Reporter: Reynold Xin Assignee: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8286) Rewrite UTF8String in Java and move it into unsafe package.
[ https://issues.apache.org/jira/browse/SPARK-8286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8286: --- Assignee: Reynold Xin (was: Apache Spark) Rewrite UTF8String in Java and move it into unsafe package. --- Key: SPARK-8286 URL: https://issues.apache.org/jira/browse/SPARK-8286 Project: Spark Issue Type: Bug Reporter: Reynold Xin Assignee: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8286) Rewrite UTF8String in Java and move it into unsafe package.
[ https://issues.apache.org/jira/browse/SPARK-8286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8286: --- Assignee: Apache Spark (was: Reynold Xin) Rewrite UTF8String in Java and move it into unsafe package. --- Key: SPARK-8286 URL: https://issues.apache.org/jira/browse/SPARK-8286 Project: Spark Issue Type: Bug Reporter: Reynold Xin Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7886) Add built-in expressions to FunctionRegistry
[ https://issues.apache.org/jira/browse/SPARK-7886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-7886: --- Assignee: Reynold Xin Add built-in expressions to FunctionRegistry Key: SPARK-7886 URL: https://issues.apache.org/jira/browse/SPARK-7886 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Priority: Blocker Fix For: 1.5.0 Once we do this, we no longer need to hardcode expressions into the parser (both for internal SQL and Hive QL). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7886) Add built-in expressions to FunctionRegistry
[ https://issues.apache.org/jira/browse/SPARK-7886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580175#comment-14580175 ] Apache Spark commented on SPARK-7886: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/6739 Add built-in expressions to FunctionRegistry Key: SPARK-7886 URL: https://issues.apache.org/jira/browse/SPARK-7886 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Priority: Blocker Fix For: 1.5.0 Once we do this, we no longer need to hardcode expressions into the parser (both for internal SQL and Hive QL). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4176) Support decimals with precision > 18 in Parquet
[ https://issues.apache.org/jira/browse/SPARK-4176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580206#comment-14580206 ] Rene Treffer commented on SPARK-4176: - Given that SPARK-7897 loads unsigned bigint as DecimalType.unlimited, you can now load tables with unsigned bigint but won't be able to store them in Parquet files :-( Support decimals with precision > 18 in Parquet --- Key: SPARK-4176 URL: https://issues.apache.org/jira/browse/SPARK-4176 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 1.3.0, 1.3.1, 1.4.0 Reporter: Matei Zaharia Assignee: Cheng Lian After https://issues.apache.org/jira/browse/SPARK-3929, only decimals with precisions <= 18 (that can be read into a Long) will be readable from Parquet, so we still need more work to support these larger ones. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-4127) Streaming Linear Regression- Python bindings
[ https://issues.apache.org/jira/browse/SPARK-4127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-4127: --- Assignee: Apache Spark Streaming Linear Regression- Python bindings Key: SPARK-4127 URL: https://issues.apache.org/jira/browse/SPARK-4127 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Anant Daksh Asthana Assignee: Apache Spark Create Python bindings for Streaming Linear Regression (MLlib). The MLlib file relevant to this issue can be found at: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/StreamingLinearRegression.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-4127) Streaming Linear Regression- Python bindings
[ https://issues.apache.org/jira/browse/SPARK-4127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-4127: --- Assignee: (was: Apache Spark) Streaming Linear Regression- Python bindings Key: SPARK-4127 URL: https://issues.apache.org/jira/browse/SPARK-4127 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Anant Daksh Asthana Create Python bindings for Streaming Linear Regression (MLlib). The MLlib file relevant to this issue can be found at: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/StreamingLinearRegression.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4127) Streaming Linear Regression- Python bindings
[ https://issues.apache.org/jira/browse/SPARK-4127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580507#comment-14580507 ] Apache Spark commented on SPARK-4127: - User 'MechCoder' has created a pull request for this issue: https://github.com/apache/spark/pull/6744 Streaming Linear Regression- Python bindings Key: SPARK-4127 URL: https://issues.apache.org/jira/browse/SPARK-4127 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Anant Daksh Asthana Create Python bindings for Streaming Linear Regression (MLlib). The MLlib file relevant to this issue can be found at: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/StreamingLinearRegression.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
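For reference, the existing Scala API that the Python bindings would mirror looks roughly like this sketch (the stream arguments and the feature count are made up for illustration):
{code}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}
import org.apache.spark.streaming.dstream.DStream

def runStreamingRegression(trainingStream: DStream[LabeledPoint],
                           testStream: DStream[Vector]): Unit = {
  val model = new StreamingLinearRegressionWithSGD()
    .setInitialWeights(Vectors.zeros(3))  // 3 features, for illustration
  model.trainOn(trainingStream)           // update the model on each training batch
  model.predictOn(testStream).print()     // emit a prediction per test batch
}
{code}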
[jira] [Assigned] (SPARK-8289) Provide a specific stack size with all Java implementations to prevent stack overflows with certain tests
[ https://issues.apache.org/jira/browse/SPARK-8289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8289: --- Assignee: (was: Apache Spark) Provide a specific stack size with all Java implementations to prevent stack overflows with certain tests - Key: SPARK-8289 URL: https://issues.apache.org/jira/browse/SPARK-8289 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 1.5.0 Environment: Anywhere whereby the Java vendor is not OpenJDK Reporter: Adam Roberts Fix For: 1.5.0 Default stack sizes differ per Java implementation - so tests can pass for those with higher stack sizes (OpenJDK) but will fail with Oracle or IBM Java owing to lower default sizes. In particular we can see this happening with the JavaALSSuite - with 15 iterations, we get stackoverflow errors with Oracle and IBM Java. We don't with OpenJDK. This JIRA aims to address such an issue by providing a default specified stack size to be used for all Java distributions: 4096k specified for both SBT test args and for Maven test args (changing project/ScalaBuild.scala and pom.xml respectively). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8289) Provide a specific stack size with all Java implementations to prevent stack overflows with certain tests
[ https://issues.apache.org/jira/browse/SPARK-8289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8289: --- Assignee: Apache Spark Provide a specific stack size with all Java implementations to prevent stack overflows with certain tests - Key: SPARK-8289 URL: https://issues.apache.org/jira/browse/SPARK-8289 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 1.5.0 Environment: Anywhere whereby the Java vendor is not OpenJDK Reporter: Adam Roberts Assignee: Apache Spark Fix For: 1.5.0 Default stack sizes differ per Java implementation - so tests can pass for those with higher stack sizes (OpenJDK) but will fail with Oracle or IBM Java owing to lower default sizes. In particular we can see this happening with the JavaALSSuite - with 15 iterations, we get stackoverflow errors with Oracle and IBM Java. We don't with OpenJDK. This JIRA aims to address such an issue by providing a default specified stack size to be used for all Java distributions: 4096k specified for both SBT test args and for Maven test args (changing project/ScalaBuild.scala and pom.xml respectively). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7815) Enable UTF8String to work against memory address directly
[ https://issues.apache.org/jira/browse/SPARK-7815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-7815: --- Summary: Enable UTF8String to work against memory address directly (was: Move UTF8String into Unsafe java package, and have it work against memory address directly) Enable UTF8String to work against memory address directly - Key: SPARK-7815 URL: https://issues.apache.org/jira/browse/SPARK-7815 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Davies Liu So we can avoid an extra copy of the data into a byte array. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8286) Rewrite UTF8String in Java and move it into unsafe package.
Reynold Xin created SPARK-8286: -- Summary: Rewrite UTF8String in Java and move it into unsafe package. Key: SPARK-8286 URL: https://issues.apache.org/jira/browse/SPARK-8286 Project: Spark Issue Type: Bug Reporter: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8290) spark class command builder need read SPARK_JAVA_OPTS
Tao Wang created SPARK-8290: --- Summary: spark class command builder need read SPARK_JAVA_OPTS Key: SPARK-8290 URL: https://issues.apache.org/jira/browse/SPARK-8290 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Tao Wang Priority: Minor SPARK_JAVA_OPTS was missed in reconstructing the launcher part; we should add it back so spark-class can read it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8286) Rewrite UTF8String in Java and move it into unsafe package.
[ https://issues.apache.org/jira/browse/SPARK-8286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin reassigned SPARK-8286: -- Assignee: Reynold Xin Rewrite UTF8String in Java and move it into unsafe package. --- Key: SPARK-8286 URL: https://issues.apache.org/jira/browse/SPARK-8286 Project: Spark Issue Type: Bug Reporter: Reynold Xin Assignee: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8288) ScalaReflection should also try apply methods defined in companion objects when inferring schema from a Product type
Cheng Lian created SPARK-8288: - Summary: ScalaReflection should also try apply methods defined in companion objects when inferring schema from a Product type Key: SPARK-8288 URL: https://issues.apache.org/jira/browse/SPARK-8288 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Cheng Lian This ticket is derived from PARQUET-293 (which actually describes a Spark SQL issue). My comment on that issue is quoted below: {quote} ... The reason for this exception is that the Scala code Scrooge generates is actually a trait extending {{Product}}: {code} trait Junk extends ThriftStruct with scala.Product2[Long, String] with java.io.Serializable {code} while Spark expects a case class, something like: {code} case class Junk(junkID: Long, junkString: String) {code} The key difference here is that the latter case class version has a constructor whose arguments can be transformed into fields of the DataFrame schema. The exception was thrown because Spark can't find such a constructor for trait {{Junk}}. {quote} We can make {{ScalaReflection}} try {{apply}} methods in companion objects, so that trait types generated by Scrooge can also be used for Spark SQL schema inference. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8287) Filter not push down through Subquery or View
[ https://issues.apache.org/jira/browse/SPARK-8287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580257#comment-14580257 ] Li Sheng commented on SPARK-8287: - [~marmbrus] [~lian cheng] Filter not push down through Subquery or View - Key: SPARK-8287 URL: https://issues.apache.org/jira/browse/SPARK-8287 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1 Reporter: Li Sheng Fix For: 1.4.0 Original Estimate: 40h Remaining Estimate: 40h Filter is not pushed down through a Subquery or View. Assume we have two big partitioned tables joined inside a Subquery or a View and the filter is not pushed down; this will cause a full-partition join and performance issues. Let me give an example that can reproduce the problem: create table src(key int, value string); -- Creates partitioned tables and imports data CREATE TABLE src_partitioned1 (key int, value STRING) PARTITIONED BY (ds STRING); insert overwrite table src_partitioned1 PARTITION (ds='1') select key, value from src; insert overwrite table src_partitioned1 PARTITION (ds='2') select key, value from src; CREATE TABLE src_partitioned2 (key int, value STRING) PARTITIONED BY (ds STRING); insert overwrite table src_partitioned2 PARTITION (ds='1') select key, value from src; insert overwrite table src_partitioned2 PARTITION (ds='2') select key, value from src; -- Creates views create view src_view as select sum(a.key) s1, sum(b.key) s2, b.ds ds from src_partitioned1 a join src_partitioned2 b on a.ds = b.ds group by b.ds create view src_view_1 as select sum(a.key) s1, sum(b.key) s2, b.ds my_ds from src_partitioned1 a join src_partitioned2 b on a.ds = b.ds group by b.ds -- QueryExecution select * from dw.src_view where ds='2' sql(select * from dw.src_view where ds='2' ).queryExecution == Parsed Logical Plan == 'Project [*] 'Filter ('ds = 2) 'UnresolvedRelation [dw,src_view], None == Analyzed Logical Plan == Project [s1#60L,s2#61L,ds#62] Filter (ds#62 = 2) Subquery src_view Aggregate [ds#66], [SUM(CAST(key#64, LongType)) AS s1#60L,SUM(CAST(key#67, LongType)) AS s2#61L,ds#66 AS ds#62] Join Inner, Some((ds#63 = ds#66)) MetastoreRelation dw, src_partitioned1, Some(a) MetastoreRelation dw, src_partitioned2, Some(b) == Optimized Logical Plan == Filter (ds#62 = 2) Aggregate [ds#66], [SUM(CAST(key#64, LongType)) AS s1#60L,SUM(CAST(key#67, LongType)) AS s2#61L,ds#66 AS ds#62] Project [ds#66,key#64,key#67] Join Inner, Some((ds#63 = ds#66)) Project [key#64,ds#63] MetastoreRelation dw, s... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8287) Filter not push down through Subquery or View
Li Sheng created SPARK-8287: --- Summary: Filter not push down through Subquery or View Key: SPARK-8287 URL: https://issues.apache.org/jira/browse/SPARK-8287 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1 Reporter: Li Sheng Fix For: 1.4.0 Filter is not pushed down through a Subquery or View. Assume we have two big partitioned tables joined inside a Subquery or a View and the filter is not pushed down; this will cause a full-partition join and performance issues. Let me give an example that can reproduce the problem: create table src(key int, value string); -- Creates partitioned tables and imports data CREATE TABLE src_partitioned1 (key int, value STRING) PARTITIONED BY (ds STRING); insert overwrite table src_partitioned1 PARTITION (ds='1') select key, value from src; insert overwrite table src_partitioned1 PARTITION (ds='2') select key, value from src; CREATE TABLE src_partitioned2 (key int, value STRING) PARTITIONED BY (ds STRING); insert overwrite table src_partitioned2 PARTITION (ds='1') select key, value from src; insert overwrite table src_partitioned2 PARTITION (ds='2') select key, value from src; -- Creates views create view src_view as select sum(a.key) s1, sum(b.key) s2, b.ds ds from src_partitioned1 a join src_partitioned2 b on a.ds = b.ds group by b.ds create view src_view_1 as select sum(a.key) s1, sum(b.key) s2, b.ds my_ds from src_partitioned1 a join src_partitioned2 b on a.ds = b.ds group by b.ds -- QueryExecution select * from dw.src_view where ds='2' sql(select * from dw.src_view where ds='2' ).queryExecution == Parsed Logical Plan == 'Project [*] 'Filter ('ds = 2) 'UnresolvedRelation [dw,src_view], None == Analyzed Logical Plan == Project [s1#60L,s2#61L,ds#62] Filter (ds#62 = 2) Subquery src_view Aggregate [ds#66], [SUM(CAST(key#64, LongType)) AS s1#60L,SUM(CAST(key#67, LongType)) AS s2#61L,ds#66 AS ds#62] Join Inner, Some((ds#63 = ds#66)) MetastoreRelation dw, src_partitioned1, Some(a) MetastoreRelation dw, src_partitioned2, Some(b) == Optimized Logical Plan == Filter (ds#62 = 2) Aggregate [ds#66], [SUM(CAST(key#64, LongType)) AS s1#60L,SUM(CAST(key#67, LongType)) AS s2#61L,ds#66 AS ds#62] Project [ds#66,key#64,key#67] Join Inner, Some((ds#63 = ds#66)) Project [key#64,ds#63] MetastoreRelation dw, s... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
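Until the optimizer learns to push such predicates through a Subquery or View, a manual workaround (a sketch against the schema above) is to write the partition predicate on the base tables yourself instead of filtering the view from outside:
{code}
import org.apache.spark.sql.hive.HiveContext

// Hypothetical workaround: the predicate lands directly on the partitioned
// MetastoreRelations, so only partition ds='2' is scanned.
def pushedDown(hiveContext: HiveContext) = hiveContext.sql("""
  SELECT sum(a.key) s1, sum(b.key) s2, b.ds ds
  FROM src_partitioned1 a JOIN src_partitioned2 b ON a.ds = b.ds
  WHERE a.ds = '2' AND b.ds = '2'
  GROUP BY b.ds
""")
{code}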
[jira] [Assigned] (SPARK-8290) spark class command builder need read SPARK_JAVA_OPTS
[ https://issues.apache.org/jira/browse/SPARK-8290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8290: --- Assignee: (was: Apache Spark) spark class command builder need read SPARK_JAVA_OPTS - Key: SPARK-8290 URL: https://issues.apache.org/jira/browse/SPARK-8290 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Tao Wang Priority: Minor SPARK_JAVA_OPTS was missed in reconstructing the launcher part; we should add it back so spark-class can read it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8290) spark class command builder need read SPARK_JAVA_OPTS
[ https://issues.apache.org/jira/browse/SPARK-8290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8290: --- Assignee: Apache Spark spark class command builder need read SPARK_JAVA_OPTS - Key: SPARK-8290 URL: https://issues.apache.org/jira/browse/SPARK-8290 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Tao Wang Assignee: Apache Spark Priority: Minor SPARK_JAVA_OPTS was missed in reconstructing the launcher part; we should add it back so spark-class can read it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8290) spark class command builder need read SPARK_JAVA_OPTS
[ https://issues.apache.org/jira/browse/SPARK-8290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580290#comment-14580290 ] Apache Spark commented on SPARK-8290: - User 'WangTaoTheTonic' has created a pull request for this issue: https://github.com/apache/spark/pull/6741 spark class command builder need read SPARK_JAVA_OPTS - Key: SPARK-8290 URL: https://issues.apache.org/jira/browse/SPARK-8290 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Tao Wang Priority: Minor SPARK_JAVA_OPTS was missed in reconstructing the launcher part; we should add it back so spark-class can read it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8291) Add parse functionality to LabeledPoint in PySpark
Manoj Kumar created SPARK-8291: -- Summary: Add parse functionality to LabeledPoint in PySpark Key: SPARK-8291 URL: https://issues.apache.org/jira/browse/SPARK-8291 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Reporter: Manoj Kumar Priority: Minor It is useful to have functionality that can parse a string into a LabeledPoint while loading files, etc -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8292) ShortestPaths run with error result
Bruce Chen created SPARK-8292: - Summary: ShortestPaths run with error result Key: SPARK-8292 URL: https://issues.apache.org/jira/browse/SPARK-8292 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.3.1 Environment: Ubuntu 64bit Reporter: Bruce Chen Fix For: 1.3.1 In graphx/lib/ShortestPaths, I run an example with input data: 0\t2 0\t4 2\t3 3\t6 4\t2 4\t5 5\t3 5\t6 Then I write a function that sets point '0' as the source point and calculates the shortest paths from point 0 to the other points; the code looks like this: val source: Seq[VertexId] = Seq(0) val ss = ShortestPaths.run(graph, source) Then I get the shortest-path result for every vertex: (4,Map()) (0,Map(0 -> 0)) (6,Map()) (3,Map()) (5,Map()) (2,Map()) but the right result should be: (4,Map(0 -> 1)) (0,Map(0 -> 0)) (6,Map(0 -> 3)) (3,Map(0 -> 2)) (5,Map(0 -> 2)) (2,Map(0 -> 1)) So I checked the source code of spark/graphx/src/main/scala/org/apache/spark/graphx/lib/ShortestPaths.scala and found a bug. The patch is attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
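A self-contained reproduction of the report might look like this sketch ({{edges.txt}} is a hypothetical file holding the tab-separated pairs above):
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{GraphLoader, VertexId}
import org.apache.spark.graphx.lib.ShortestPaths

object ShortestPathsRepro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("ShortestPathsRepro").setMaster("local[2]"))
    // edges.txt contains the pairs from the report: "0\t2", "0\t4", ...
    val graph = GraphLoader.edgeListFile(sc, "edges.txt")
    val landmarks: Seq[VertexId] = Seq(0L)
    // Expected (per the report): (4,Map(0 -> 1)), (6,Map(0 -> 3)), ...
    ShortestPaths.run(graph, landmarks).vertices.collect().foreach(println)
    sc.stop()
  }
}
{code}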
[jira] [Assigned] (SPARK-8291) Add parse functionality to LabeledPoint in PySpark
[ https://issues.apache.org/jira/browse/SPARK-8291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8291: --- Assignee: Apache Spark Add parse functionality to LabeledPoint in PySpark -- Key: SPARK-8291 URL: https://issues.apache.org/jira/browse/SPARK-8291 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Reporter: Manoj Kumar Assignee: Apache Spark Priority: Minor It is useful to have functionality that can parse a string into a LabeledPoint while loading files, etc -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7179) Add pattern after show tables to filter desire tablename
[ https://issues.apache.org/jira/browse/SPARK-7179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580527#comment-14580527 ] Apache Spark commented on SPARK-7179: - User 'baishuo' has created a pull request for this issue: https://github.com/apache/spark/pull/6745 Add pattern after show tables to filter desire tablename -- Key: SPARK-7179 URL: https://issues.apache.org/jira/browse/SPARK-7179 Project: Spark Issue Type: New Feature Components: SQL Reporter: baishuo Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
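For illustration, the Hive behaviour this feature would mimic looks like the following sketch (assumed syntax: in Hive's patterns '*' matches any characters and '|' separates alternatives; {{sqlContext}} is an existing HiveContext):
{code}
// With the feature, only tables matching the pattern should be returned.
sqlContext.sql("SHOW TABLES 'src_*'").show()
sqlContext.sql("SHOW TABLES 'src_partitioned1|src_partitioned2'").show()
{code}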
[jira] [Assigned] (SPARK-7179) Add pattern after show tables to filter desire tablename
[ https://issues.apache.org/jira/browse/SPARK-7179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7179: --- Assignee: (was: Apache Spark) Add pattern after show tables to filter desire tablename -- Key: SPARK-7179 URL: https://issues.apache.org/jira/browse/SPARK-7179 Project: Spark Issue Type: New Feature Components: SQL Reporter: baishuo Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8291) Add parse functionality to LabeledPoint in PySpark
[ https://issues.apache.org/jira/browse/SPARK-8291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8291: --- Assignee: (was: Apache Spark) Add parse functionality to LabeledPoint in PySpark -- Key: SPARK-8291 URL: https://issues.apache.org/jira/browse/SPARK-8291 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Reporter: Manoj Kumar Priority: Minor It is useful to have functionality that can parse a string into a LabeledPoint while loading files, etc -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8291) Add parse functionality to LabeledPoint in PySpark
[ https://issues.apache.org/jira/browse/SPARK-8291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580546#comment-14580546 ] Apache Spark commented on SPARK-8291: - User 'MechCoder' has created a pull request for this issue: https://github.com/apache/spark/pull/6746 Add parse functionality to LabeledPoint in PySpark -- Key: SPARK-8291 URL: https://issues.apache.org/jira/browse/SPARK-8291 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Reporter: Manoj Kumar Priority: Minor It is useful to have functionality that can parse a string into a LabeledPoint while loading files, etc -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
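For reference, the Scala side of MLlib already exposes this; the PySpark version would mirror something like the following sketch (the string format shown is the one produced by {{LabeledPoint.toString}}):
{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val lp = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
val s = lp.toString                    // "(1.0,[1.0,0.0,3.0])"
val roundTrip = LabeledPoint.parse(s)  // parses the string back into a LabeledPoint
{code}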
[jira] [Updated] (SPARK-8292) ShortestPaths run with error result
[ https://issues.apache.org/jira/browse/SPARK-8292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8292: - Priority: Minor (was: Major) Target Version/s: (was: 1.3.1) Fix Version/s: (was: 1.3.1) [~hangc0276] Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark This is not how you propose a patch and you've set some fields that shouldn't be here. You should also explain the fix. ShortestPaths run with error result --- Key: SPARK-8292 URL: https://issues.apache.org/jira/browse/SPARK-8292 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.3.1 Environment: Ubuntu 64bit Reporter: Bruce Chen Priority: Minor Labels: patch Attachments: ShortestPaths.patch In graphx/lib/ShortestPaths, I run an example with input data: 0\t2 0\t4 2\t3 3\t6 4\t2 4\t5 5\t3 5\t6 Then I write a function that sets point '0' as the source point and calculates the shortest paths from point 0 to the other points; the code looks like this: val source: Seq[VertexId] = Seq(0) val ss = ShortestPaths.run(graph, source) Then I get the shortest-path result for every vertex: (4,Map()) (0,Map(0 -> 0)) (6,Map()) (3,Map()) (5,Map()) (2,Map()) but the right result should be: (4,Map(0 -> 1)) (0,Map(0 -> 0)) (6,Map(0 -> 3)) (3,Map(0 -> 2)) (5,Map(0 -> 2)) (2,Map(0 -> 1)) So I checked the source code of spark/graphx/src/main/scala/org/apache/spark/graphx/lib/ShortestPaths.scala and found a bug. The patch is attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7179) Add pattern after show tables to filter desire tablename
[ https://issues.apache.org/jira/browse/SPARK-7179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7179: --- Assignee: Apache Spark Add pattern after show tables to filter desire tablename -- Key: SPARK-7179 URL: https://issues.apache.org/jira/browse/SPARK-7179 Project: Spark Issue Type: New Feature Components: SQL Reporter: baishuo Assignee: Apache Spark Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8292) ShortestPaths run with error result
[ https://issues.apache.org/jira/browse/SPARK-8292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Chen updated SPARK-8292: -- Attachment: ShortestPaths.patch ShortestPaths run with error result --- Key: SPARK-8292 URL: https://issues.apache.org/jira/browse/SPARK-8292 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.3.1 Environment: Ubuntu 64bit Reporter: Bruce Chen Labels: patch Fix For: 1.3.1 Attachments: ShortestPaths.patch In graphx/lib/ShortestPaths, I run an example with input data: 0\t2 0\t4 2\t3 3\t6 4\t2 4\t5 5\t3 5\t6 Then I write a function that sets point '0' as the source point and calculates the shortest paths from point 0 to the other points; the code looks like this: val source: Seq[VertexId] = Seq(0) val ss = ShortestPaths.run(graph, source) Then I get the shortest-path result for every vertex: (4,Map()) (0,Map(0 -> 0)) (6,Map()) (3,Map()) (5,Map()) (2,Map()) but the right result should be: (4,Map(0 -> 1)) (0,Map(0 -> 0)) (6,Map(0 -> 3)) (3,Map(0 -> 2)) (5,Map(0 -> 2)) (2,Map(0 -> 1)) So I checked the source code of spark/graphx/src/main/scala/org/apache/spark/graphx/lib/ShortestPaths.scala and found a bug. The patch is attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8272) BigDecimal in parquet not working
[ https://issues.apache.org/jira/browse/SPARK-8272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580583#comment-14580583 ] Liang-Chi Hsieh commented on SPARK-8272: Can you provide more information? Such as example codes, your data schema, etc. BigDecimal in parquet not working - Key: SPARK-8272 URL: https://issues.apache.org/jira/browse/SPARK-8272 Project: Spark Issue Type: Bug Components: Spark Core, SQL Affects Versions: 1.3.1 Environment: Ubuntu 14.0 LTS Reporter: Bipin Roshan Nag Labels: sparksql When trying to save a DDF to parquet file I get the following errror: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 311, localhost): java.lang.ClassCastException: scala.runtime.BoxedUnit cannot be cast to org.apache.spark.sql.types.Decimal at org.apache.spark.sql.parquet.RowWriteSupport.writePrimitive(ParquetTableSupport.scala:220) at org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:192) at org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:171) at org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:134) at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120) at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81) at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37) at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$writeShard$1(newParquet.scala:671) at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:689) at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:689) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) I cannot save the dataframe. Please help. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8284) Regularized Extreme Learning Machine for MLlib
[ https://issues.apache.org/jira/browse/SPARK-8284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580602#comment-14580602 ] Rakesh Chalasani commented on SPARK-8284: - In my experience, the performance and the ways of building ELMs are not yet well enough understood to include this in MLlib at this time. So it is better to put this in Spark packages. See: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-MLlib-specificContributionGuidelines Regularized Extreme Learning Machine for MLlib --- Key: SPARK-8284 URL: https://issues.apache.org/jira/browse/SPARK-8284 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.1 Reporter: 李力 Extreme Learning Machines can achieve better generalization performance at a much faster learning speed for regression and classification problems, but the growing volume of datasets makes ELM regression on very large-scale datasets a challenging task. By analyzing the mechanism of the ELM algorithm, an efficient parallel ELM for regression is designed and implemented on top of Spark. The experimental results demonstrate that the proposed parallel ELM for regression can efficiently handle very large datasets with good performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7961) Redesign SQLConf for better error message reporting
[ https://issues.apache.org/jira/browse/SPARK-7961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7961: --- Assignee: Apache Spark Redesign SQLConf for better error message reporting --- Key: SPARK-7961 URL: https://issues.apache.org/jira/browse/SPARK-7961 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Apache Spark Priority: Critical Right now, we don't validate config values and as a result will throw exceptions when queries or DataFrame operations are run. Imagine if one user sets the config variable spark.sql.retainGroupColumns (which requires "true" or "false") to "hello". The set action itself will complete fine. When another user runs a query, it will throw the following exception: {code} java.lang.IllegalArgumentException: For input string: "hello" at scala.collection.immutable.StringLike$class.parseBoolean(StringLike.scala:238) at scala.collection.immutable.StringLike$class.toBoolean(StringLike.scala:226) at scala.collection.immutable.StringOps.toBoolean(StringOps.scala:31) at org.apache.spark.sql.SQLConf.dataFrameRetainGroupColumns(SQLConf.scala:265) at org.apache.spark.sql.GroupedData.toDF(GroupedData.scala:74) at org.apache.spark.sql.GroupedData.agg(GroupedData.scala:227) {code} This is highly confusing. We should redesign SQLConf to validate data input at set time (during the setConf call). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7961) Redesign SQLConf for better error message reporting
[ https://issues.apache.org/jira/browse/SPARK-7961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7961: --- Assignee: (was: Apache Spark) Redesign SQLConf for better error message reporting --- Key: SPARK-7961 URL: https://issues.apache.org/jira/browse/SPARK-7961 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Priority: Critical Right now, we don't validate config values and as a result will throw exceptions when queries or DataFrame operations are run. Imagine if one user sets the config variable spark.sql.retainGroupColumns (which requires "true" or "false") to "hello". The set action itself will complete fine. When another user runs a query, it will throw the following exception: {code} java.lang.IllegalArgumentException: For input string: "hello" at scala.collection.immutable.StringLike$class.parseBoolean(StringLike.scala:238) at scala.collection.immutable.StringLike$class.toBoolean(StringLike.scala:226) at scala.collection.immutable.StringOps.toBoolean(StringOps.scala:31) at org.apache.spark.sql.SQLConf.dataFrameRetainGroupColumns(SQLConf.scala:265) at org.apache.spark.sql.GroupedData.toDF(GroupedData.scala:74) at org.apache.spark.sql.GroupedData.agg(GroupedData.scala:227) {code} This is highly confusing. We should redesign SQLConf to validate data input at set time (during the setConf call). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
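To illustrate the proposed direction (a minimal sketch, not Spark's actual SQLConf design), a typed conf entry can validate its value inside setConf so the failure happens immediately and names the offending key:
{code}
// Hypothetical sketch of set-time validation.
final case class ConfEntry[T](key: String, parse: String => T)

object ValidatingConf {
  private def booleanEntry(key: String): ConfEntry[Boolean] =
    ConfEntry[Boolean](key, (s: String) => s.toLowerCase match {
      case "true"  => true
      case "false" => false
      case other   => throw new IllegalArgumentException(
        s"$key should be boolean, but was set to '$other'")
    })

  private val entries: Map[String, ConfEntry[Boolean]] = Map(
    "spark.sql.retainGroupColumns" -> booleanEntry("spark.sql.retainGroupColumns"))

  private val settings = scala.collection.mutable.Map.empty[String, Any]

  def setConf(key: String, value: String): Unit = {
    // Fail here, at set time, rather than when a later query reads the value.
    settings(key) = entries.get(key).map(_.parse(value)).getOrElse(value)
  }
}
{code}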
[jira] [Updated] (SPARK-8273) Driver hangs up when yarn shutdown in client mode
[ https://issues.apache.org/jira/browse/SPARK-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-8273: --- Affects Version/s: 1.4.0 1.3.1 Driver hangs up when yarn shutdown in client mode - Key: SPARK-8273 URL: https://issues.apache.org/jira/browse/SPARK-8273 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.3.1, 1.4.0 Reporter: Tao Wang In client mode, if YARN is shut down while a Spark application is running, the application will hang after several retries (default: 30) because the exception thrown by YarnClientImpl cannot be caught at the upper level; we should exit so that the user can be aware of it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8018) KMeans should accept initial cluster centers as param
[ https://issues.apache.org/jira/browse/SPARK-8018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8018: --- Assignee: Apache Spark (was: Meethu Mathew) KMeans should accept initial cluster centers as param - Key: SPARK-8018 URL: https://issues.apache.org/jira/browse/SPARK-8018 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Assignee: Apache Spark KMeans should allow model initialization using an existing set of cluster centers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8018) KMeans should accept initial cluster centers as param
[ https://issues.apache.org/jira/browse/SPARK-8018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580128#comment-14580128 ] Apache Spark commented on SPARK-8018: - User 'FlytxtRnD' has created a pull request for this issue: https://github.com/apache/spark/pull/6737 KMeans should accept initial cluster centers as param - Key: SPARK-8018 URL: https://issues.apache.org/jira/browse/SPARK-8018 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Assignee: Meethu Mathew KMeans should allow model initialization using an existing set of cluster centers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8018) KMeans should accept initial cluster centers as param
[ https://issues.apache.org/jira/browse/SPARK-8018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8018: --- Assignee: Meethu Mathew (was: Apache Spark) KMeans should accept initial cluster centers as param - Key: SPARK-8018 URL: https://issues.apache.org/jira/browse/SPARK-8018 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Assignee: Meethu Mathew KMeans should allow model initialization using an existing set of cluster centers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
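The proposed usage might look like this sketch ({{setInitialModel}} is the API the ticket asks for, shown here as proposed rather than existing; the centers and data are made up):
{code}
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors

// Seed k-means with known centers instead of random or k-means|| initialization.
val initialCenters = Array(Vectors.dense(0.0, 0.0), Vectors.dense(5.0, 5.0))
val kmeans = new KMeans()
  .setK(2)
  .setInitialModel(new KMeansModel(initialCenters))  // proposed by this ticket
// val model = kmeans.run(data)  // data: RDD[Vector]
{code}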
[jira] [Updated] (SPARK-8290) spark class command builder need read SPARK_JAVA_OPTS and SPARK_DRIVER_MEMORY properly
[ https://issues.apache.org/jira/browse/SPARK-8290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Wang updated SPARK-8290: Description: SPARK_JAVA_OPTS was missed in reconstructing the launcher part; we should add it back so spark-class can read it. The missing part is here: https://github.com/apache/spark/blob/1c30afdf94b27e1ad65df0735575306e65d148a1/bin/spark-class#L97. was: SPARK_JAVA_OPTS was missed in reconstructing the launcher part; we should add it back so spark-class can read it. The missing part is here. spark class command builder need read SPARK_JAVA_OPTS and SPARK_DRIVER_MEMORY properly -- Key: SPARK-8290 URL: https://issues.apache.org/jira/browse/SPARK-8290 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Tao Wang Priority: Minor SPARK_JAVA_OPTS was missed in reconstructing the launcher part; we should add it back so spark-class can read it. The missing part is here: https://github.com/apache/spark/blob/1c30afdf94b27e1ad65df0735575306e65d148a1/bin/spark-class#L97. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8212) math function: e
[ https://issues.apache.org/jira/browse/SPARK-8212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-8212. Resolution: Fixed Fix Version/s: 1.5.0 math function: e Key: SPARK-8212 URL: https://issues.apache.org/jira/browse/SPARK-8212 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Adrian Wang Fix For: 1.5.0 e(): double Returns the value of e. We should make this foldable so it gets folded by the optimizer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
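Usage, for illustration (a sketch assuming a SQLContext or HiveContext named {{sqlContext}}):
{code}
// e() takes no arguments and always returns the same constant,
// so marking it foldable lets the optimizer replace the call with a literal.
sqlContext.sql("SELECT e()").show()  // expected value: 2.718281828459045
{code}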
[jira] [Commented] (SPARK-8142) Spark Job Fails with ResultTask ClassCastException
[ https://issues.apache.org/jira/browse/SPARK-8142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580695#comment-14580695 ] Marcelo Vanzin commented on SPARK-8142: --- bq. If I mark my spark core as provided right now, as we speak , my code compiles but when I run my application in my IDE using Spark local I get: NoClassFoundError Not that this necessarily helps you, but that sounds like either a bug in your IDE or in how you've set it up to run your app. Because provided dependencies show up in test scope too; so I'd expect that when you're running something from an IDE it would behave similarly. re: SPARK-1867, that's a very generic problem that has multiple root causes. e.g. using a version of hbase libraries on the client side that is not compatible with the server side could lead to the same bug. Spark Job Fails with ResultTask ClassCastException -- Key: SPARK-8142 URL: https://issues.apache.org/jira/browse/SPARK-8142 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.1 Reporter: Dev Lakhani When running a Spark Job, I get no failures in the application code whatsoever but a weird ResultTask Class exception. In my job, I create a RDD from HBase and for each partition do a REST call on an API, using a REST client. This has worked in IntelliJ but when I deploy to a cluster using spark-submit.sh I get : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, host): java.lang.ClassCastException: org.apache.spark.scheduler.ResultTask cannot be cast to org.apache.spark.scheduler.Task at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:185) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) These are the configs I set to override the spark classpath because I want to use my own glassfish jersey version: sparkConf.set(spark.driver.userClassPathFirst,true); sparkConf.set(spark.executor.userClassPathFirst,true); I see no other warnings or errors in any of the logs. Unfortunately I cannot post my code, but please ask me questions that will help debug the issue. Using spark 1.3.1 hadoop 2.6. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8290) spark class command builder need read SPARK_JAVA_OPTS and SPARK_DRIVER_MEMORY properly
[ https://issues.apache.org/jira/browse/SPARK-8290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Wang updated SPARK-8290: Summary: spark class command builder need read SPARK_JAVA_OPTS and SPARK_DRIVER_MEMORY properly (was: spark class command builder need read SPARK_JAVA_OPTS) spark class command builder need read SPARK_JAVA_OPTS and SPARK_DRIVER_MEMORY properly -- Key: SPARK-8290 URL: https://issues.apache.org/jira/browse/SPARK-8290 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Tao Wang Priority: Minor SPARK_JAVA_OPTS was missed in reconstructing the launcher part; we should add it back so spark-class can read it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8290) spark class command builder need read SPARK_JAVA_OPTS and SPARK_DRIVER_MEMORY properly
[ https://issues.apache.org/jira/browse/SPARK-8290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Wang updated SPARK-8290: Description: SPARK_JAVA_OPTS was missed in reconstructing the launcher part; we should add it back so spark-class can read it. The missing part is here. was: SPARK_JAVA_OPTS was missed in reconstructing the launcher part; we should add it back so spark-class can read it. spark class command builder need read SPARK_JAVA_OPTS and SPARK_DRIVER_MEMORY properly -- Key: SPARK-8290 URL: https://issues.apache.org/jira/browse/SPARK-8290 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Tao Wang Priority: Minor SPARK_JAVA_OPTS was missed in reconstructing the launcher part; we should add it back so spark-class can read it. The missing part is here. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8142) Spark Job Fails with ResultTask ClassCastException
[ https://issues.apache.org/jira/browse/SPARK-8142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580689#comment-14580689 ] Dev Lakhani commented on SPARK-8142: To clarify [~srowen] 1) I meant the other way around, if we choose to use Apache Spark, which provides Apache Hadoop libs and we then choose a Cloudera Hadoop distribution on our (the rest of our) cluster and use Cloudera Hadoop clients in the application code. Spark will provide Apache Hadoop libs whereas our cluster will be cdh5. Is there any issue in doing this? We choose to use Apache Spark because the CDH is a version behind the official Spark release and we don't want to wait for say Dataframes support. 2) If I mark my spark core as provided right now, as we speak , my code compiles but when I run my application in my IDE using Spark local I get: NoClassFoundError org/apache/spark/api/java/function/Function this is why I am suggesting whether we need maven profiles, one for local testing and one for deployment? So getting back to the issue raised in this JIRA, which we seem to be ignoring, even when Hadoop and Spark is provided and Hbase client/protocol/server is packaged we run into SPARK-1867 which at latest comment suggests a dependency is missing and this results in the obscure exception. Whether this is on the Hadoop side or Spark side is not known but as the JIRA suggests it was caused by a missing dependency. I cannot see this missing class/dependency exception anywhere in the spark logs. This suggests that if anyone using Spark sets any of the userClasspath* misses out a primary, secondary or tertiary dependency they will encounter SPARK-1867. Therefore we are stuck, any suggestions are welcome to overcome this. Either there is a need make ChildFirstURLClassLoader ignore Spark and Hadoop libs or help spark log what's causing SPARK-1867. Spark Job Fails with ResultTask ClassCastException -- Key: SPARK-8142 URL: https://issues.apache.org/jira/browse/SPARK-8142 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.1 Reporter: Dev Lakhani When running a Spark Job, I get no failures in the application code whatsoever but a weird ResultTask Class exception. In my job, I create a RDD from HBase and for each partition do a REST call on an API, using a REST client. This has worked in IntelliJ but when I deploy to a cluster using spark-submit.sh I get : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, host): java.lang.ClassCastException: org.apache.spark.scheduler.ResultTask cannot be cast to org.apache.spark.scheduler.Task at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:185) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) These are the configs I set to override the spark classpath because I want to use my own glassfish jersey version: sparkConf.set(spark.driver.userClassPathFirst,true); sparkConf.set(spark.executor.userClassPathFirst,true); I see no other warnings or errors in any of the logs. Unfortunately I cannot post my code, but please ask me questions that will help debug the issue. Using spark 1.3.1 hadoop 2.6. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
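The provided-scope problem described above (the app compiles, but local runs hit NoClassDefFoundError) is commonly handled at the build level. A minimal sketch in sbt, assuming sbt is an acceptable stand-in for the Maven profiles being discussed; the coordinates and version match this thread, the rest is illustrative:

{code}
// build.sbt sketch: spark-core stays provided for cluster deploys...
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.1" % "provided"

// ...but `sbt run` puts provided deps back on the classpath, so local/IDE runs
// don't fail with NoClassDefFoundError for Spark classes.
Compile / run := Defaults.runTask(
  Compile / fullClasspath,
  Compile / run / mainClass,
  Compile / run / runner
).evaluated
{code}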
[jira] [Commented] (SPARK-8142) Spark Job Fails with ResultTask ClassCastException
[ https://issues.apache.org/jira/browse/SPARK-8142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580719#comment-14580719 ] Sean Owen commented on SPARK-8142: -- (FWIW CDH ships the current latest release. You can always run your own build on CDH when you like, as it's just a YARN-based app. You can also use anything in Spark, like DataFrames, in the provided build. It's not different.) In this scenario Spark doesn't provide anything Hadoop related; it's just Spark. Spark uses Hadoop code; CDH is Hadoop. I don't see what problem you're expecting. You can't run your app outside of Spark if it's provided, since the Spark deployment model is to run an app using spark-submit. Yes, I can see why your test-scope dependencies would be, well, test scope and not provided for Spark, since you want to run everything in one JVM. That's easy though. Spark Job Fails with ResultTask ClassCastException -- Key: SPARK-8142 URL: https://issues.apache.org/jira/browse/SPARK-8142 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.1 Reporter: Dev Lakhani When running a Spark job, I get no failures in the application code whatsoever but a weird ResultTask ClassCastException. In my job, I create an RDD from HBase and for each partition do a REST call on an API, using a REST client. This has worked in IntelliJ but when I deploy to a cluster using spark-submit.sh I get: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, host): java.lang.ClassCastException: org.apache.spark.scheduler.ResultTask cannot be cast to org.apache.spark.scheduler.Task at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:185) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) These are the configs I set to override the Spark classpath because I want to use my own Glassfish Jersey version: sparkConf.set("spark.driver.userClassPathFirst", "true"); sparkConf.set("spark.executor.userClassPathFirst", "true"); I see no other warnings or errors in any of the logs. Unfortunately I cannot post my code, but please ask me questions that will help debug the issue. Using Spark 1.3.1, Hadoop 2.6. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8142) Spark Job Fails with ResultTask ClassCastException
[ https://issues.apache.org/jira/browse/SPARK-8142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580670#comment-14580670 ] Marcelo Vanzin commented on SPARK-8142: --- So, using {{userClassPathFirst}} is tricky exactly because of these issues. You have to be super careful when you have classes in your app that cross the class loader boundaries. Two general comments: - if you want to use the Glassfish Jersey version, you shouldn't need to do this, right? Spark depends on the old one that is under com.sun.*, IIRC. - marking all dependencies (including HBase) as provided and using {{spark.{driver,executor}.extraClassPath}} might be the easiest way out if you really need to use {{userClassPathFirst}}. Basically the class cast exceptions you're getting are because you have the same class in both Spark's class loader and your app's class loader, and those classes need to cross that boundary. So if you make sure in your app's build that these conflicts do not occur, then using {{userClassPathFirst}} should work. Sorry I don't have a better suggestion; it's just not that trivial of a problem. :-/ We could make the child-first class loader configurable, so that you can set a sort of blacklist of packages where it should look at the parent first, but that would still require people to fiddle with configurations to make things work. Spark Job Fails with ResultTask ClassCastException -- Key: SPARK-8142 URL: https://issues.apache.org/jira/browse/SPARK-8142 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.1 Reporter: Dev Lakhani When running a Spark job, I get no failures in the application code whatsoever but a weird ResultTask ClassCastException. In my job, I create an RDD from HBase and for each partition do a REST call on an API, using a REST client. This has worked in IntelliJ but when I deploy to a cluster using spark-submit.sh I get: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, host): java.lang.ClassCastException: org.apache.spark.scheduler.ResultTask cannot be cast to org.apache.spark.scheduler.Task at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:185) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) These are the configs I set to override the Spark classpath because I want to use my own Glassfish Jersey version: sparkConf.set("spark.driver.userClassPathFirst", "true"); sparkConf.set("spark.executor.userClassPathFirst", "true"); I see no other warnings or errors in any of the logs. Unfortunately I cannot post my code, but please ask me questions that will help debug the issue. Using Spark 1.3.1, Hadoop 2.6. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
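To make the proposed "blacklist of packages" concrete: a minimal sketch of a child-first class loader that still delegates selected prefixes to the parent, so scheduler classes like ResultTask never cross loader boundaries. The class name, parameter names, and prefix list are illustrative, not Spark's actual ChildFirstURLClassLoader:

{code}
import java.net.{URL, URLClassLoader}

// Child-first loader with a parent-first prefix list (illustrative sketch).
class BlacklistingChildFirstLoader(
    urls: Array[URL],
    parent: ClassLoader,
    parentFirstPrefixes: Seq[String] =
      Seq("java.", "scala.", "org.apache.spark.", "org.apache.hadoop."))
  extends URLClassLoader(urls, parent) {

  override protected def loadClass(name: String, resolve: Boolean): Class[_] = {
    val c =
      if (parentFirstPrefixes.exists(p => name.startsWith(p))) {
        // Spark/Hadoop/JDK classes always come from the parent loader, so a
        // ResultTask loaded here is the same class the executor expects.
        super.loadClass(name, resolve)
      } else {
        // Everything else is child-first: user jars win over Spark's copies.
        Option(findLoadedClass(name)).getOrElse {
          try findClass(name)
          catch { case _: ClassNotFoundException => super.loadClass(name, resolve) }
        }
      }
    if (resolve) resolveClass(c)
    c
  }
}
{code}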
[jira] [Commented] (SPARK-5575) Artificial neural networks for MLlib deep learning
[ https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580661#comment-14580661 ] Janani Mukundan commented on SPARK-5575: Hi Alexander, I forked your latest version of https://github.com/avulanov/spark/tree/ann-interface-gemm. I would like to contribute to MLlib by adding an implementation of a DBN. I have a Scala implementation working right now. I am going to try to merge it with your ANN models. Thanks Janani Artificial neural networks for MLlib deep learning -- Key: SPARK-5575 URL: https://issues.apache.org/jira/browse/SPARK-5575 Project: Spark Issue Type: Umbrella Components: MLlib Affects Versions: 1.2.0 Reporter: Alexander Ulanov Goal: Implement various types of artificial neural networks Motivation: deep learning trend Requirements: 1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward and Backpropagation etc. should be implemented as traits or interfaces, so they can be easily extended or reused 2) Implement complex abstractions, such as feed forward and recurrent networks 3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), autoencoder (sparse and denoising), stacked autoencoder, restricted Boltzmann machines (RBM), deep belief networks (DBN) etc. 4) Implement or reuse supporting constructs, such as classifiers, normalizers, poolers, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
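Requirement 1 in miniature: if a layer is a small trait, an MLP, autoencoder, or DBN can be composed from the same pieces. A toy sketch only, not the ann-interface-gemm API:

{code}
// Toy sketch of composable layer abstractions; purely illustrative.
trait Layer {
  def forward(x: Array[Double]): Array[Double]
  def backward(x: Array[Double], gradOut: Array[Double]): Array[Double]
}

// A network is just a stack of layers; forward propagation folds over them.
class FeedForward(layers: Seq[Layer]) {
  def predict(x: Array[Double]): Array[Double] =
    layers.foldLeft(x)((acc, layer) => layer.forward(acc))
}
{code}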
[jira] [Commented] (SPARK-8142) Spark Job Fails with ResultTask ClassCastException
[ https://issues.apache.org/jira/browse/SPARK-8142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580730#comment-14580730 ] Dev Lakhani commented on SPARK-8142: Hi [~vanzin] bq. if you want to use the glassfish jersey version, you shouldn't need to do this, right? Spark depends on the old one that is under com.sun.*, IIRC. Yes, I need to make use of Glassfish 2.x in my application and not the sun.* one provided, but this could apply to any other dependency that needs to supersede Spark's provided one, etc. bq. marking all dependencies (including hbase) as provided and using {{spark. {driver,executor}.extraClassPath}} might be the easiest way out if you really need to use userClassPathFirst. This is an option but might be a challenge to scale if we have different folder layouts for the extraClassPath across clusters/nodes for HBase and Hadoop installs. This can be (and usually is) the case when new servers are added to existing ones, for example. If one node had /disk4/path/to/hbase/libs and another has /disk3/another/path/to/hbase/libs and so on, then the extraClassPath will need to include both of these and grow significantly, and the spark-submit args along with it. Also, when we update HBase, we then have to change this classpath each time. Maybe the ideal way is to have, as you suggest, a blacklist which would contain Spark and Hadoop libs. Then we could put whatever we wanted into one uber/fat jar, and it wouldn't matter where HBase and Hadoop are installed or what's provided and compiled; we let Spark work it out. These are just my thoughts; I'm sure others will have different preferences and/or better approaches. Thanks anyway for your input on this JIRA. Spark Job Fails with ResultTask ClassCastException -- Key: SPARK-8142 URL: https://issues.apache.org/jira/browse/SPARK-8142 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.1 Reporter: Dev Lakhani When running a Spark job, I get no failures in the application code whatsoever but a weird ResultTask ClassCastException. In my job, I create an RDD from HBase and for each partition do a REST call on an API, using a REST client. This has worked in IntelliJ but when I deploy to a cluster using spark-submit.sh I get: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, host): java.lang.ClassCastException: org.apache.spark.scheduler.ResultTask cannot be cast to org.apache.spark.scheduler.Task at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:185) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) These are the configs I set to override the Spark classpath because I want to use my own Glassfish Jersey version: sparkConf.set("spark.driver.userClassPathFirst", "true"); sparkConf.set("spark.executor.userClassPathFirst", "true"); I see no other warnings or errors in any of the logs. Unfortunately I cannot post my code, but please ask me questions that will help debug the issue. Using Spark 1.3.1, Hadoop 2.6. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8212) math function: e
[ https://issues.apache.org/jira/browse/SPARK-8212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-8212: --- Labels: missing-python (was: ) math function: e Key: SPARK-8212 URL: https://issues.apache.org/jira/browse/SPARK-8212 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Adrian Wang Labels: missing-python Fix For: 1.5.0 e(): double Returns the value of e. We should make this foldable so it gets folded by the optimizer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
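Why foldable matters here, as a self-contained toy that mirrors the idea of constant folding; this is not Spark's actual Catalyst Expression API:

{code}
// Toy Catalyst-like interface: a foldable expression is constant-valued, so
// the optimizer can replace it with a literal once instead of evaluating it per row.
sealed trait Expr { def foldable: Boolean; def eval(): Any }
case class Literal(value: Any) extends Expr { val foldable = true; def eval(): Any = value }
case object E  extends Expr { val foldable = true; def eval(): Any = math.E }  // e(): double
case object Pi extends Expr { val foldable = true; def eval(): Any = math.Pi } // pi(): double

// The constant-folding rule, in spirit:
def constantFold(expr: Expr): Expr = if (expr.foldable) Literal(expr.eval()) else expr
// constantFold(E) == Literal(2.718281828459045)
{code}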
[jira] [Commented] (SPARK-6284) Support framework authentication and role in Mesos framework
[ https://issues.apache.org/jira/browse/SPARK-6284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580780#comment-14580780 ] DarinJ commented on SPARK-6284: --- +1 we're currently patching spark with similar code to make it play nice on our cluster. Would be great to have the ability out of the box. Support framework authentication and role in Mesos framework Key: SPARK-6284 URL: https://issues.apache.org/jira/browse/SPARK-6284 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Timothy Chen Support framework authentication and role in both Coarse grain and fine grain mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8215) math function: pi
[ https://issues.apache.org/jira/browse/SPARK-8215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-8215. Resolution: Fixed Fix Version/s: 1.5.0 math function: pi - Key: SPARK-8215 URL: https://issues.apache.org/jira/browse/SPARK-8215 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Adrian Wang Fix For: 1.5.0 pi(): double Returns the value of pi. We should make sure foldable = true so it gets folded by the optimizer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8215) math function: pi
[ https://issues.apache.org/jira/browse/SPARK-8215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-8215: --- Labels: missing-python (was: ) math function: pi - Key: SPARK-8215 URL: https://issues.apache.org/jira/browse/SPARK-8215 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Adrian Wang Labels: missing-python Fix For: 1.5.0 pi(): double Returns the value of pi. We should make sure foldable = true so it gets folded by the optimizer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8295) SparkContext shut down in Spark Project Streaming test suite
[ https://issues.apache.org/jira/browse/SPARK-8295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8295: - Priority: Minor (was: Blocker) Definitely not a blocker, but worth tracking down if it can be reproduced elsewhere. The thing is, these tests appear to be passing elsewhere right now on Jenkins, including for Hadoop 2.4. SparkContext shut down in Spark Project Streaming test suite Key: SPARK-8295 URL: https://issues.apache.org/jira/browse/SPARK-8295 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 1.3.1 Environment: * Ubuntu 12.04.05 LTS (GNU/Linux 3.13.0-53-generic x86_64) running on VM * java version 1.8.0_45 * Spark 1.3.1, branch 1.3 Reporter: Francois-Xavier Lemire Priority: Minor *Command to build Spark:* build/mvn -Pyarn -Phadoop-2.4 -Pspark-ganglia-lgpl -Dhadoop.version=2.4.0 -DskipTests clean package *Command to test Spark:* build/mvn -Pyarn -Phadoop-2.4 -Pspark-ganglia-lgpl -Dhadoop.version=2.4.0 test *Error:* Tests fail at 'Spark Project Streaming'. *Log:* [...] - awaitTermination - awaitTermination after stop - awaitTermination with error in task - awaitTermination with error in job generation - awaitTerminationOrTimeout Exception in thread "Thread-1053" org.apache.spark.SparkException: Job cancelled because SparkContext was shut down at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:699) at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:698) at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) at org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:698) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:1411) at org.apache.spark.util.EventLoop.stop(EventLoop.scala:84) at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1346) at org.apache.spark.SparkContext.stop(SparkContext.scala:1398) at org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:580) at org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:555) at org.apache.spark.streaming.testPackage.package$.test(StreamingContextSuite.scala:437) at org.apache.spark.streaming.StreamingContextSuite$$anonfun$25.apply$mcV$sp(StreamingContextSuite.scala:332) at org.apache.spark.streaming.StreamingContextSuite$$anonfun$25.apply(StreamingContextSuite.scala:332) at org.apache.spark.streaming.StreamingContextSuite$$anonfun$25.apply(StreamingContextSuite.scala:332) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:42) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.apache.spark.streaming.StreamingContextSuite.org$scalatest$BeforeAndAfter$$super$runTest(StreamingContextSuite.scala:35) at
org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200) at org.apache.spark.streaming.StreamingContextSuite.runTest(StreamingContextSuite.scala:35) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) at scala.collection.immutable.List.foreach(List.scala:318) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) at
[jira] [Created] (SPARK-8297) Scheduler backend is not notified in case node fails in YARN
Mridul Muralidharan created SPARK-8297: -- Summary: Scheduler backend is not notified in case node fails in YARN Key: SPARK-8297 URL: https://issues.apache.org/jira/browse/SPARK-8297 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.3.1, 1.2.2, 1.4.1 Environment: Spark on yarn - both client and cluster mode. Reporter: Mridul Muralidharan Assignee: Mridul Muralidharan Priority: Critical When a node crashes, yarn detects the failure and notifies spark - but this information is not propagated to scheduler backend (unlike in mesos mode, for example). It results in repeated re-execution of stages (due to FetchFailedException on shuffle side), resulting finally in application failure. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-8282) Make number of threads used in RBackend configurable
[ https://issues.apache.org/jira/browse/SPARK-8282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-8282. Resolution: Fixed Fix Version/s: 1.5.0 Assignee: Hossein Falaki Target Version/s: 1.4.1, 1.5.0 (was: 1.4.1) Make number of threads used in RBackend configurable Key: SPARK-8282 URL: https://issues.apache.org/jira/browse/SPARK-8282 Project: Spark Issue Type: Improvement Components: SparkR Affects Versions: 1.4.0 Reporter: Hossein Falaki Assignee: Hossein Falaki Fix For: 1.4.1, 1.5.0 RBackend starts a netty server which uses two threads. The number of threads is hardcoded. It is useful to have it configurable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
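A minimal sketch of the requested change; the config key name below is an assumption for illustration, and 2 is the previously hardcoded value:

{code}
import io.netty.channel.nio.NioEventLoopGroup
import org.apache.spark.SparkConf

// Read the netty thread count from SparkConf instead of hardcoding it.
// ("spark.r.numRBackendThreads" is an assumed key name for this sketch.)
val conf = new SparkConf()
val numRBackendThreads = conf.getInt("spark.r.numRBackendThreads", 2)
val eventLoopGroup = new NioEventLoopGroup(numRBackendThreads)
{code}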
[jira] [Created] (SPARK-8296) Not able to load Dataframe using Python throws py4j.protocol.Py4JJavaError
ABHISHEK CHOUDHARY created SPARK-8296: - Summary: Not able to load Dataframe using Python throws py4j.protocol.Py4JJavaError Key: SPARK-8296 URL: https://issues.apache.org/jira/browse/SPARK-8296 Project: Spark Issue Type: Bug Affects Versions: 1.3.1 Reporter: ABHISHEK CHOUDHARY While trying to load a json file using SQLContext in the prebuilt spark-1.3.1-bin-hadoop2.4 version, it throws py4j.protocol.Py4JJavaError from pyspark.sql import SQLContext from pyspark import SparkContext sc = SparkContext() sqlContext = SQLContext(sc) # Create the DataFrame df = sqlContext.jsonFile("changes.json") # Show the content of the DataFrame df.show() Error thrown - File "/Users/abhishekchoudhary/Work/python/evolveML/kaggle/avirto/test.py", line 11, in <module> df = sqlContext.jsonFile("changes.json") File "/Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/pyspark/sql/context.py", line 377, in jsonFile df = self._ssql_ctx.jsonFile(path, samplingRatio) File "/Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__ File "/Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value py4j.protocol.Py4JJavaError -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8297) Scheduler backend is not notified in case node fails in YARN
[ https://issues.apache.org/jira/browse/SPARK-8297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan updated SPARK-8297: --- Assignee: (was: Mridul Muralidharan) Scheduler backend is not notified in case node fails in YARN Key: SPARK-8297 URL: https://issues.apache.org/jira/browse/SPARK-8297 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.2, 1.3.1, 1.4.1 Environment: Spark on yarn - both client and cluster mode. Reporter: Mridul Muralidharan Priority: Critical When a node crashes, yarn detects the failure and notifies spark - but this information is not propagated to scheduler backend (unlike in mesos mode, for example). It results in repeated re-execution of stages (due to FetchFailedException on shuffle side), resulting finally in application failure. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8064) Upgrade Hive to 1.2
[ https://issues.apache.org/jira/browse/SPARK-8064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580886#comment-14580886 ] Apache Spark commented on SPARK-8064: - User 'steveloughran' has created a pull request for this issue: https://github.com/apache/spark/pull/6748 Upgrade Hive to 1.2 --- Key: SPARK-8064 URL: https://issues.apache.org/jira/browse/SPARK-8064 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Steve Loughran Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6419) GenerateOrdering does not support BinaryType and complex types.
[ https://issues.apache.org/jira/browse/SPARK-6419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580900#comment-14580900 ] Davies Liu commented on SPARK-6419: --- Fixed by SPARK-7956 GenerateOrdering does not support BinaryType and complex types. --- Key: SPARK-6419 URL: https://issues.apache.org/jira/browse/SPARK-6419 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Assignee: Davies Liu When a user wants to order by binary columns or columns with complex types and code gen is enabled, there will be a MatchError ([see here|https://github.com/apache/spark/blob/v1.3.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateOrdering.scala#L45]). We can either add support for these types or have a function to check if we can safely call GenerateOrdering (like canBeCodeGened for the HashAggregation strategy). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
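Supporting BinaryType in GenerateOrdering amounts to generating a lexicographic, unsigned byte-by-byte comparison, since binary columns are Array[Byte]. A hand-written sketch of that comparison, for reference:

{code}
// Lexicographic comparison of byte arrays: unsigned per-byte compare, with the
// shorter array sorting first on a common prefix.
def compareBinary(a: Array[Byte], b: Array[Byte]): Int = {
  val minLen = math.min(a.length, b.length)
  var i = 0
  while (i < minLen) {
    val cmp = (a(i) & 0xff) - (b(i) & 0xff)
    if (cmp != 0) return cmp
    i += 1
  }
  a.length - b.length
}
{code}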
[jira] [Resolved] (SPARK-7996) Deprecate the developer api SparkEnv.actorSystem
[ https://issues.apache.org/jira/browse/SPARK-7996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-7996. Resolution: Fixed Fix Version/s: 1.5.0 Assignee: Ilya Ganelin Deprecate the developer api SparkEnv.actorSystem Key: SPARK-7996 URL: https://issues.apache.org/jira/browse/SPARK-7996 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Shixiong Zhu Assignee: Ilya Ganelin Fix For: 1.5.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7996) Deprecate the developer api SparkEnv.actorSystem
[ https://issues.apache.org/jira/browse/SPARK-7996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580912#comment-14580912 ] Reynold Xin commented on SPARK-7996: [~ilganeli] that one is for later versions. Deprecate the developer api SparkEnv.actorSystem Key: SPARK-7996 URL: https://issues.apache.org/jira/browse/SPARK-7996 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Shixiong Zhu Assignee: Ilya Ganelin Fix For: 1.5.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8103) DAGScheduler should not launch multiple concurrent attempts for one stage on fetch failures
[ https://issues.apache.org/jira/browse/SPARK-8103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8103: --- Assignee: Apache Spark (was: Imran Rashid) DAGScheduler should not launch multiple concurrent attempts for one stage on fetch failures --- Key: SPARK-8103 URL: https://issues.apache.org/jira/browse/SPARK-8103 Project: Spark Issue Type: Bug Components: Scheduler, Spark Core Affects Versions: 1.4.0 Reporter: Imran Rashid Assignee: Apache Spark When there is a fetch failure, {{DAGScheduler}} is supposed to fail the stage, retry the necessary portions of the preceding shuffle stage which generated the shuffle data, and eventually rerun the stage. We generally expect to get multiple fetch failures together, but only want to re-start the stage once. The code already makes an attempt to address this https://github.com/apache/spark/blob/10ba1880878d0babcdc5c9b688df5458ea131531/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1108 . {code} // It is likely that we receive multiple FetchFailed for a single stage (because we have // multiple tasks running concurrently on different executors). In that case, it is possible // the fetch failure has already been handled by the scheduler. if (runningStages.contains(failedStage)) { {code} However, this logic is flawed because the stage may have been **resubmitted** by the time we get these fetch failures. In that case, {{runningStages.contains(failedStage)}} will be true, but we've already handled these failures. This results in multiple concurrent non-zombie attempts for one stage. In addition to being very confusing, and a waste of resources, this also can lead to later stages being submitted before the previous stage has registered its map output. This happens because (a) when one attempt finishes all its tasks, it may not register its map output because the stage still has pending tasks, from other attempts https://github.com/apache/spark/blob/10ba1880878d0babcdc5c9b688df5458ea131531/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1046 {code} if (runningStages.contains(shuffleStage) && shuffleStage.pendingTasks.isEmpty) { {code} and (b) {{submitStage}} thinks the following stage is ready to go, because {{getMissingParentStages}} thinks the stage is complete as long as it has all of its map outputs: https://github.com/apache/spark/blob/10ba1880878d0babcdc5c9b688df5458ea131531/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L397 {code} if (!mapStage.isAvailable) { missing += mapStage } {code} So the following stage is submitted repeatedly, but it is doomed to fail because its shuffle output has never been registered with the map output tracker. Here's an example failure in this case: {noformat} WARN TaskSetManager: Lost task 5.0 in stage 3.2 (TID 294, 192.168.1.104): FetchFailed(null, shuffleId=0, mapId=-1, reduceId=5, message= org.apache.spark.shuffle.MetadataFetchFailedException: Missing output locations for shuffle ... {noformat} Note that this is a subset of the problems originally described in SPARK-7308, limited to just the issues affecting the DAGScheduler -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
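The flaw described above is that "stage is running" cannot distinguish the original attempt from a resubmission. A sketch of the stronger guard the description argues for, with illustrative names rather than the actual DAGScheduler fields:

{code}
// Track which attempt a FetchFailed came from, and ignore failures from
// attempts older than the stage's current one; runningStages alone is fooled
// once the stage has been resubmitted.
case class FetchFailed(stageId: Int, stageAttemptId: Int)

def shouldHandleFailure(
    event: FetchFailed,
    currentAttemptId: Int,
    isRunning: Boolean): Boolean = {
  isRunning && event.stageAttemptId == currentAttemptId
}
{code}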
[jira] [Closed] (SPARK-7261) Change default log level to WARN in the REPL
[ https://issues.apache.org/jira/browse/SPARK-7261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-7261. Resolution: Fixed Fix Version/s: 1.5.0 Change default log level to WARN in the REPL Key: SPARK-7261 URL: https://issues.apache.org/jira/browse/SPARK-7261 Project: Spark Issue Type: Improvement Components: Spark Shell Reporter: Patrick Wendell Assignee: Shixiong Zhu Priority: Blocker Labels: starter Fix For: 1.5.0 We should add a log4j properties file for the repl (log4j-defaults-repl.properties) with log level WARN. The main reason for doing this is that we now display nice progress bars in the REPL, so the need for task-level INFO messages is much less. The best way to accomplish this is the following: 1. Add a second logging defaults file called log4j-defaults-repl.properties that has log level WARN. https://github.com/apache/spark/blob/branch-1.4/core/src/main/resources/org/apache/spark/log4j-defaults.properties 2. When logging is initialized, check whether you are inside the REPL. If so, then use that one: https://github.com/apache/spark/blob/branch-1.4/core/src/main/scala/org/apache/spark/Logging.scala#L124 3. The printed message should say something like: Using Spark's repl log4j profile: org/apache/spark/log4j-defaults-repl.properties To adjust logging level use sc.setLogLevel("INFO") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
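Steps 1-3 sketched together; the REPL-detection predicate below is a placeholder assumption, while the resource names come from the description:

{code}
import org.apache.log4j.PropertyConfigurator

// Pick the repl-specific defaults file when initializing logging in the REPL.
val inRepl = sys.props.contains("spark.repl.class.uri") // placeholder check, illustrative only
val defaults =
  if (inRepl) "org/apache/spark/log4j-defaults-repl.properties"
  else "org/apache/spark/log4j-defaults.properties"
Option(getClass.getClassLoader.getResource(defaults)).foreach { url =>
  PropertyConfigurator.configure(url)
  println(s"Using Spark's ${if (inRepl) "repl " else ""}log4j profile: $defaults")
  if (inRepl) println("To adjust logging level use sc.setLogLevel(\"INFO\")")
}
{code}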
[jira] [Closed] (SPARK-7527) Wrong detection of REPL mode in ClosureCleaner
[ https://issues.apache.org/jira/browse/SPARK-7527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-7527. Resolution: Fixed Fix Version/s: 1.5.0 Assignee: Shixiong Zhu Target Version/s: 1.5.0 Wrong detection of REPL mode in ClosureCleaner -- Key: SPARK-7527 URL: https://issues.apache.org/jira/browse/SPARK-7527 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.1 Reporter: Oleksii Kostyliev Assignee: Shixiong Zhu Priority: Minor Fix For: 1.5.0 If the REPL class is not present on the classpath, the {{inInterpreter}} boolean switch shall be {{false}}, not {{true}}, at: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/ClosureCleaner.scala#L247 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-5162) Python yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-5162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-5162. Resolution: Fixed Fix Version/s: 1.4.0 Assignee: Lianhui Wang Target Version/s: 1.4.0 Python yarn-cluster mode Key: SPARK-5162 URL: https://issues.apache.org/jira/browse/SPARK-5162 Project: Spark Issue Type: New Feature Components: PySpark, YARN Reporter: Dana Klassen Assignee: Lianhui Wang Labels: cluster, python, yarn Fix For: 1.4.0 Running pyspark in yarn is currently limited to 'yarn-client' mode. It would be great to be able to submit python applications to the cluster and (just like java classes) have the resource manager set up an AM on any node in the cluster. Does anyone know the issues blocking this feature? I was snooping around with enabling python apps: Removing the logic stopping python and yarn-cluster from SparkSubmit.scala ... // The following modes are not supported or applicable (clusterManager, deployMode) match { ... case (_, CLUSTER) if args.isPython => printErrorAndExit("Cluster deploy mode is currently not supported for python applications.") ... } ... and submitting the application via: HADOOP_CONF_DIR={{insert conf dir}} ./bin/spark-submit --master yarn-cluster --num-executors 2 --py-files {{insert location of egg here}} --executor-cores 1 ../tools/canary.py Everything looks to run alright, pythonRunner is picked up as main class, resources get set up, yarn client gets launched but falls flat on its face: 2015-01-08 18:48:03,444 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: DEBUG: FAILED { {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py, 1420742868009, FILE, null }, Resource {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py changed on src filesystem (expected 1420742868009, was 1420742869284 and 2015-01-08 18:48:03,446 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py(-/data/4/yarn/nm/usercache/klassen/filecache/11/canary.py) transitioned from DOWNLOADING to FAILED Tracked this down to the Apache Hadoop code (FSDownload.java, line 249) related to container localization of files upon downloading. At this point I thought it would be best to raise the issue here and get input. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-5479) PySpark on yarn mode need to support non-local python files
[ https://issues.apache.org/jira/browse/SPARK-5479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-5479. Resolution: Fixed Fix Version/s: 1.5.0 Assignee: Marcelo Vanzin Target Version/s: 1.5.0 PySpark on yarn mode need to support non-local python files --- Key: SPARK-5479 URL: https://issues.apache.org/jira/browse/SPARK-5479 Project: Spark Issue Type: Bug Components: PySpark, YARN Affects Versions: 1.4.0 Reporter: Lianhui Wang Assignee: Marcelo Vanzin Fix For: 1.5.0 In SPARK-5162 [~vgrigor] reports this: Now the following code cannot work: aws emr add-steps --cluster-id j-XYWIXMD234 \ --steps Name=SparkPi,Jar=s3://eu-west-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[/home/hadoop/spark/bin/spark-submit,--deploy-mode,cluster,--master,yarn-cluster,--py-files,s3://mybucketat.amazonaws.com/tasks/main.py,main.py,param1],ActionOnFailure=CONTINUE So we need to support non-local python files in yarn client and cluster mode. Before submitting the application to YARN, we need to download non-local files to a local or HDFS path, or spark.yarn.dist.files needs to support other non-local files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-8290) spark class command builder need read SPARK_JAVA_OPTS and SPARK_DRIVER_MEMORY properly
[ https://issues.apache.org/jira/browse/SPARK-8290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-8290. Resolution: Fixed Fix Version/s: 1.5.0 Assignee: Tao Wang Target Version/s: 1.5.0 spark class command builder need read SPARK_JAVA_OPTS and SPARK_DRIVER_MEMORY properly -- Key: SPARK-8290 URL: https://issues.apache.org/jira/browse/SPARK-8290 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Tao Wang Assignee: Tao Wang Priority: Minor Fix For: 1.5.0 SPARK_JAVA_OPTS was missed in reconstructing the launcher part; we should add it back so spark-class can read it. The missing part is here: https://github.com/apache/spark/blob/1c30afdf94b27e1ad65df0735575306e65d148a1/bin/spark-class#L97. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8290) spark class command builder need read SPARK_JAVA_OPTS and SPARK_DRIVER_MEMORY properly
[ https://issues.apache.org/jira/browse/SPARK-8290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-8290: - Affects Version/s: 1.3.0 spark class command builder need read SPARK_JAVA_OPTS and SPARK_DRIVER_MEMORY properly -- Key: SPARK-8290 URL: https://issues.apache.org/jira/browse/SPARK-8290 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Tao Wang Priority: Minor Fix For: 1.5.0 SPARK_JAVA_OPTS was missed in reconstructing the launcher part; we should add it back so spark-class can read it. The missing part is here: https://github.com/apache/spark/blob/1c30afdf94b27e1ad65df0735575306e65d148a1/bin/spark-class#L97. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8296) Not able to load Dataframe using Python throws py4j.protocol.Py4JJavaError
[ https://issues.apache.org/jira/browse/SPARK-8296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ABHISHEK CHOUDHARY updated SPARK-8296: -- Description: While trying to load a json file using SQLContext in the prebuilt spark-1.3.1-bin-hadoop2.4 version, it throws py4j.protocol.Py4JJavaError from pyspark.sql import SQLContext from pyspark import SparkContext sc = SparkContext() sqlContext = SQLContext(sc) # Create the DataFrame df = sqlContext.jsonFile("changes.json") # Show the content of the DataFrame df.show() Error thrown - File "/Users/abhishekchoudhary/Work/python/evolveML/kaggle/avirto/test.py", line 11, in <module> df = sqlContext.jsonFile("changes.json") File "/Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/pyspark/sql/context.py", line 377, in jsonFile df = self._ssql_ctx.jsonFile(path, samplingRatio) File "/Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__ File "/Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value py4j.protocol.Py4JJavaError On checking through the source code, I found that 'gateway_client' is not valid. was: While trying to load a json file using SQLContext in the prebuilt spark-1.3.1-bin-hadoop2.4 version, it throws py4j.protocol.Py4JJavaError from pyspark.sql import SQLContext from pyspark import SparkContext sc = SparkContext() sqlContext = SQLContext(sc) # Create the DataFrame df = sqlContext.jsonFile("changes.json") # Show the content of the DataFrame df.show() Error thrown - File "/Users/abhishekchoudhary/Work/python/evolveML/kaggle/avirto/test.py", line 11, in <module> df = sqlContext.jsonFile("changes.json") File "/Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/pyspark/sql/context.py", line 377, in jsonFile df = self._ssql_ctx.jsonFile(path, samplingRatio) File "/Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__ File "/Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value py4j.protocol.Py4JJavaError Not able to load Dataframe using Python throws py4j.protocol.Py4JJavaError -- Key: SPARK-8296 URL: https://issues.apache.org/jira/browse/SPARK-8296 Project: Spark Issue Type: Bug Affects Versions: 1.3.1 Reporter: ABHISHEK CHOUDHARY Labels: test While trying to load a json file using SQLContext in the prebuilt spark-1.3.1-bin-hadoop2.4 version, it throws py4j.protocol.Py4JJavaError from pyspark.sql import SQLContext from pyspark import SparkContext sc = SparkContext() sqlContext = SQLContext(sc) # Create the DataFrame df = sqlContext.jsonFile("changes.json") # Show the content of the DataFrame df.show() Error thrown - File "/Users/abhishekchoudhary/Work/python/evolveML/kaggle/avirto/test.py", line 11, in <module> df = sqlContext.jsonFile("changes.json") File "/Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/pyspark/sql/context.py", line 377, in jsonFile df = self._ssql_ctx.jsonFile(path, samplingRatio) File "/Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__ File "/Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value py4j.protocol.Py4JJavaError On checking through the source code, I found that 'gateway_client' is not valid.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
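One way to narrow down the truncated Py4JJavaError above is to attempt the same load from Scala: if it fails there too, the root cause is on the JVM side rather than in py4j or the Python wrapper. A sketch using the same Spark 1.3.x API as the report (jsonFile was later superseded by read.json):

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Mirror of the reporter's Python snippet, in Scala (Spark 1.3.x API).
val sc = new SparkContext(new SparkConf().setAppName("json-check").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)
val df = sqlContext.jsonFile("changes.json")
df.show()
{code}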
[jira] [Assigned] (SPARK-7186) Decouple internal Row from external Row
[ https://issues.apache.org/jira/browse/SPARK-7186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-7186: - Assignee: Davies Liu Decouple internal Row from external Row --- Key: SPARK-7186 URL: https://issues.apache.org/jira/browse/SPARK-7186 Project: Spark Issue Type: New Feature Components: SQL Reporter: Reynold Xin Assignee: Davies Liu Priority: Blocker Currently, we use o.a.s.sql.Row both internally and externally. The external interface is wider than what the internals need because it is designed to facilitate end-user programming. This design has proven to be very error-prone and cumbersome for internal Row implementations. As a first step, we should just create an InternalRow interface in the catalyst module, which is identical to the current Row interface. And we should switch all internal operators/expressions to use this InternalRow instead. When we need to expose Row, we convert the InternalRow implementation into Row for users. After this, we can start removing methods that don't make sense for InternalRow (in a separate ticket). This is probably one of the most important refactorings in Spark 1.5. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
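The shape of the proposed refactoring in miniature; the names mirror the description, while the real interfaces carry many more methods:

{code}
// Operators and expressions depend only on the narrow internal interface...
trait InternalRow {
  def numFields: Int
  def get(ordinal: Int): Any
}

// ...and rows are converted to the public Row only at the user-facing boundary.
def toPublicRow(row: InternalRow): org.apache.spark.sql.Row =
  org.apache.spark.sql.Row.fromSeq((0 until row.numFields).map(row.get))
{code}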
[jira] [Commented] (SPARK-8103) DAGScheduler should not launch multiple concurrent attempts for one stage on fetch failures
[ https://issues.apache.org/jira/browse/SPARK-8103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14581019#comment-14581019 ] Apache Spark commented on SPARK-8103: - User 'squito' has created a pull request for this issue: https://github.com/apache/spark/pull/6750 DAGScheduler should not launch multiple concurrent attempts for one stage on fetch failures --- Key: SPARK-8103 URL: https://issues.apache.org/jira/browse/SPARK-8103 Project: Spark Issue Type: Bug Components: Scheduler, Spark Core Affects Versions: 1.4.0 Reporter: Imran Rashid Assignee: Imran Rashid When there is a fetch failure, {{DAGScheduler}} is supposed to fail the stage, retry the necessary portions of the preceding shuffle stage which generated the shuffle data, and eventually rerun the stage. We generally expect to get multiple fetch failures together, but only want to re-start the stage once. The code already makes an attempt to address this https://github.com/apache/spark/blob/10ba1880878d0babcdc5c9b688df5458ea131531/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1108 . {code} // It is likely that we receive multiple FetchFailed for a single stage (because we have // multiple tasks running concurrently on different executors). In that case, it is possible // the fetch failure has already been handled by the scheduler. if (runningStages.contains(failedStage)) { {code} However, this logic is flawed because the stage may have been **resubmitted** by the time we get these fetch failures. In that case, {{runningStages.contains(failedStage)}} will be true, but we've already handled these failures. This results in multiple concurrent non-zombie attempts for one stage. In addition to being very confusing, and a waste of resources, this also can lead to later stages being submitted before the previous stage has registered its map output. This happens because (a) when one attempt finishes all its tasks, it may not register its map output because the stage still has pending tasks, from other attempts https://github.com/apache/spark/blob/10ba1880878d0babcdc5c9b688df5458ea131531/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1046 {code} if (runningStages.contains(shuffleStage) && shuffleStage.pendingTasks.isEmpty) { {code} and (b) {{submitStage}} thinks the following stage is ready to go, because {{getMissingParentStages}} thinks the stage is complete as long as it has all of its map outputs: https://github.com/apache/spark/blob/10ba1880878d0babcdc5c9b688df5458ea131531/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L397 {code} if (!mapStage.isAvailable) { missing += mapStage } {code} So the following stage is submitted repeatedly, but it is doomed to fail because its shuffle output has never been registered with the map output tracker. Here's an example failure in this case: {noformat} WARN TaskSetManager: Lost task 5.0 in stage 3.2 (TID 294, 192.168.1.104): FetchFailed(null, shuffleId=0, mapId=-1, reduceId=5, message= org.apache.spark.shuffle.MetadataFetchFailedException: Missing output locations for shuffle ... {noformat} Note that this is a subset of the problems originally described in SPARK-7308, limited to just the issues affecting the DAGScheduler -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8103) DAGScheduler should not launch multiple concurrent attempts for one stage on fetch failures
[ https://issues.apache.org/jira/browse/SPARK-8103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8103: --- Assignee: Imran Rashid (was: Apache Spark) DAGScheduler should not launch multiple concurrent attempts for one stage on fetch failures --- Key: SPARK-8103 URL: https://issues.apache.org/jira/browse/SPARK-8103 Project: Spark Issue Type: Bug Components: Scheduler, Spark Core Affects Versions: 1.4.0 Reporter: Imran Rashid Assignee: Imran Rashid When there is a fetch failure, {{DAGScheduler}} is supposed to fail the stage, retry the necessary portions of the preceding shuffle stage which generated the shuffle data, and eventually rerun the stage. We generally expect to get multiple fetch failures together, but only want to re-start the stage once. The code already makes an attempt to address this https://github.com/apache/spark/blob/10ba1880878d0babcdc5c9b688df5458ea131531/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1108 . {code} // It is likely that we receive multiple FetchFailed for a single stage (because we have // multiple tasks running concurrently on different executors). In that case, it is possible // the fetch failure has already been handled by the scheduler. if (runningStages.contains(failedStage)) { {code} However, this logic is flawed because the stage may have been **resubmitted** by the time we get these fetch failures. In that case, {{runningStages.contains(failedStage)}} will be true, but we've already handled these failures. This results in multiple concurrent non-zombie attempts for one stage. In addition to being very confusing, and a waste of resources, this also can lead to later stages being submitted before the previous stage has registered its map output. This happens because (a) when one attempt finishes all its tasks, it may not register its map output because the stage still has pending tasks, from other attempts https://github.com/apache/spark/blob/10ba1880878d0babcdc5c9b688df5458ea131531/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1046 {code} if (runningStages.contains(shuffleStage) && shuffleStage.pendingTasks.isEmpty) { {code} and (b) {{submitStage}} thinks the following stage is ready to go, because {{getMissingParentStages}} thinks the stage is complete as long as it has all of its map outputs: https://github.com/apache/spark/blob/10ba1880878d0babcdc5c9b688df5458ea131531/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L397 {code} if (!mapStage.isAvailable) { missing += mapStage } {code} So the following stage is submitted repeatedly, but it is doomed to fail because its shuffle output has never been registered with the map output tracker. Here's an example failure in this case: {noformat} WARN TaskSetManager: Lost task 5.0 in stage 3.2 (TID 294, 192.168.1.104): FetchFailed(null, shuffleId=0, mapId=-1, reduceId=5, message= org.apache.spark.shuffle.MetadataFetchFailedException: Missing output locations for shuffle ... {noformat} Note that this is a subset of the problems originally described in SPARK-7308, limited to just the issues affecting the DAGScheduler -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8297) Scheduler backend is not notified in case node fails in YARN
[ https://issues.apache.org/jira/browse/SPARK-8297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14581038#comment-14581038 ] Mridul Muralidharan commented on SPARK-8297: Spark on mesos handles this situation by calling removeExecutor() on the scheduler backend - the yarn module does not. I have this fixed locally, but unfortunately, I do not have the bandwidth to shepherd a patch. The fix is simple - replicate something similar to what is done in CoarseMesosSchedulerBackend.slaveLost(). Essentially: a) maintain a mapping from container-id to executor-id in YarnAllocator (consistent with, and the inverse of, executorIdToContainer) b) propagate the scheduler backend to YarnAllocator when YarnClusterScheduler.postCommitHook is called, c) in processCompletedContainers, if the container is not in releasedContainers, invoke backend.removeExecutor(executorId, msg) to notify the backend that the executor did not exit gracefully/as expected, d) remove the mapping from containerIdToExecutorId and executorIdToContainer in processCompletedContainers (the latter also fixes a memory leak in YarnAllocator, btw). In case no one picks this one up, I can fix it later in the 1.5 release cycle. Scheduler backend is not notified in case node fails in YARN Key: SPARK-8297 URL: https://issues.apache.org/jira/browse/SPARK-8297 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.2, 1.3.1, 1.4.1 Environment: Spark on yarn - both client and cluster mode. Reporter: Mridul Muralidharan Priority: Critical When a node crashes, yarn detects the failure and notifies spark - but this information is not propagated to scheduler backend (unlike in mesos mode, for example). It results in repeated re-execution of stages (due to FetchFailedException on shuffle side), resulting finally in application failure. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
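Steps (a) through (d) sketched with paraphrased names; this shows the shape of the fix, not the actual YarnAllocator code:

{code}
import scala.collection.mutable

// Minimal stand-in for the scheduler backend method the comment refers to.
trait SchedulerBackendLike {
  def removeExecutor(executorId: String, reason: String): Unit
}

// (b) the backend is propagated in via the constructor.
class YarnAllocatorSketch(backend: SchedulerBackendLike) {
  // (a) container-id -> executor-id, inverse of executorIdToContainer.
  private val containerIdToExecutorId = mutable.HashMap.empty[String, String]
  private val releasedContainers = mutable.HashSet.empty[String]

  // (c) + (d): on container completion, notify the backend of unexpected exits
  // and drop the mapping (also plugging the memory leak mentioned above).
  def processCompletedContainer(containerId: String, exitStatus: Int): Unit = {
    if (!releasedContainers.remove(containerId)) {
      containerIdToExecutorId.get(containerId).foreach { execId =>
        backend.removeExecutor(execId,
          s"Container $containerId exited unexpectedly with status $exitStatus")
      }
    }
    containerIdToExecutorId.remove(containerId)
  }
}
{code}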
[jira] [Updated] (SPARK-8297) Scheduler backend is not notified in case node fails in YARN
[ https://issues.apache.org/jira/browse/SPARK-8297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan updated SPARK-8297: --- Affects Version/s: 1.5.0 Scheduler backend is not notified in case node fails in YARN Key: SPARK-8297 URL: https://issues.apache.org/jira/browse/SPARK-8297 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.2, 1.3.1, 1.4.1, 1.5.0 Environment: Spark on yarn - both client and cluster mode. Reporter: Mridul Muralidharan Priority: Critical When a node crashes, yarn detects the failure and notifies spark - but this information is not propagated to scheduler backend (unlike in mesos mode, for example). It results in repeated re-execution of stages (due to FetchFailedException on shuffle side), resulting finally in application failure. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6419) GenerateOrdering does not support BinaryType and complex types.
[ https://issues.apache.org/jira/browse/SPARK-6419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-6419. --- Resolution: Fixed Fix Version/s: 1.5.0 GenerateOrdering does not support BinaryType and complex types. --- Key: SPARK-6419 URL: https://issues.apache.org/jira/browse/SPARK-6419 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Assignee: Davies Liu Fix For: 1.5.0 When a user wants to order by binary columns or columns with complex types and code gen is enabled, there will be a MatchError ([see here|https://github.com/apache/spark/blob/v1.3.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateOrdering.scala#L45]). We can either add support for these types or have a function to check if we can safely call GenerateOrdering (like canBeCodeGened for the HashAggregation strategy). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8293) Add high-level java docs to important YARN classes
Andrew Or created SPARK-8293: Summary: Add high-level java docs to important YARN classes Key: SPARK-8293 URL: https://issues.apache.org/jira/browse/SPARK-8293 Project: Spark Issue Type: Sub-task Components: Documentation, YARN Affects Versions: 1.4.0 Reporter: Andrew Or The YARN integration code has been with Spark for many releases now. However, important classes like `Client`, `ApplicationMaster` and `ExecutorRunnable` don't even have detailed javadocs yet. This makes it difficult to follow the code without digging into too much detail. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8294) Break down large methods in YARN code
Andrew Or created SPARK-8294: - Summary: Break down large methods in YARN code Key: SPARK-8294 URL: https://issues.apache.org/jira/browse/SPARK-8294 Project: Spark Issue Type: Sub-task Components: YARN Affects Versions: 1.4.0 Reporter: Andrew Or What large methods am I talking about? Client#prepareLocalResources ~ 170 lines ExecutorRunnable#prepareCommand ~ 100 lines Client#setupLaunchEnv ~ 100 lines ... many others that hover around 80 - 90 lines There are several things wrong with this. First, it's difficult to follow / review the code. Second, it's difficult to test it at a fine granularity. In the past we as a community have been reluctant to add new regression tests for YARN changes. This stems from the fact that it is difficult to write tests, and the cost is that we can't really ensure the correctness of the code easily. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org