[jira] [Assigned] (SPARK-8169) Add StopWordsRemover as a transformer
[ https://issues.apache.org/jira/browse/SPARK-8169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8169: --- Assignee: Apache Spark

Add StopWordsRemover as a transformer
-
Key: SPARK-8169
URL: https://issues.apache.org/jira/browse/SPARK-8169
Project: Spark
Issue Type: New Feature
Components: ML
Affects Versions: 1.5.0
Reporter: Xiangrui Meng
Assignee: Apache Spark

StopWordsRemover takes a string array column and outputs a string array column with all defined stop words removed. The transformer should also come with a standard set of stop words as the default.

{code}
val stopWords = new StopWordsRemover()
  .setInputCol("words")
  .setOutputCol("cleanWords")
  .setStopWords(Array(...)) // optional
val output = stopWords.transform(df)
{code}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
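A minimal sketch of the operation such a transformer would perform (illustrative only, not the actual implementation; the function name is an assumption): drop every token in the input array that appears in the stop-word set.

{code}
// Hypothetical core of a stop-word remover: filter a token sequence
// against a set of stop words.
def removeStopWords(words: Seq[String], stopWords: Set[String]): Seq[String] =
  words.filterNot(stopWords.contains)

// removeStopWords(Seq("the", "quick", "fox"), Set("the", "a"))
//   == Seq("quick", "fox")
{code}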
[jira] [Commented] (SPARK-8287) Filter not push down through Subquery or View
[ https://issues.apache.org/jira/browse/SPARK-8287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580423#comment-14580423 ] Li Sheng commented on SPARK-8287:

I'm working on this in a hacky way, and I don't think it is the proper way to solve this. First, I collect the filter condition and the attribute references of the Filter sitting on top of the Subquery or View. Then I match the attribute references of each MetastoreRelation against the collected ones; once a match is found, I remove the matched expression from the Filter on top of the Subquery and generate a new Filter on top of the MetastoreRelation inside the Subquery.

E.g.: `ds` is the attribute reference in the Filter on top of the Subquery or View. First I check whether the outputs of `src_partitioned1` and `src_partitioned2` contain the field `ds`; if they do, I push ds = '2' down on top of src_partitioned1 and src_partitioned2.

But I found that if the outer query's filter on top of the Subquery contains an Alias, things become more complex, because we don't know whether the alias is a real field in a MetastoreRelation or just an alias. E.g., if the Subquery names sum(1) as ds, then ds is an alias, and applying the process above is wrong, because `ds` conflicts: it exists both in src_partitioned1 and in the Subquery result, so it cannot simply be pushed down. BTW: Hive knows that `ds` is a partition key and does push it down. I'd be very pleased if you have good suggestions on how to solve this.

Filter not push down through Subquery or View
-
Key: SPARK-8287
URL: https://issues.apache.org/jira/browse/SPARK-8287
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.3.1
Reporter: Li Sheng
Fix For: 1.4.0
Original Estimate: 40h
Remaining Estimate: 40h

Filters are not pushed down through a Subquery or a View. If two big partitioned tables are joined inside a Subquery or a View and the filter is not pushed down, this causes a full-partition join and performance problems.
Let me give an example that reproduces the problem:

{code:sql}
create table src(key int, value string);

-- Creates partitioned table and imports data
CREATE TABLE src_partitioned1 (key int, value STRING) PARTITIONED BY (ds STRING);
insert overwrite table src_partitioned1 PARTITION (ds='1') select key, value from src;
insert overwrite table src_partitioned1 PARTITION (ds='2') select key, value from src;
CREATE TABLE src_partitioned2 (key int, value STRING) PARTITIONED BY (ds STRING);
insert overwrite table src_partitioned2 PARTITION (ds='1') select key, value from src;
insert overwrite table src_partitioned2 PARTITION (ds='2') select key, value from src;

-- Creates views
create view src_view as select sum(a.key) s1, sum(b.key) s2, b.ds ds from src_partitioned1 a join src_partitioned2 b on a.ds = b.ds group by b.ds;
create view src_view_1 as select sum(a.key) s1, sum(b.key) s2, b.ds my_ds from src_partitioned1 a join src_partitioned2 b on a.ds = b.ds group by b.ds;

-- QueryExecution
select * from dw.src_view where ds='2';
{code}

{noformat}
sql("select * from dw.src_view where ds='2'").queryExecution

== Parsed Logical Plan ==
'Project [*]
 'Filter ('ds = 2)
  'UnresolvedRelation [dw,src_view], None

== Analyzed Logical Plan ==
Project [s1#60L,s2#61L,ds#62]
 Filter (ds#62 = 2)
  Subquery src_view
   Aggregate [ds#66], [SUM(CAST(key#64, LongType)) AS s1#60L,SUM(CAST(key#67, LongType)) AS s2#61L,ds#66 AS ds#62]
    Join Inner, Some((ds#63 = ds#66))
     MetastoreRelation dw, src_partitioned1, Some(a)
     MetastoreRelation dw, src_partitioned2, Some(b)

== Optimized Logical Plan ==
Filter (ds#62 = 2)
 Aggregate [ds#66], [SUM(CAST(key#64, LongType)) AS s1#60L,SUM(CAST(key#67, LongType)) AS s2#61L,ds#66 AS ds#62]
  Project [ds#66,key#64,key#67]
   Join Inner, Some((ds#63 = ds#66))
    Project [key#64,ds#63]
     MetastoreRelation dw, s...
{noformat}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
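To illustrate the pushdown being discussed, here is a minimal Catalyst-style sketch, assuming the Spark 1.3-era {{Subquery(alias, child)}} logical operator: it pushes a Filter below a Subquery only when every attribute the condition references is produced unchanged by the child, and it deliberately does not handle the aliased-aggregate case described in the comment above.

{code}
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, Subquery}

// Sketch only: push a Filter inside a Subquery when the condition refers
// solely to attributes the child already outputs (no aliases involved).
object PushFilterThroughSubquery extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case Filter(condition, Subquery(alias, child))
        if condition.references.subsetOf(child.outputSet) =>
      Subquery(alias, Filter(condition, child))
  }
}
{code}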
[jira] [Updated] (SPARK-8287) Filter not push down through Subquery or View
[ https://issues.apache.org/jira/browse/SPARK-8287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-8287:
--
Description:
Filters are not pushed down through a Subquery or a View. If two big partitioned tables are joined inside a Subquery or a View and the filter is not pushed down, this causes a full-partition join and performance problems.

Let me give an example that reproduces the problem:

{code:sql}
create table src(key int, value string);

-- Creates partitioned table and imports data
CREATE TABLE src_partitioned1 (key int, value STRING) PARTITIONED BY (ds STRING);
insert overwrite table src_partitioned1 PARTITION (ds='1') select key, value from src;
insert overwrite table src_partitioned1 PARTITION (ds='2') select key, value from src;
CREATE TABLE src_partitioned2 (key int, value STRING) PARTITIONED BY (ds STRING);
insert overwrite table src_partitioned2 PARTITION (ds='1') select key, value from src;
insert overwrite table src_partitioned2 PARTITION (ds='2') select key, value from src;

-- Creates views
create view src_view as select sum(a.key) s1, sum(b.key) s2, b.ds ds from src_partitioned1 a join src_partitioned2 b on a.ds = b.ds group by b.ds;
create view src_view_1 as select sum(a.key) s1, sum(b.key) s2, b.ds my_ds from src_partitioned1 a join src_partitioned2 b on a.ds = b.ds group by b.ds;

-- QueryExecution
select * from dw.src_view where ds='2';
{code}

{noformat}
sql("select * from dw.src_view where ds='2'").queryExecution

== Parsed Logical Plan ==
'Project [*]
 'Filter ('ds = 2)
  'UnresolvedRelation [dw,src_view], None

== Analyzed Logical Plan ==
Project [s1#60L,s2#61L,ds#62]
 Filter (ds#62 = 2)
  Subquery src_view
   Aggregate [ds#66], [SUM(CAST(key#64, LongType)) AS s1#60L,SUM(CAST(key#67, LongType)) AS s2#61L,ds#66 AS ds#62]
    Join Inner, Some((ds#63 = ds#66))
     MetastoreRelation dw, src_partitioned1, Some(a)
     MetastoreRelation dw, src_partitioned2, Some(b)

== Optimized Logical Plan ==
Filter (ds#62 = 2)
 Aggregate [ds#66], [SUM(CAST(key#64, LongType)) AS s1#60L,SUM(CAST(key#67, LongType)) AS s2#61L,ds#66 AS ds#62]
  Project [ds#66,key#64,key#67]
   Join Inner, Some((ds#63 = ds#66))
    Project [key#64,ds#63]
     MetastoreRelation dw, s...
{noformat}

was:
Filters are not pushed down through a Subquery or a View. If two big partitioned tables are joined inside a Subquery or a View and the filter is not pushed down, this causes a full-partition join and performance problems.

Let me give an example that reproduces the problem:

create table src(key int, value string);
-- Creates partitioned table and imports data
CREATE TABLE src_partitioned1 (key int, value STRING) PARTITIONED BY (ds STRING);
insert overwrite table src_partitioned1 PARTITION (ds='1') select key, value from src;
insert overwrite table src_partitioned1 PARTITION (ds='2') select key, value from src;
CREATE TABLE src_partitioned2 (key int, value STRING) PARTITIONED BY (ds STRING);
insert overwrite table src_partitioned2 PARTITION (ds='1') select key, value from src;
insert overwrite table src_partitioned2 PARTITION (ds='2') select key, value from src;
-- Creates views
create view src_view as select sum(a.key) s1, sum(b.key) s2, b.ds ds from src_partitioned1 a join src_partitioned2 b on a.ds = b.ds group by b.ds
create view src_view_1 as select sum(a.key) s1, sum(b.key) s2, b.ds my_ds from src_partitioned1 a join src_partitioned2 b on a.ds = b.ds group by b.ds
-- QueryExecution
select * from dw.src_view where ds='2'

sql("select * from dw.src_view where ds='2'").queryExecution

== Parsed Logical Plan ==
'Project [*]
 'Filter ('ds = 2)
  'UnresolvedRelation [dw,src_view], None

== Analyzed Logical Plan ==
Project [s1#60L,s2#61L,ds#62]
 Filter (ds#62 = 2)
  Subquery src_view
   Aggregate [ds#66], [SUM(CAST(key#64, LongType)) AS s1#60L,SUM(CAST(key#67, LongType)) AS s2#61L,ds#66 AS ds#62]
    Join Inner, Some((ds#63 = ds#66))
     MetastoreRelation dw, src_partitioned1, Some(a)
     MetastoreRelation dw, src_partitioned2, Some(b)

== Optimized Logical Plan ==
Filter (ds#62 = 2)
 Aggregate [ds#66], [SUM(CAST(key#64, LongType)) AS s1#60L,SUM(CAST(key#67, LongType)) AS s2#61L,ds#66 AS ds#62]
  Project [ds#66,key#64,key#67]
   Join Inner, Some((ds#63 = ds#66))
    Project [key#64,ds#63]
     MetastoreRelation dw, s...

Filter not push down through Subquery or View
-
Key: SPARK-8287
URL: https://issues.apache.org/jira/browse/SPARK-8287
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.3.1
Reporter: Li Sheng
Fix For: 1.4.0
Original Estimate: 40h
Remaining Estimate: 40h

Filters are not pushed down through a Subquery or a View. If two big partitioned tables are joined inside a Subquery or a View and the filter is not pushed down, this will cause a full partition join and will cause
[jira] [Commented] (SPARK-8048) Explicit partitioning of an RDD with 0 partitions will yield empty outer join
[ https://issues.apache.org/jira/browse/SPARK-8048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580395#comment-14580395 ] Josiah Samuel Sathiadass commented on SPARK-8048:

Often, constructing a HashPartitioner with 0 is not intentional. So instead of falling through, consider making the call fail with an exception.

Explicit partitioning of an RDD with 0 partitions will yield empty outer join
-
Key: SPARK-8048
URL: https://issues.apache.org/jira/browse/SPARK-8048
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.3.1
Reporter: Olivier Toupin
Priority: Minor

Check this code: https://gist.github.com/anonymous/0f935915f2bc182841f0

Because of this: {{.partitionBy(new HashPartitioner(0))}}, the join returns an empty result. The expected behaviour here would be for the join to crash, raise an error, or return unjoined results, but instead it yields an empty RDD. This is a trivial example, but imagine {{.partitionBy(new HashPartitioner(previous.partitions.length))}}: you join on an empty previous RDD, the lookup table is empty, and Spark loses all your results instead of returning unjoined results, and this without warnings or errors.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
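For reference, a self-contained reconstruction of the reported behaviour (my sketch following the description above; the linked gist remains the authoritative repro):

{code}
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Partitioning by a HashPartitioner with 0 partitions yields a
// zero-partition RDD, so the subsequent outer join silently returns no rows.
object Spark8048Repro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("SPARK-8048").setMaster("local[2]"))
    val left  = sc.parallelize(Seq(1 -> "a", 2 -> "b"))
    val right = sc.parallelize(Seq(1 -> "x"))
    val joined = left.partitionBy(new HashPartitioner(0)).leftOuterJoin(right)
    // Prints List(): the unjoined left rows are lost without any warning.
    println(joined.collect().toList)
    sc.stop()
  }
}
{code}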
[jira] [Comment Edited] (SPARK-8287) Filter not push down through Subquery or View
[ https://issues.apache.org/jira/browse/SPARK-8287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580403#comment-14580403 ] Li Sheng edited comment on SPARK-8287 at 6/10/15 11:16 AM: --- Sorry [~lian cheng] , see this: scala sql(select * from dw.src_view where ds='2' ).queryExecution.optimizedPlan 15/06/10 19:15:39 INFO ParseDriver: Parsing command: select * from dw.src_view where ds='2' 15/06/10 19:15:39 INFO ParseDriver: Parse Completed 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_view 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_view 15/06/10 19:15:39 INFO ParseDriver: Parsing command: select sum(`a`.`key`) `s1`, sum(`b`.`key`) `s2`, `b`.`ds` `ds` from `dw`.`src_partitioned1` `a` join `dw`.`src_partitioned2` `b` on `a`.`ds` = `b`.`ds` group by `b`.`ds` 15/06/10 19:15:39 INFO ParseDriver: Parse Completed 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_partitioned1 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_partitioned1 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_partitions : db=dw tbl=src_partitioned1 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_partitions : db=dw tbl=src_partitioned1 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_partitioned2 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_partitioned2 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_partitions : db=dw tbl=src_partitioned2 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_partitions : db=dw tbl=src_partitioned2 res257: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = Filter (ds#936 = 2) Aggregate [ds#940], [SUM(CAST(key#938, LongType)) AS s1#934L,SUM(CAST(key#941, LongType)) AS s2#935L,ds#940 AS ds#936] Project [ds#940,key#938,key#941] Join Inner, Some((ds#937 = ds#940)) Project [key#938,ds#937] MetastoreRelation dw, src_partitioned1, Some(a) Project [ds#940,key#941] MetastoreRelation dw, src_partitioned2, Some(b) Another one is for subquery , also can not push down the filter: scala sql(select * from dw.src_view_1 where my_ds='2' ).queryExecution.optimizedPlan 15/06/10 19:08:59 INFO ParseDriver: Parsing command: select * from dw.src_view_1 where my_ds='2' 15/06/10 19:08:59 INFO ParseDriver: Parse Completed 15/06/10 19:08:59 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_view_1 15/06/10 19:08:59 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_view_1 15/06/10 19:08:59 INFO ParseDriver: Parsing command: select sum(`a`.`key`) `s1`, sum(`b`.`key`) `s2`, `b`.`ds` `my_ds` from `dw`.`src_partitioned1` `a` join `dw`.`src_partitioned2` `b` on `a`.`ds` = `b`.`ds` group by `b`.`ds` 15/06/10 19:08:59 INFO ParseDriver: Parse Completed 15/06/10 19:08:59 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_partitioned1 15/06/10 19:08:59 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_partitioned1 15/06/10 19:08:59 INFO HiveMetaStore: 0: get_partitions : db=dw tbl=src_partitioned1 15/06/10 19:08:59 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_partitions : db=dw tbl=src_partitioned1 15/06/10 19:08:59 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_partitioned2 15/06/10 19:08:59 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_partitioned2 15/06/10 19:08:59 INFO HiveMetaStore: 0: get_partitions : db=dw tbl=src_partitioned2 15/06/10 19:08:59 INFO audit: ugi=shengli 
ip=unknown-ip-addr cmd=get_partitions : db=dw tbl=src_partitioned2 res254: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = Filter (my_ds#918 = 2) Aggregate [ds#922], [SUM(CAST(key#920, LongType)) AS s1#916L,SUM(CAST(key#923, LongType)) AS s2#917L,ds#922 AS my_ds#918] Project [ds#922,key#920,key#923] Join Inner, Some((ds#919 = ds#922)) Project [key#920,ds#919] MetastoreRelation dw, src_partitioned1, Some(a) Project [ds#922,key#923] MetastoreRelation dw, src_partitioned2, Some(b) was (Author: oopsoutofmemory): Sorry [~lian cheng] , see this: scala sql(select * from dw.src_view_1 where my_ds='2' ).queryExecution.optimizedPlan 15/06/10 19:08:59 INFO ParseDriver: Parsing command: select * from dw.src_view_1 where my_ds='2' 15/06/10 19:08:59 INFO ParseDriver: Parse Completed 15/06/10 19:08:59 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_view_1 15/06/10 19:08:59 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_view_1 15/06/10 19:08:59 INFO ParseDriver: Parsing command: select sum(`a`.`key`) `s1`, sum(`b`.`key`) `s2`, `b`.`ds` `my_ds` from `dw`.`src_partitioned1` `a` join `dw`.`src_partitioned2`
[jira] [Commented] (SPARK-6797) Add support for YARN cluster mode
[ https://issues.apache.org/jira/browse/SPARK-6797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580417#comment-14580417 ] Apache Spark commented on SPARK-6797: - User 'sun-rui' has created a pull request for this issue: https://github.com/apache/spark/pull/6743 Add support for YARN cluster mode - Key: SPARK-6797 URL: https://issues.apache.org/jira/browse/SPARK-6797 Project: Spark Issue Type: Improvement Components: SparkR Reporter: Shivaram Venkataraman Assignee: Sun Rui Priority: Critical SparkR currently does not work in YARN cluster mode as the R package is not shipped along with the assembly jar to the YARN AM. We could try to use the support for archives in YARN to send out the R package as a zip file. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6797) Add support for YARN cluster mode
[ https://issues.apache.org/jira/browse/SPARK-6797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6797: --- Assignee: Sun Rui (was: Apache Spark) Add support for YARN cluster mode - Key: SPARK-6797 URL: https://issues.apache.org/jira/browse/SPARK-6797 Project: Spark Issue Type: Improvement Components: SparkR Reporter: Shivaram Venkataraman Assignee: Sun Rui Priority: Critical SparkR currently does not work in YARN cluster mode as the R package is not shipped along with the assembly jar to the YARN AM. We could try to use the support for archives in YARN to send out the R package as a zip file. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6797) Add support for YARN cluster mode
[ https://issues.apache.org/jira/browse/SPARK-6797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6797: --- Assignee: Apache Spark (was: Sun Rui) Add support for YARN cluster mode - Key: SPARK-6797 URL: https://issues.apache.org/jira/browse/SPARK-6797 Project: Spark Issue Type: Improvement Components: SparkR Reporter: Shivaram Venkataraman Assignee: Apache Spark Priority: Critical SparkR currently does not work in YARN cluster mode as the R package is not shipped along with the assembly jar to the YARN AM. We could try to use the support for archives in YARN to send out the R package as a zip file. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8287) Filter not push down through Subquery or View
[ https://issues.apache.org/jira/browse/SPARK-8287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580390#comment-14580390 ] Cheng Lian commented on SPARK-8287:

[~OopsOutOfMemory] Could you please provide the output of

{code}
println(sql("select * from dw.src_view where ds='2'").queryExecution)
{code}

instead of

{code}
sql("select * from dw.src_view where ds='2'").queryExecution
{code}

The latter gets stripped and doesn't show the whole contents.

Filter not push down through Subquery or View
-
Key: SPARK-8287
URL: https://issues.apache.org/jira/browse/SPARK-8287
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.3.1
Reporter: Li Sheng
Fix For: 1.4.0
Original Estimate: 40h
Remaining Estimate: 40h

Filters are not pushed down through a Subquery or a View. If two big partitioned tables are joined inside a Subquery or a View and the filter is not pushed down, this causes a full-partition join and performance problems.

Let me give an example that reproduces the problem:

{code:sql}
create table src(key int, value string);

-- Creates partitioned table and imports data
CREATE TABLE src_partitioned1 (key int, value STRING) PARTITIONED BY (ds STRING);
insert overwrite table src_partitioned1 PARTITION (ds='1') select key, value from src;
insert overwrite table src_partitioned1 PARTITION (ds='2') select key, value from src;
CREATE TABLE src_partitioned2 (key int, value STRING) PARTITIONED BY (ds STRING);
insert overwrite table src_partitioned2 PARTITION (ds='1') select key, value from src;
insert overwrite table src_partitioned2 PARTITION (ds='2') select key, value from src;

-- Creates views
create view src_view as select sum(a.key) s1, sum(b.key) s2, b.ds ds from src_partitioned1 a join src_partitioned2 b on a.ds = b.ds group by b.ds;
create view src_view_1 as select sum(a.key) s1, sum(b.key) s2, b.ds my_ds from src_partitioned1 a join src_partitioned2 b on a.ds = b.ds group by b.ds;

-- QueryExecution
select * from dw.src_view where ds='2';
{code}

{noformat}
sql("select * from dw.src_view where ds='2'").queryExecution

== Parsed Logical Plan ==
'Project [*]
 'Filter ('ds = 2)
  'UnresolvedRelation [dw,src_view], None

== Analyzed Logical Plan ==
Project [s1#60L,s2#61L,ds#62]
 Filter (ds#62 = 2)
  Subquery src_view
   Aggregate [ds#66], [SUM(CAST(key#64, LongType)) AS s1#60L,SUM(CAST(key#67, LongType)) AS s2#61L,ds#66 AS ds#62]
    Join Inner, Some((ds#63 = ds#66))
     MetastoreRelation dw, src_partitioned1, Some(a)
     MetastoreRelation dw, src_partitioned2, Some(b)

== Optimized Logical Plan ==
Filter (ds#62 = 2)
 Aggregate [ds#66], [SUM(CAST(key#64, LongType)) AS s1#60L,SUM(CAST(key#67, LongType)) AS s2#61L,ds#66 AS ds#62]
  Project [ds#66,key#64,key#67]
   Join Inner, Some((ds#63 = ds#66))
    Project [key#64,ds#63]
     MetastoreRelation dw, s...
{noformat}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-8287) Filter not push down through Subquery or View
[ https://issues.apache.org/jira/browse/SPARK-8287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580403#comment-14580403 ] Li Sheng edited comment on SPARK-8287 at 6/10/15 11:11 AM: --- Sorry [~lian cheng] , see this: scala sql(select * from dw.src_view_1 where my_ds='2' ).queryExecution.optimizedPlan 15/06/10 19:08:59 INFO ParseDriver: Parsing command: select * from dw.src_view_1 where my_ds='2' 15/06/10 19:08:59 INFO ParseDriver: Parse Completed 15/06/10 19:08:59 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_view_1 15/06/10 19:08:59 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_view_1 15/06/10 19:08:59 INFO ParseDriver: Parsing command: select sum(`a`.`key`) `s1`, sum(`b`.`key`) `s2`, `b`.`ds` `my_ds` from `dw`.`src_partitioned1` `a` join `dw`.`src_partitioned2` `b` on `a`.`ds` = `b`.`ds` group by `b`.`ds` 15/06/10 19:08:59 INFO ParseDriver: Parse Completed 15/06/10 19:08:59 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_partitioned1 15/06/10 19:08:59 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_partitioned1 15/06/10 19:08:59 INFO HiveMetaStore: 0: get_partitions : db=dw tbl=src_partitioned1 15/06/10 19:08:59 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_partitions : db=dw tbl=src_partitioned1 15/06/10 19:08:59 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_partitioned2 15/06/10 19:08:59 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_partitioned2 15/06/10 19:08:59 INFO HiveMetaStore: 0: get_partitions : db=dw tbl=src_partitioned2 15/06/10 19:08:59 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_partitions : db=dw tbl=src_partitioned2 res254: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = Filter (my_ds#918 = 2) Aggregate [ds#922], [SUM(CAST(key#920, LongType)) AS s1#916L,SUM(CAST(key#923, LongType)) AS s2#917L,ds#922 AS my_ds#918] Project [ds#922,key#920,key#923] Join Inner, Some((ds#919 = ds#922)) Project [key#920,ds#919] MetastoreRelation dw, src_partitioned1, Some(a) Project [ds#922,key#923] MetastoreRelation dw, src_partitioned2, Some(b) was (Author: oopsoutofmemory): Sorry , see this: scala sql(select * from dw.src_view_1 where my_ds='2' ).queryExecution.optimizedPlan 15/06/10 19:08:59 INFO ParseDriver: Parsing command: select * from dw.src_view_1 where my_ds='2' 15/06/10 19:08:59 INFO ParseDriver: Parse Completed 15/06/10 19:08:59 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_view_1 15/06/10 19:08:59 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_view_1 15/06/10 19:08:59 INFO ParseDriver: Parsing command: select sum(`a`.`key`) `s1`, sum(`b`.`key`) `s2`, `b`.`ds` `my_ds` from `dw`.`src_partitioned1` `a` join `dw`.`src_partitioned2` `b` on `a`.`ds` = `b`.`ds` group by `b`.`ds` 15/06/10 19:08:59 INFO ParseDriver: Parse Completed 15/06/10 19:08:59 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_partitioned1 15/06/10 19:08:59 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_partitioned1 15/06/10 19:08:59 INFO HiveMetaStore: 0: get_partitions : db=dw tbl=src_partitioned1 15/06/10 19:08:59 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_partitions : db=dw tbl=src_partitioned1 15/06/10 19:08:59 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_partitioned2 15/06/10 19:08:59 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_partitioned2 15/06/10 19:08:59 INFO HiveMetaStore: 0: get_partitions : db=dw tbl=src_partitioned2 15/06/10 19:08:59 INFO audit: ugi=shengli 
ip=unknown-ip-addr cmd=get_partitions : db=dw tbl=src_partitioned2 res254: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = Filter (my_ds#918 = 2) Aggregate [ds#922], [SUM(CAST(key#920, LongType)) AS s1#916L,SUM(CAST(key#923, LongType)) AS s2#917L,ds#922 AS my_ds#918] Project [ds#922,key#920,key#923] Join Inner, Some((ds#919 = ds#922)) Project [key#920,ds#919] MetastoreRelation dw, src_partitioned1, Some(a) Project [ds#922,key#923] MetastoreRelation dw, src_partitioned2, Some(b) Filter not push down through Subquery or View - Key: SPARK-8287 URL: https://issues.apache.org/jira/browse/SPARK-8287 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1 Reporter: Li Sheng Fix For: 1.4.0 Original Estimate: 40h Remaining Estimate: 40h Filter not push down through Subquery or View. Assume we have two big partitioned table join inner a Subquery or a View and filter not push down, this will cause a full partition join and will cause performance issues. Let me give and
[jira] [Comment Edited] (SPARK-8287) Filter not push down through Subquery or View
[ https://issues.apache.org/jira/browse/SPARK-8287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580403#comment-14580403 ] Li Sheng edited comment on SPARK-8287 at 6/10/15 11:20 AM: --- Sorry [~lian cheng] , see this: scala sql(select * from dw.src_view where ds='2' ).queryExecution.analyzed res258: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = Project [s1#943L,s2#944L,ds#945] Filter (ds#945 = 2) Subquery src_view Aggregate [ds#949], [SUM(CAST(key#947, LongType)) AS s1#943L,SUM(CAST(key#950, LongType)) AS s2#944L,ds#949 AS ds#945] Join Inner, Some((ds#946 = ds#949)) MetastoreRelation dw, src_partitioned1, Some(a) MetastoreRelation dw, src_partitioned2, Some(b) scala sql(select * from dw.src_view where ds='2' ).queryExecution.optimizedPlan 15/06/10 19:15:39 INFO ParseDriver: Parsing command: select * from dw.src_view where ds='2' 15/06/10 19:15:39 INFO ParseDriver: Parse Completed 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_view 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_view 15/06/10 19:15:39 INFO ParseDriver: Parsing command: select sum(`a`.`key`) `s1`, sum(`b`.`key`) `s2`, `b`.`ds` `ds` from `dw`.`src_partitioned1` `a` join `dw`.`src_partitioned2` `b` on `a`.`ds` = `b`.`ds` group by `b`.`ds` 15/06/10 19:15:39 INFO ParseDriver: Parse Completed 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_partitioned1 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_partitioned1 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_partitions : db=dw tbl=src_partitioned1 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_partitions : db=dw tbl=src_partitioned1 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_partitioned2 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_partitioned2 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_partitions : db=dw tbl=src_partitioned2 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_partitions : db=dw tbl=src_partitioned2 res257: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = Filter (ds#936 = 2) Aggregate [ds#940], [SUM(CAST(key#938, LongType)) AS s1#934L,SUM(CAST(key#941, LongType)) AS s2#935L,ds#940 AS ds#936] Project [ds#940,key#938,key#941] Join Inner, Some((ds#937 = ds#940)) Project [key#938,ds#937] MetastoreRelation dw, src_partitioned1, Some(a) Project [ds#940,key#941] MetastoreRelation dw, src_partitioned2, Some(b) scala sql(select * from dw.src_view where ds='2' ).queryExecution.executedPlan res259: org.apache.spark.sql.execution.SparkPlan = Filter (ds#954 = 2) Aggregate false, [ds#958], [SUM(PartialSum#963L) AS s1#952L,SUM(PartialSum#964L) AS s2#953L,ds#958 AS ds#954] Exchange (HashPartitioning [ds#958], 200) Aggregate true, [ds#958], [ds#958,SUM(CAST(key#956, LongType)) AS PartialSum#963L,SUM(CAST(key#959, LongType)) AS PartialSum#964L] Project [ds#958,key#956,key#959] ShuffledHashJoin [ds#955], [ds#958], BuildRight Exchange (HashPartitioning [ds#955], 200) HiveTableScan [key#956,ds#955], (MetastoreRelation dw, src_partitioned1, Some(a)), None Exchange (HashPartitioning [ds#958], 200) HiveTableScan [ds#958,key#959], (MetastoreRelation dw, src_partitioned2, Some(b)), None was (Author: oopsoutofmemory): Sorry [~lian cheng] , see this: scala sql(select * from dw.src_view where ds='2' ).queryExecution.analyzed res258: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = Project 
[s1#943L,s2#944L,ds#945] Filter (ds#945 = 2) Subquery src_view Aggregate [ds#949], [SUM(CAST(key#947, LongType)) AS s1#943L,SUM(CAST(key#950, LongType)) AS s2#944L,ds#949 AS ds#945] Join Inner, Some((ds#946 = ds#949)) MetastoreRelation dw, src_partitioned1, Some(a) MetastoreRelation dw, src_partitioned2, Some(b) scala sql(select * from dw.src_view where ds='2' ).queryExecution.optimizedPlan 15/06/10 19:15:39 INFO ParseDriver: Parsing command: select * from dw.src_view where ds='2' 15/06/10 19:15:39 INFO ParseDriver: Parse Completed 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_view 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_view 15/06/10 19:15:39 INFO ParseDriver: Parsing command: select sum(`a`.`key`) `s1`, sum(`b`.`key`) `s2`, `b`.`ds` `ds` from `dw`.`src_partitioned1` `a` join `dw`.`src_partitioned2` `b` on `a`.`ds` = `b`.`ds` group by `b`.`ds` 15/06/10 19:15:39 INFO ParseDriver: Parse Completed 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_partitioned1 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table :
[jira] [Comment Edited] (SPARK-8287) Filter not push down through Subquery or View
[ https://issues.apache.org/jira/browse/SPARK-8287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580403#comment-14580403 ] Li Sheng edited comment on SPARK-8287 at 6/10/15 11:19 AM: --- Sorry [~lian cheng] , see this: scala sql(select * from dw.src_view where ds='2' ).queryExecution.analyzed res258: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = Project [s1#943L,s2#944L,ds#945] Filter (ds#945 = 2) Subquery src_view Aggregate [ds#949], [SUM(CAST(key#947, LongType)) AS s1#943L,SUM(CAST(key#950, LongType)) AS s2#944L,ds#949 AS ds#945] Join Inner, Some((ds#946 = ds#949)) MetastoreRelation dw, src_partitioned1, Some(a) MetastoreRelation dw, src_partitioned2, Some(b) scala sql(select * from dw.src_view where ds='2' ).queryExecution.optimizedPlan 15/06/10 19:15:39 INFO ParseDriver: Parsing command: select * from dw.src_view where ds='2' 15/06/10 19:15:39 INFO ParseDriver: Parse Completed 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_view 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_view 15/06/10 19:15:39 INFO ParseDriver: Parsing command: select sum(`a`.`key`) `s1`, sum(`b`.`key`) `s2`, `b`.`ds` `ds` from `dw`.`src_partitioned1` `a` join `dw`.`src_partitioned2` `b` on `a`.`ds` = `b`.`ds` group by `b`.`ds` 15/06/10 19:15:39 INFO ParseDriver: Parse Completed 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_partitioned1 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_partitioned1 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_partitions : db=dw tbl=src_partitioned1 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_partitions : db=dw tbl=src_partitioned1 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_partitioned2 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_partitioned2 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_partitions : db=dw tbl=src_partitioned2 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_partitions : db=dw tbl=src_partitioned2 res257: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = Filter (ds#936 = 2) Aggregate [ds#940], [SUM(CAST(key#938, LongType)) AS s1#934L,SUM(CAST(key#941, LongType)) AS s2#935L,ds#940 AS ds#936] Project [ds#940,key#938,key#941] Join Inner, Some((ds#937 = ds#940)) Project [key#938,ds#937] MetastoreRelation dw, src_partitioned1, Some(a) Project [ds#940,key#941] MetastoreRelation dw, src_partitioned2, Some(b) was (Author: oopsoutofmemory): Sorry [~lian cheng] , see this: scala sql(select * from dw.src_view where ds='2' ).queryExecution.optimizedPlan 15/06/10 19:15:39 INFO ParseDriver: Parsing command: select * from dw.src_view where ds='2' 15/06/10 19:15:39 INFO ParseDriver: Parse Completed 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_view 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_view 15/06/10 19:15:39 INFO ParseDriver: Parsing command: select sum(`a`.`key`) `s1`, sum(`b`.`key`) `s2`, `b`.`ds` `ds` from `dw`.`src_partitioned1` `a` join `dw`.`src_partitioned2` `b` on `a`.`ds` = `b`.`ds` group by `b`.`ds` 15/06/10 19:15:39 INFO ParseDriver: Parse Completed 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_partitioned1 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_partitioned1 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_partitions : db=dw 
tbl=src_partitioned1 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_partitions : db=dw tbl=src_partitioned1 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_partitioned2 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_partitioned2 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_partitions : db=dw tbl=src_partitioned2 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_partitions : db=dw tbl=src_partitioned2 res257: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = Filter (ds#936 = 2) Aggregate [ds#940], [SUM(CAST(key#938, LongType)) AS s1#934L,SUM(CAST(key#941, LongType)) AS s2#935L,ds#940 AS ds#936] Project [ds#940,key#938,key#941] Join Inner, Some((ds#937 = ds#940)) Project [key#938,ds#937] MetastoreRelation dw, src_partitioned1, Some(a) Project [ds#940,key#941] MetastoreRelation dw, src_partitioned2, Some(b) Filter not push down through Subquery or View - Key: SPARK-8287 URL: https://issues.apache.org/jira/browse/SPARK-8287
[jira] [Assigned] (SPARK-8169) Add StopWordsRemover as a transformer
[ https://issues.apache.org/jira/browse/SPARK-8169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8169: --- Assignee: (was: Apache Spark)

Add StopWordsRemover as a transformer
-
Key: SPARK-8169
URL: https://issues.apache.org/jira/browse/SPARK-8169
Project: Spark
Issue Type: New Feature
Components: ML
Affects Versions: 1.5.0
Reporter: Xiangrui Meng

StopWordsRemover takes a string array column and outputs a string array column with all defined stop words removed. The transformer should also come with a standard set of stop words as the default.

{code}
val stopWords = new StopWordsRemover()
  .setInputCol("words")
  .setOutputCol("cleanWords")
  .setStopWords(Array(...)) // optional
val output = stopWords.transform(df)
{code}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8169) Add StopWordsRemover as a transformer
[ https://issues.apache.org/jira/browse/SPARK-8169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580399#comment-14580399 ] Apache Spark commented on SPARK-8169:

User 'hhbyyh' has created a pull request for this issue: https://github.com/apache/spark/pull/6742

Add StopWordsRemover as a transformer
-
Key: SPARK-8169
URL: https://issues.apache.org/jira/browse/SPARK-8169
Project: Spark
Issue Type: New Feature
Components: ML
Affects Versions: 1.5.0
Reporter: Xiangrui Meng

StopWordsRemover takes a string array column and outputs a string array column with all defined stop words removed. The transformer should also come with a standard set of stop words as the default.

{code}
val stopWords = new StopWordsRemover()
  .setInputCol("words")
  .setOutputCol("cleanWords")
  .setStopWords(Array(...)) // optional
val output = stopWords.transform(df)
{code}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8287) Filter not push down through Subquery or View
[ https://issues.apache.org/jira/browse/SPARK-8287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580403#comment-14580403 ] Li Sheng commented on SPARK-8287: - Sorry , see this: scala sql(select * from dw.src_view_1 where my_ds='2' ).queryExecution.optimizedPlan 15/06/10 19:08:59 INFO ParseDriver: Parsing command: select * from dw.src_view_1 where my_ds='2' 15/06/10 19:08:59 INFO ParseDriver: Parse Completed 15/06/10 19:08:59 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_view_1 15/06/10 19:08:59 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_view_1 15/06/10 19:08:59 INFO ParseDriver: Parsing command: select sum(`a`.`key`) `s1`, sum(`b`.`key`) `s2`, `b`.`ds` `my_ds` from `dw`.`src_partitioned1` `a` join `dw`.`src_partitioned2` `b` on `a`.`ds` = `b`.`ds` group by `b`.`ds` 15/06/10 19:08:59 INFO ParseDriver: Parse Completed 15/06/10 19:08:59 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_partitioned1 15/06/10 19:08:59 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_partitioned1 15/06/10 19:08:59 INFO HiveMetaStore: 0: get_partitions : db=dw tbl=src_partitioned1 15/06/10 19:08:59 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_partitions : db=dw tbl=src_partitioned1 15/06/10 19:08:59 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_partitioned2 15/06/10 19:08:59 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_partitioned2 15/06/10 19:08:59 INFO HiveMetaStore: 0: get_partitions : db=dw tbl=src_partitioned2 15/06/10 19:08:59 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_partitions : db=dw tbl=src_partitioned2 res254: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = Filter (my_ds#918 = 2) Aggregate [ds#922], [SUM(CAST(key#920, LongType)) AS s1#916L,SUM(CAST(key#923, LongType)) AS s2#917L,ds#922 AS my_ds#918] Project [ds#922,key#920,key#923] Join Inner, Some((ds#919 = ds#922)) Project [key#920,ds#919] MetastoreRelation dw, src_partitioned1, Some(a) Project [ds#922,key#923] MetastoreRelation dw, src_partitioned2, Some(b) Filter not push down through Subquery or View - Key: SPARK-8287 URL: https://issues.apache.org/jira/browse/SPARK-8287 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1 Reporter: Li Sheng Fix For: 1.4.0 Original Estimate: 40h Remaining Estimate: 40h Filter not push down through Subquery or View. Assume we have two big partitioned table join inner a Subquery or a View and filter not push down, this will cause a full partition join and will cause performance issues. 
Let me give and example that can reproduce the problem: {code:sql} create table src(key int, value string); -- Creates partitioned table and imports data CREATE TABLE src_partitioned1 (key int, value STRING) PARTITIONED BY (ds STRING); insert overwrite table src_partitioned1 PARTITION (ds='1') select key, value from src; insert overwrite table src_partitioned1 PARTITION (ds='2') select key, value from src; CREATE TABLE src_partitioned2 (key int, value STRING) PARTITIONED BY (ds STRING); insert overwrite table src_partitioned2 PARTITION (ds='1') select key, value from src; insert overwrite table src_partitioned2 PARTITION (ds='2') select key, value from src; -- Creates views create view src_view as select sum(a.key) s1, sum(b.key) s2, b.ds ds from src_partitioned1 a join src_partitioned2 b on a.ds = b.ds group by b.ds create view src_view_1 as select sum(a.key) s1, sum(b.key) s2, b.ds my_ds from src_partitioned1 a join src_partitioned2 b on a.ds = b.ds group by b.ds -- QueryExecution select * from dw.src_view where ds='2' {code} {noformat} sql(select * from dw.src_view where ds='2' ).queryExecution == Parsed Logical Plan == 'Project [*] 'Filter ('ds = 2) 'UnresolvedRelation [dw,src_view], None == Analyzed Logical Plan == Project [s1#60L,s2#61L,ds#62] Filter (ds#62 = 2) Subquery src_view Aggregate [ds#66], [SUM(CAST(key#64, LongType)) AS s1#60L,SUM(CAST(key#67, LongType)) AS s2#61L,ds#66 AS ds#62] Join Inner, Some((ds#63 = ds#66)) MetastoreRelation dw, src_partitioned1, Some(a) MetastoreRelation dw, src_partitioned2, Some(b) == Optimized Logical Plan == Filter (ds#62 = 2) Aggregate [ds#66], [SUM(CAST(key#64, LongType)) AS s1#60L,SUM(CAST(key#67, LongType)) AS s2#61L,ds#66 AS ds#62] Project [ds#66,key#64,key#67] Join Inner, Some((ds#63 = ds#66)) Project [key#64,ds#63] MetastoreRelation dw, s... {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail:
[jira] [Comment Edited] (SPARK-8287) Filter not push down through Subquery or View
[ https://issues.apache.org/jira/browse/SPARK-8287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580403#comment-14580403 ] Li Sheng edited comment on SPARK-8287 at 6/10/15 11:17 AM: --- Sorry [~lian cheng] , see this: scala sql(select * from dw.src_view where ds='2' ).queryExecution.optimizedPlan 15/06/10 19:15:39 INFO ParseDriver: Parsing command: select * from dw.src_view where ds='2' 15/06/10 19:15:39 INFO ParseDriver: Parse Completed 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_view 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_view 15/06/10 19:15:39 INFO ParseDriver: Parsing command: select sum(`a`.`key`) `s1`, sum(`b`.`key`) `s2`, `b`.`ds` `ds` from `dw`.`src_partitioned1` `a` join `dw`.`src_partitioned2` `b` on `a`.`ds` = `b`.`ds` group by `b`.`ds` 15/06/10 19:15:39 INFO ParseDriver: Parse Completed 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_partitioned1 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_partitioned1 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_partitions : db=dw tbl=src_partitioned1 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_partitions : db=dw tbl=src_partitioned1 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_partitioned2 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_partitioned2 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_partitions : db=dw tbl=src_partitioned2 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_partitions : db=dw tbl=src_partitioned2 res257: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = Filter (ds#936 = 2) Aggregate [ds#940], [SUM(CAST(key#938, LongType)) AS s1#934L,SUM(CAST(key#941, LongType)) AS s2#935L,ds#940 AS ds#936] Project [ds#940,key#938,key#941] Join Inner, Some((ds#937 = ds#940)) Project [key#938,ds#937] MetastoreRelation dw, src_partitioned1, Some(a) Project [ds#940,key#941] MetastoreRelation dw, src_partitioned2, Some(b) was (Author: oopsoutofmemory): Sorry [~lian cheng] , see this: scala sql(select * from dw.src_view where ds='2' ).queryExecution.optimizedPlan 15/06/10 19:15:39 INFO ParseDriver: Parsing command: select * from dw.src_view where ds='2' 15/06/10 19:15:39 INFO ParseDriver: Parse Completed 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_view 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_view 15/06/10 19:15:39 INFO ParseDriver: Parsing command: select sum(`a`.`key`) `s1`, sum(`b`.`key`) `s2`, `b`.`ds` `ds` from `dw`.`src_partitioned1` `a` join `dw`.`src_partitioned2` `b` on `a`.`ds` = `b`.`ds` group by `b`.`ds` 15/06/10 19:15:39 INFO ParseDriver: Parse Completed 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_partitioned1 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_partitioned1 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_partitions : db=dw tbl=src_partitioned1 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_partitions : db=dw tbl=src_partitioned1 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_partitioned2 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_partitioned2 15/06/10 19:15:39 INFO HiveMetaStore: 0: get_partitions : db=dw tbl=src_partitioned2 15/06/10 19:15:39 INFO audit: ugi=shengli ip=unknown-ip-addr 
cmd=get_partitions : db=dw tbl=src_partitioned2 res257: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = Filter (ds#936 = 2) Aggregate [ds#940], [SUM(CAST(key#938, LongType)) AS s1#934L,SUM(CAST(key#941, LongType)) AS s2#935L,ds#940 AS ds#936] Project [ds#940,key#938,key#941] Join Inner, Some((ds#937 = ds#940)) Project [key#938,ds#937] MetastoreRelation dw, src_partitioned1, Some(a) Project [ds#940,key#941] MetastoreRelation dw, src_partitioned2, Some(b) Another one is for subquery , also can not push down the filter: scala sql(select * from dw.src_view_1 where my_ds='2' ).queryExecution.optimizedPlan 15/06/10 19:08:59 INFO ParseDriver: Parsing command: select * from dw.src_view_1 where my_ds='2' 15/06/10 19:08:59 INFO ParseDriver: Parse Completed 15/06/10 19:08:59 INFO HiveMetaStore: 0: get_table : db=dw tbl=src_view_1 15/06/10 19:08:59 INFO audit: ugi=shengli ip=unknown-ip-addr cmd=get_table : db=dw tbl=src_view_1 15/06/10 19:08:59 INFO ParseDriver: Parsing command: select sum(`a`.`key`) `s1`, sum(`b`.`key`) `s2`, `b`.`ds` `my_ds` from `dw`.`src_partitioned1` `a` join `dw`.`src_partitioned2` `b` on `a`.`ds` =
[jira] [Commented] (SPARK-8287) Filter not push down through Subquery or View
[ https://issues.apache.org/jira/browse/SPARK-8287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580413#comment-14580413 ] Li Sheng commented on SPARK-8287:

I think we cannot simply run `EliminateSubQueries` in the Optimizer; the Subquery scope is useful when doing optimizations that involve Subquery-related logic, and it is useful for this scenario.

Filter not push down through Subquery or View
-
Key: SPARK-8287
URL: https://issues.apache.org/jira/browse/SPARK-8287
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.3.1
Reporter: Li Sheng
Fix For: 1.4.0
Original Estimate: 40h
Remaining Estimate: 40h

Filters are not pushed down through a Subquery or a View. If two big partitioned tables are joined inside a Subquery or a View and the filter is not pushed down, this causes a full-partition join and performance problems.

Let me give an example that reproduces the problem:

{code:sql}
create table src(key int, value string);

-- Creates partitioned table and imports data
CREATE TABLE src_partitioned1 (key int, value STRING) PARTITIONED BY (ds STRING);
insert overwrite table src_partitioned1 PARTITION (ds='1') select key, value from src;
insert overwrite table src_partitioned1 PARTITION (ds='2') select key, value from src;
CREATE TABLE src_partitioned2 (key int, value STRING) PARTITIONED BY (ds STRING);
insert overwrite table src_partitioned2 PARTITION (ds='1') select key, value from src;
insert overwrite table src_partitioned2 PARTITION (ds='2') select key, value from src;

-- Creates views
create view src_view as select sum(a.key) s1, sum(b.key) s2, b.ds ds from src_partitioned1 a join src_partitioned2 b on a.ds = b.ds group by b.ds;
create view src_view_1 as select sum(a.key) s1, sum(b.key) s2, b.ds my_ds from src_partitioned1 a join src_partitioned2 b on a.ds = b.ds group by b.ds;

-- QueryExecution
select * from dw.src_view where ds='2';
{code}

{noformat}
sql("select * from dw.src_view where ds='2'").queryExecution

== Parsed Logical Plan ==
'Project [*]
 'Filter ('ds = 2)
  'UnresolvedRelation [dw,src_view], None

== Analyzed Logical Plan ==
Project [s1#60L,s2#61L,ds#62]
 Filter (ds#62 = 2)
  Subquery src_view
   Aggregate [ds#66], [SUM(CAST(key#64, LongType)) AS s1#60L,SUM(CAST(key#67, LongType)) AS s2#61L,ds#66 AS ds#62]
    Join Inner, Some((ds#63 = ds#66))
     MetastoreRelation dw, src_partitioned1, Some(a)
     MetastoreRelation dw, src_partitioned2, Some(b)

== Optimized Logical Plan ==
Filter (ds#62 = 2)
 Aggregate [ds#66], [SUM(CAST(key#64, LongType)) AS s1#60L,SUM(CAST(key#67, LongType)) AS s2#61L,ds#66 AS ds#62]
  Project [ds#66,key#64,key#67]
   Join Inner, Some((ds#63 = ds#66))
    Project [key#64,ds#63]
     MetastoreRelation dw, s...
{noformat}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
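For context, the rule under discussion is essentially the following (paraphrasing the 1.3/1.4-era Catalyst code): it strips every Subquery wrapper wholesale, which discards exactly the scope information the comment above wants to preserve.

{code}
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Subquery}

// Roughly what EliminateSubQueries does: replace each Subquery node with
// its child, erasing the subquery scope from the plan.
object EliminateSubQueries extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case Subquery(_, child) => child
  }
}
{code}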
[jira] [Comment Edited] (SPARK-8275) HistoryServer caches incomplete App UIs
[ https://issues.apache.org/jira/browse/SPARK-8275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14579273#comment-14579273 ] Steve Loughran edited comment on SPARK-8275 at 6/10/15 11:43 AM:

The cache code is from Google; the history server provides a method to get the data for an entry, but there's no logic in the cache itself to apply a refresh time to entries. One solution would be:
# cache entries to include a timestamp and completed flag alongside the SparkUI instances
# direct all cache.get operations through a single method in HistoryServer
# have that method do something like

{code}
def getUI(id: String): SparkUI = {
  var cacheEntry = cache.get(id)
  if (!cacheEntry.completed && (cacheEntry.timestamp + expiryTime) < now()) {
    cache.release(id)
    cacheEntry = cache.get(id)
  }
  cacheEntry
}
{code}

This will leave out-of-date entries in the cache, but any retrieval will trigger the rebuild.

was (Author: ste...@apache.org):
The cache code is from Google; the history server provides a method to get the data for an entry, but there's no logic in the cache itself to apply a refresh time to entries. One solution would be:
# cache entries to include a timestamp and completed flag alongside the SparkUI instances
# direct all cache.get operations through a single method in HistoryServer
# have that method do something like

{code}
def getUI(id: String): SparkUI = {
  var cacheEntry = cache.get(id)
  if (!cacheEntry.completed && (cacheEntry.timestamp + expiryTime) < now()) {
    cache.release(id)
    cacheEntry = cache.get(id)
  }
  cacheEntry
}
{code}

This will leave out-of-date entries in the cache, but any retrieval will trigger the rebuild.

HistoryServer caches incomplete App UIs
-
Key: SPARK-8275
URL: https://issues.apache.org/jira/browse/SPARK-8275
Project: Spark
Issue Type: Bug
Components: Web UI
Affects Versions: 1.3.1
Reporter: Steve Loughran

The history server caches applications retrieved from the {{ApplicationHistoryProvider.getAppUI()}} call for performance: it's expensive to rebuild. However, this cache also includes incomplete applications as well as completed ones, and it never attempts to refresh an incomplete application. As a result, if you do a GET of the history of a running application, even after the application is finished you'll still get the web UI/history as it was when that first GET was issued.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
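One possible shape for the enriched cache entry from step 1 above (names are illustrative, not from an actual patch):

{code}
// Hypothetical cache value pairing the UI with the metadata needed to
// decide whether a cached, still-incomplete application is stale.
case class CacheEntry(ui: SparkUI, completed: Boolean, timestamp: Long)
{code}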
[jira] [Issue Comment Deleted] (SPARK-1537) Add integration with Yarn's Application Timeline Server
[ https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated SPARK-1537: -- Comment: was deleted (was: Full application log. Application hasn't actually stopped, which is interesting. {code} $ dist/bin/spark-submit \ --class org.apache.spark.examples.SparkPi \ --properties-file ../clusterconfigs/clusters/devix/spark/spark-defaults.conf \ --master yarn-client \ --executor-memory 128m \ --num-executors 1 \ --executor-cores 1 \ --driver-memory 128m \ dist/lib/spark-examples-1.5.0-SNAPSHOT-hadoop2.6.0.jar 12 2015-06-09 17:01:59,596 [main] INFO spark.SparkContext (Logging.scala:logInfo(59)) - Running Spark version 1.5.0-SNAPSHOT 2015-06-09 17:02:01,309 [sparkDriver-akka.actor.default-dispatcher-2] INFO slf4j.Slf4jLogger (Slf4jLogger.scala:applyOrElse(80)) - Slf4jLogger started 2015-06-09 17:02:01,359 [sparkDriver-akka.actor.default-dispatcher-2] INFO Remoting (Slf4jLogger.scala:apply$mcV$sp(74)) - Starting remoting 2015-06-09 17:02:01,542 [sparkDriver-akka.actor.default-dispatcher-2] INFO Remoting (Slf4jLogger.scala:apply$mcV$sp(74)) - Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.1.86:51476] 2015-06-09 17:02:01,549 [main] INFO util.Utils (Logging.scala:logInfo(59)) - Successfully started service 'sparkDriver' on port 51476. 2015-06-09 17:02:01,568 [main] INFO spark.SparkEnv (Logging.scala:logInfo(59)) - Registering MapOutputTracker 2015-06-09 17:02:01,587 [main] INFO spark.SparkEnv (Logging.scala:logInfo(59)) - Registering BlockManagerMaster 2015-06-09 17:02:01,831 [main] INFO spark.HttpServer (Logging.scala:logInfo(59)) - Starting HTTP Server 2015-06-09 17:02:01,891 [main] INFO util.Utils (Logging.scala:logInfo(59)) - Successfully started service 'HTTP file server' on port 51477. 2015-06-09 17:02:01,905 [main] INFO spark.SparkEnv (Logging.scala:logInfo(59)) - Registering OutputCommitCoordinator 2015-06-09 17:02:02,038 [main] INFO util.Utils (Logging.scala:logInfo(59)) - Successfully started service 'SparkUI' on port 4040. 
2015-06-09 17:02:02,039 [main] INFO ui.SparkUI (Logging.scala:logInfo(59)) - Started SparkUI at http://192.168.1.86:4040 2015-06-09 17:02:03,071 [main] INFO spark.SparkContext (Logging.scala:logInfo(59)) - Added JAR file:/Users/stevel/Projects/Hortonworks/Projects/sparkwork/spark/dist/lib/spark-examples-1.5.0-SNAPSHOT-hadoop2.6.0.jar at http://192.168.1.86:51477/jars/spark-examples-1.5.0-SNAPSHOT-hadoop2.6.0.jar with timestamp 1433865723062 2015-06-09 17:02:03,691 [main] INFO impl.TimelineClientImpl (TimelineClientImpl.java:serviceInit(285)) - Timeline service address: http://devix.cotham.uk:8188/ws/v1/timeline/ 2015-06-09 17:02:03,808 [main] INFO client.RMProxy (RMProxy.java:createRMProxy(98)) - Connecting to ResourceManager at devix.cotham.uk/192.168.1.134:8050 2015-06-09 17:02:04,577 [main] INFO yarn.Client (Logging.scala:logInfo(59)) - Requesting a new application from cluster with 1 NodeManagers 2015-06-09 17:02:04,637 [main] INFO yarn.Client (Logging.scala:logInfo(59)) - Verifying our application has not requested more than the maximum memory capability of the cluster (2048 MB per container) 2015-06-09 17:02:04,637 [main] INFO yarn.Client (Logging.scala:logInfo(59)) - Will allocate AM container, with 896 MB memory including 384 MB overhead 2015-06-09 17:02:04,638 [main] INFO yarn.Client (Logging.scala:logInfo(59)) - Setting up container launch context for our AM 2015-06-09 17:02:04,643 [main] INFO yarn.Client (Logging.scala:logInfo(59)) - Preparing resources for our AM container 2015-06-09 17:02:05,096 [main] WARN shortcircuit.DomainSocketFactory (DomainSocketFactory.java:init(116)) - The short-circuit local reads feature cannot be used because libhadoop cannot be loaded. 2015-06-09 17:02:05,106 [main] DEBUG yarn.YarnSparkHadoopUtil (Logging.scala:logDebug(63)) - delegation token renewer is: rm/devix.cotham.uk@COTHAM 2015-06-09 17:02:05,107 [main] INFO yarn.YarnSparkHadoopUtil (Logging.scala:logInfo(59)) - getting token for namenode: hdfs://devix.cotham.uk:8020/user/stevel/.sparkStaging/application_1433777033372_0005 2015-06-09 17:02:06,129 [main] DEBUG yarn.Client (Logging.scala:logDebug(63)) - HiveMetaStore configured in localmode 2015-06-09 17:02:06,130 [main] DEBUG yarn.Client (Logging.scala:logDebug(63)) - HBase Class not found: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.HBaseConfiguration 2015-06-09 17:02:06,225 [main] INFO yarn.Client
[jira] [Commented] (SPARK-7756) Ensure Spark runs clean on IBM Java implementation
[ https://issues.apache.org/jira/browse/SPARK-7756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580237#comment-14580237 ] Apache Spark commented on SPARK-7756: - User 'a-roberts' has created a pull request for this issue: https://github.com/apache/spark/pull/6740 Ensure Spark runs clean on IBM Java implementation -- Key: SPARK-7756 URL: https://issues.apache.org/jira/browse/SPARK-7756 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Tim Ellison Assignee: Tim Ellison Priority: Minor Fix For: 1.4.0 Spark should run successfully on the IBM Java implementation. This issue is to gather any minor issues seen running the tests and examples that are attributable to differences in Java vendor. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8289) Provide a specific stack size with all Java implementations to prevent stack overflows with certain tests
Adam Roberts created SPARK-8289: --- Summary: Provide a specific stack size with all Java implementations to prevent stack overflows with certain tests Key: SPARK-8289 URL: https://issues.apache.org/jira/browse/SPARK-8289 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 1.5.0 Environment: Anywhere whereby the Java vendor is not OpenJDK Reporter: Adam Roberts Fix For: 1.5.0 Default stack sizes differ per Java implementation - so tests can pass for those with higher stack sizes (OpenJDK) but will fail with Oracle or IBM Java owing to lower default sizes. In particular we can see this happening with the JavaALSSuite - with 15 iterations, we get stackoverflow errors with Oracle and IBM Java. We don't with OpenJDK. This JIRA aims to address such an issue by providing a default specified stack size to be used for all Java distributions: 4096k specified for both SBT test args and for Maven test args (changing project/ScalaBuild.scala and pom.xml respectively). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
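For illustration, a minimal sketch of how the fixed stack size could be wired into the SBT build (the exact setting names in Spark's build files may differ; {{-Xss4096k}} is the JVM flag in question, and the Maven side would pass the same flag through the test runner's argLine):
{code}
// Hypothetical SBT (Scala) build fragment: fork test JVMs and pin their
// stack size so results do not depend on the Java vendor's default.
fork in Test := true
javaOptions in Test += "-Xss4096k"
{code}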
[jira] [Commented] (SPARK-8288) ScalaReflection should also try apply methods defined in companion objects when inferring schema from a Product type
[ https://issues.apache.org/jira/browse/SPARK-8288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580244#comment-14580244 ] Cheng Lian commented on SPARK-8288: --- cc [~zzztimbo] ScalaReflection should also try apply methods defined in companion objects when inferring schema from a Product type Key: SPARK-8288 URL: https://issues.apache.org/jira/browse/SPARK-8288 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Cheng Lian This ticket is derived from PARQUET-293 (which actually describes a Spark SQL issue). My comment on that issue is quoted below: {quote} ... The reason for this exception is that the Scala code Scrooge generates is actually a trait extending {{Product}}: {code} trait Junk extends ThriftStruct with scala.Product2[Long, String] with java.io.Serializable {code} while Spark expects a case class, something like: {code} case class Junk(junkID: Long, junkString: String) {code} The key difference here is that the latter case class version has a constructor whose arguments can be transformed into fields of the DataFrame schema. The exception was thrown because Spark can't find such a constructor for trait {{Junk}}. {quote} We can make {{ScalaReflection}} try {{apply}} methods in companion objects, so that trait types generated by Scrooge can also be used for Spark SQL schema inference. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
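To make the proposal concrete, here is a minimal sketch (not Scrooge's actual generated code) of a {{Product}} trait plus a companion-object {{apply}} whose parameter list carries the same information as a case-class constructor:
{code}
// Hypothetical sketch: schema inference could read the apply parameters
// (junkID: Long, junkString: String) just as it reads a case-class constructor.
trait Junk extends Product2[Long, String] with Serializable {
  def junkID: Long
  def junkString: String
  def _1: Long = junkID
  def _2: String = junkString
  override def canEqual(that: Any): Boolean = that.isInstanceOf[Junk]
}

object Junk {
  def apply(junkID: Long, junkString: String): Junk = {
    val (id, str) = (junkID, junkString)
    new Junk {
      def junkID: Long = id
      def junkString: String = str
    }
  }
}
{code}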
[jira] [Commented] (SPARK-8286) Rewrite UTF8String in Java and move it into unsafe package.
[ https://issues.apache.org/jira/browse/SPARK-8286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580160#comment-14580160 ] Apache Spark commented on SPARK-8286: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/6738 Rewrite UTF8String in Java and move it into unsafe package. --- Key: SPARK-8286 URL: https://issues.apache.org/jira/browse/SPARK-8286 Project: Spark Issue Type: Bug Reporter: Reynold Xin Assignee: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8286) Rewrite UTF8String in Java and move it into unsafe package.
[ https://issues.apache.org/jira/browse/SPARK-8286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8286: --- Assignee: Reynold Xin (was: Apache Spark) Rewrite UTF8String in Java and move it into unsafe package. --- Key: SPARK-8286 URL: https://issues.apache.org/jira/browse/SPARK-8286 Project: Spark Issue Type: Bug Reporter: Reynold Xin Assignee: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8286) Rewrite UTF8String in Java and move it into unsafe package.
[ https://issues.apache.org/jira/browse/SPARK-8286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8286: --- Assignee: Apache Spark (was: Reynold Xin) Rewrite UTF8String in Java and move it into unsafe package. --- Key: SPARK-8286 URL: https://issues.apache.org/jira/browse/SPARK-8286 Project: Spark Issue Type: Bug Reporter: Reynold Xin Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7886) Add built-in expressions to FunctionRegistry
[ https://issues.apache.org/jira/browse/SPARK-7886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-7886: --- Assignee: Reynold Xin Add built-in expressions to FunctionRegistry Key: SPARK-7886 URL: https://issues.apache.org/jira/browse/SPARK-7886 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Priority: Blocker Fix For: 1.5.0 Once we do this, we no longer need to hardcode expressions into the parser (both for internal SQL and Hive QL). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7886) Add built-in expressions to FunctionRegistry
[ https://issues.apache.org/jira/browse/SPARK-7886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580175#comment-14580175 ] Apache Spark commented on SPARK-7886: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/6739 Add built-in expressions to FunctionRegistry Key: SPARK-7886 URL: https://issues.apache.org/jira/browse/SPARK-7886 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Priority: Blocker Fix For: 1.5.0 Once we do this, we no longer need to hardcode expressions into the parser (both for internal SQL and Hive QL). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4176) Support decimals with precision > 18 in Parquet
[ https://issues.apache.org/jira/browse/SPARK-4176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580206#comment-14580206 ] Rene Treffer commented on SPARK-4176: - Given that SPARK-7897 loads unsigned bigint as DecimalType.unlimited, you can now load tables with unsigned bigint but won't be able to store them in Parquet files :-( Support decimals with precision > 18 in Parquet --- Key: SPARK-4176 URL: https://issues.apache.org/jira/browse/SPARK-4176 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 1.3.0, 1.3.1, 1.4.0 Reporter: Matei Zaharia Assignee: Cheng Lian After https://issues.apache.org/jira/browse/SPARK-3929, only decimals with precisions <= 18 (that can be read into a Long) will be readable from Parquet, so we still need more work to support these larger ones. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-4127) Streaming Linear Regression- Python bindings
[ https://issues.apache.org/jira/browse/SPARK-4127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-4127: --- Assignee: Apache Spark Streaming Linear Regression- Python bindings Key: SPARK-4127 URL: https://issues.apache.org/jira/browse/SPARK-4127 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Anant Daksh Asthana Assignee: Apache Spark Create Python bindings for Streaming Linear Regression (MLlib). The MLlib file relevant to this issue can be found at: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/StreamingLinearRegression.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-4127) Streaming Linear Regression- Python bindings
[ https://issues.apache.org/jira/browse/SPARK-4127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-4127: --- Assignee: (was: Apache Spark) Streaming Linear Regression- Python bindings Key: SPARK-4127 URL: https://issues.apache.org/jira/browse/SPARK-4127 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Anant Daksh Asthana Create Python bindings for Streaming Linear Regression (MLlib). The MLlib file relevant to this issue can be found at: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/StreamingLinearRegression.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4127) Streaming Linear Regression- Python bindings
[ https://issues.apache.org/jira/browse/SPARK-4127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580507#comment-14580507 ] Apache Spark commented on SPARK-4127: - User 'MechCoder' has created a pull request for this issue: https://github.com/apache/spark/pull/6744 Streaming Linear Regression- Python bindings Key: SPARK-4127 URL: https://issues.apache.org/jira/browse/SPARK-4127 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Anant Daksh Asthana Create Python bindings for Streaming Linear Regression (MLlib). The MLlib file relevant to this issue can be found at: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/StreamingLinearRegression.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
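For reference, the existing Scala API that the Python bindings would mirror looks roughly like this sketch (the stream arguments and the feature count are made up for illustration):
{code}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}
import org.apache.spark.streaming.dstream.DStream

def runStreamingRegression(trainingStream: DStream[LabeledPoint],
                           testStream: DStream[Vector]): Unit = {
  val model = new StreamingLinearRegressionWithSGD()
    .setInitialWeights(Vectors.zeros(3))  // 3 features, for illustration
  model.trainOn(trainingStream)           // update the model on each training batch
  model.predictOn(testStream).print()     // emit a prediction per test batch
}
{code}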
[jira] [Assigned] (SPARK-8289) Provide a specific stack size with all Java implementations to prevent stack overflows with certain tests
[ https://issues.apache.org/jira/browse/SPARK-8289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8289: --- Assignee: (was: Apache Spark) Provide a specific stack size with all Java implementations to prevent stack overflows with certain tests - Key: SPARK-8289 URL: https://issues.apache.org/jira/browse/SPARK-8289 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 1.5.0 Environment: Anywhere whereby the Java vendor is not OpenJDK Reporter: Adam Roberts Fix For: 1.5.0 Default stack sizes differ per Java implementation - so tests can pass for those with higher stack sizes (OpenJDK) but will fail with Oracle or IBM Java owing to lower default sizes. In particular we can see this happening with the JavaALSSuite - with 15 iterations, we get stackoverflow errors with Oracle and IBM Java. We don't with OpenJDK. This JIRA aims to address such an issue by providing a default specified stack size to be used for all Java distributions: 4096k specified for both SBT test args and for Maven test args (changing project/ScalaBuild.scala and pom.xml respectively). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8289) Provide a specific stack size with all Java implementations to prevent stack overflows with certain tests
[ https://issues.apache.org/jira/browse/SPARK-8289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8289: --- Assignee: Apache Spark Provide a specific stack size with all Java implementations to prevent stack overflows with certain tests - Key: SPARK-8289 URL: https://issues.apache.org/jira/browse/SPARK-8289 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 1.5.0 Environment: Anywhere whereby the Java vendor is not OpenJDK Reporter: Adam Roberts Assignee: Apache Spark Fix For: 1.5.0 Default stack sizes differ per Java implementation - so tests can pass for those with higher stack sizes (OpenJDK) but will fail with Oracle or IBM Java owing to lower default sizes. In particular we can see this happening with the JavaALSSuite - with 15 iterations, we get stackoverflow errors with Oracle and IBM Java. We don't with OpenJDK. This JIRA aims to address such an issue by providing a default specified stack size to be used for all Java distributions: 4096k specified for both SBT test args and for Maven test args (changing project/ScalaBuild.scala and pom.xml respectively). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7815) Enable UTF8String to work against memory address directly
[ https://issues.apache.org/jira/browse/SPARK-7815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-7815: --- Summary: Enable UTF8String to work against memory address directly (was: Move UTF8String into Unsafe java package, and have it work against memory address directly) Enable UTF8String to work against memory address directly - Key: SPARK-7815 URL: https://issues.apache.org/jira/browse/SPARK-7815 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Davies Liu So we can avoid an extra copy of the data into a byte array. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8286) Rewrite UTF8String in Java and move it into unsafe package.
Reynold Xin created SPARK-8286: -- Summary: Rewrite UTF8String in Java and move it into unsafe package. Key: SPARK-8286 URL: https://issues.apache.org/jira/browse/SPARK-8286 Project: Spark Issue Type: Bug Reporter: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8290) spark class command builder need read SPARK_JAVA_OPTS
Tao Wang created SPARK-8290: --- Summary: spark class command builder need read SPARK_JAVA_OPTS Key: SPARK-8290 URL: https://issues.apache.org/jira/browse/SPARK-8290 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Tao Wang Priority: Minor SPARK_JAVA_OPTS was missed in reconstructing the launcher part; we should add it back so spark-class can read it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8286) Rewrite UTF8String in Java and move it into unsafe package.
[ https://issues.apache.org/jira/browse/SPARK-8286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin reassigned SPARK-8286: -- Assignee: Reynold Xin Rewrite UTF8String in Java and move it into unsafe package. --- Key: SPARK-8286 URL: https://issues.apache.org/jira/browse/SPARK-8286 Project: Spark Issue Type: Bug Reporter: Reynold Xin Assignee: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8288) ScalaReflection should also try apply methods defined in companion objects when inferring schema from a Product type
Cheng Lian created SPARK-8288: - Summary: ScalaReflection should also try apply methods defined in companion objects when inferring schema from a Product type Key: SPARK-8288 URL: https://issues.apache.org/jira/browse/SPARK-8288 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Cheng Lian This ticket is derived from PARQUET-293 (which actually describes a Spark SQL issue). My comment on that issue is quoted below: {quote} ... The reason for this exception is that the Scala code Scrooge generates is actually a trait extending {{Product}}: {code} trait Junk extends ThriftStruct with scala.Product2[Long, String] with java.io.Serializable {code} while Spark expects a case class, something like: {code} case class Junk(junkID: Long, junkString: String) {code} The key difference here is that the latter case class version has a constructor whose arguments can be transformed into fields of the DataFrame schema. The exception was thrown because Spark can't find such a constructor for trait {{Junk}}. {quote} We can make {{ScalaReflection}} try {{apply}} methods in companion objects, so that trait types generated by Scrooge can also be used for Spark SQL schema inference. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8287) Filter not push down through Subquery or View
[ https://issues.apache.org/jira/browse/SPARK-8287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580257#comment-14580257 ] Li Sheng commented on SPARK-8287: - [~marmbrus] [~lian cheng] Filter not push down through Subquery or View - Key: SPARK-8287 URL: https://issues.apache.org/jira/browse/SPARK-8287 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1 Reporter: Li Sheng Fix For: 1.4.0 Original Estimate: 40h Remaining Estimate: 40h Filter is not pushed down through a Subquery or View. Assume we have two big partitioned tables joined inside a Subquery or a View and the filter is not pushed down; this will cause a full-partition join and performance issues. Let me give an example that can reproduce the problem: create table src(key int, value string); -- Creates partitioned tables and imports data CREATE TABLE src_partitioned1 (key int, value STRING) PARTITIONED BY (ds STRING); insert overwrite table src_partitioned1 PARTITION (ds='1') select key, value from src; insert overwrite table src_partitioned1 PARTITION (ds='2') select key, value from src; CREATE TABLE src_partitioned2 (key int, value STRING) PARTITIONED BY (ds STRING); insert overwrite table src_partitioned2 PARTITION (ds='1') select key, value from src; insert overwrite table src_partitioned2 PARTITION (ds='2') select key, value from src; -- Creates views create view src_view as select sum(a.key) s1, sum(b.key) s2, b.ds ds from src_partitioned1 a join src_partitioned2 b on a.ds = b.ds group by b.ds create view src_view_1 as select sum(a.key) s1, sum(b.key) s2, b.ds my_ds from src_partitioned1 a join src_partitioned2 b on a.ds = b.ds group by b.ds -- QueryExecution select * from dw.src_view where ds='2' sql(select * from dw.src_view where ds='2' ).queryExecution == Parsed Logical Plan == 'Project [*] 'Filter ('ds = 2) 'UnresolvedRelation [dw,src_view], None == Analyzed Logical Plan == Project [s1#60L,s2#61L,ds#62] Filter (ds#62 = 2) Subquery src_view Aggregate [ds#66], [SUM(CAST(key#64, LongType)) AS s1#60L,SUM(CAST(key#67, LongType)) AS s2#61L,ds#66 AS ds#62] Join Inner, Some((ds#63 = ds#66)) MetastoreRelation dw, src_partitioned1, Some(a) MetastoreRelation dw, src_partitioned2, Some(b) == Optimized Logical Plan == Filter (ds#62 = 2) Aggregate [ds#66], [SUM(CAST(key#64, LongType)) AS s1#60L,SUM(CAST(key#67, LongType)) AS s2#61L,ds#66 AS ds#62] Project [ds#66,key#64,key#67] Join Inner, Some((ds#63 = ds#66)) Project [key#64,ds#63] MetastoreRelation dw, s... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8287) Filter not push down through Subquery or View
Li Sheng created SPARK-8287: --- Summary: Filter not push down through Subquery or View Key: SPARK-8287 URL: https://issues.apache.org/jira/browse/SPARK-8287 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1 Reporter: Li Sheng Fix For: 1.4.0 Filter is not pushed down through a Subquery or View. Assume we have two big partitioned tables joined inside a Subquery or a View and the filter is not pushed down; this will cause a full-partition join and performance issues. Let me give an example that can reproduce the problem: create table src(key int, value string); -- Creates partitioned tables and imports data CREATE TABLE src_partitioned1 (key int, value STRING) PARTITIONED BY (ds STRING); insert overwrite table src_partitioned1 PARTITION (ds='1') select key, value from src; insert overwrite table src_partitioned1 PARTITION (ds='2') select key, value from src; CREATE TABLE src_partitioned2 (key int, value STRING) PARTITIONED BY (ds STRING); insert overwrite table src_partitioned2 PARTITION (ds='1') select key, value from src; insert overwrite table src_partitioned2 PARTITION (ds='2') select key, value from src; -- Creates views create view src_view as select sum(a.key) s1, sum(b.key) s2, b.ds ds from src_partitioned1 a join src_partitioned2 b on a.ds = b.ds group by b.ds create view src_view_1 as select sum(a.key) s1, sum(b.key) s2, b.ds my_ds from src_partitioned1 a join src_partitioned2 b on a.ds = b.ds group by b.ds -- QueryExecution select * from dw.src_view where ds='2' sql(select * from dw.src_view where ds='2' ).queryExecution == Parsed Logical Plan == 'Project [*] 'Filter ('ds = 2) 'UnresolvedRelation [dw,src_view], None == Analyzed Logical Plan == Project [s1#60L,s2#61L,ds#62] Filter (ds#62 = 2) Subquery src_view Aggregate [ds#66], [SUM(CAST(key#64, LongType)) AS s1#60L,SUM(CAST(key#67, LongType)) AS s2#61L,ds#66 AS ds#62] Join Inner, Some((ds#63 = ds#66)) MetastoreRelation dw, src_partitioned1, Some(a) MetastoreRelation dw, src_partitioned2, Some(b) == Optimized Logical Plan == Filter (ds#62 = 2) Aggregate [ds#66], [SUM(CAST(key#64, LongType)) AS s1#60L,SUM(CAST(key#67, LongType)) AS s2#61L,ds#66 AS ds#62] Project [ds#66,key#64,key#67] Join Inner, Some((ds#63 = ds#66)) Project [key#64,ds#63] MetastoreRelation dw, s... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
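Until the optimizer learns to push such predicates through a Subquery or View, a manual workaround (a sketch against the schema above) is to write the partition predicate on the base tables yourself instead of filtering the view from outside:
{code}
import org.apache.spark.sql.hive.HiveContext

// Hypothetical workaround: the predicate lands directly on the partitioned
// MetastoreRelations, so only partition ds='2' is scanned.
def pushedDown(hiveContext: HiveContext) = hiveContext.sql("""
  SELECT sum(a.key) s1, sum(b.key) s2, b.ds ds
  FROM src_partitioned1 a JOIN src_partitioned2 b ON a.ds = b.ds
  WHERE a.ds = '2' AND b.ds = '2'
  GROUP BY b.ds
""")
{code}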
[jira] [Assigned] (SPARK-8290) spark class command builder need read SPARK_JAVA_OPTS
[ https://issues.apache.org/jira/browse/SPARK-8290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8290: --- Assignee: (was: Apache Spark) spark class command builder need read SPARK_JAVA_OPTS - Key: SPARK-8290 URL: https://issues.apache.org/jira/browse/SPARK-8290 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Tao Wang Priority: Minor SPARK_JAVA_OPTS was missed in reconstructing the launcher part; we should add it back so spark-class can read it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8290) spark class command builder need read SPARK_JAVA_OPTS
[ https://issues.apache.org/jira/browse/SPARK-8290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8290: --- Assignee: Apache Spark spark class command builder need read SPARK_JAVA_OPTS - Key: SPARK-8290 URL: https://issues.apache.org/jira/browse/SPARK-8290 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Tao Wang Assignee: Apache Spark Priority: Minor SPARK_JAVA_OPTS was missed in reconstructing the launcher part; we should add it back so spark-class can read it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8290) spark class command builder need read SPARK_JAVA_OPTS
[ https://issues.apache.org/jira/browse/SPARK-8290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580290#comment-14580290 ] Apache Spark commented on SPARK-8290: - User 'WangTaoTheTonic' has created a pull request for this issue: https://github.com/apache/spark/pull/6741 spark class command builder need read SPARK_JAVA_OPTS - Key: SPARK-8290 URL: https://issues.apache.org/jira/browse/SPARK-8290 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Tao Wang Priority: Minor SPARK_JAVA_OPTS was missed in reconstructing the launcher part; we should add it back so spark-class can read it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8291) Add parse functionality to LabeledPoint in PySpark
Manoj Kumar created SPARK-8291: -- Summary: Add parse functionality to LabeledPoint in PySpark Key: SPARK-8291 URL: https://issues.apache.org/jira/browse/SPARK-8291 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Reporter: Manoj Kumar Priority: Minor It is useful to have functionality that can parse a string into a LabeledPoint while loading files, etc -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8292) ShortestPaths run with error result
Bruce Chen created SPARK-8292: - Summary: ShortestPaths run with error result Key: SPARK-8292 URL: https://issues.apache.org/jira/browse/SPARK-8292 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.3.1 Environment: Ubuntu 64bit Reporter: Bruce Chen Fix For: 1.3.1 In graphx/lib/ShortestPaths, I run an example with input data: 0\t2 0\t4 2\t3 3\t6 4\t2 4\t5 5\t3 5\t6 Then I write a function that sets point '0' as the source point and calculates the shortest paths from point 0 to the other points; the code looks like this: val source: Seq[VertexId] = Seq(0) val ss = ShortestPaths.run(graph, source) Then I get the shortest-path result for every vertex: (4,Map()) (0,Map(0 -> 0)) (6,Map()) (3,Map()) (5,Map()) (2,Map()) but the right result should be: (4,Map(0 -> 1)) (0,Map(0 -> 0)) (6,Map(0 -> 3)) (3,Map(0 -> 2)) (5,Map(0 -> 2)) (2,Map(0 -> 1)) So I checked the source code of spark/graphx/src/main/scala/org/apache/spark/graphx/lib/ShortestPaths.scala and found a bug. The patch is attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
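A self-contained reproduction of the report might look like this sketch ({{edges.txt}} is a hypothetical file holding the tab-separated pairs above):
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{GraphLoader, VertexId}
import org.apache.spark.graphx.lib.ShortestPaths

object ShortestPathsRepro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("ShortestPathsRepro").setMaster("local[2]"))
    // edges.txt contains the pairs from the report: "0\t2", "0\t4", ...
    val graph = GraphLoader.edgeListFile(sc, "edges.txt")
    val landmarks: Seq[VertexId] = Seq(0L)
    // Expected (per the report): (4,Map(0 -> 1)), (6,Map(0 -> 3)), ...
    ShortestPaths.run(graph, landmarks).vertices.collect().foreach(println)
    sc.stop()
  }
}
{code}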
[jira] [Assigned] (SPARK-8291) Add parse functionality to LabeledPoint in PySpark
[ https://issues.apache.org/jira/browse/SPARK-8291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8291: --- Assignee: Apache Spark Add parse functionality to LabeledPoint in PySpark -- Key: SPARK-8291 URL: https://issues.apache.org/jira/browse/SPARK-8291 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Reporter: Manoj Kumar Assignee: Apache Spark Priority: Minor It is useful to have functionality that can parse a string into a LabeledPoint while loading files, etc -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7179) Add pattern after show tables to filter desire tablename
[ https://issues.apache.org/jira/browse/SPARK-7179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580527#comment-14580527 ] Apache Spark commented on SPARK-7179: - User 'baishuo' has created a pull request for this issue: https://github.com/apache/spark/pull/6745 Add pattern after show tables to filter desire tablename -- Key: SPARK-7179 URL: https://issues.apache.org/jira/browse/SPARK-7179 Project: Spark Issue Type: New Feature Components: SQL Reporter: baishuo Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
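For illustration, the Hive behaviour this feature would mimic looks like the following sketch (assumed syntax: in Hive's patterns '*' matches any characters and '|' separates alternatives; {{sqlContext}} is an existing HiveContext):
{code}
// With the feature, only tables matching the pattern should be returned.
sqlContext.sql("SHOW TABLES 'src_*'").show()
sqlContext.sql("SHOW TABLES 'src_partitioned1|src_partitioned2'").show()
{code}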
[jira] [Assigned] (SPARK-7179) Add pattern after show tables to filter desire tablename
[ https://issues.apache.org/jira/browse/SPARK-7179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7179: --- Assignee: (was: Apache Spark) Add pattern after show tables to filter desire tablename -- Key: SPARK-7179 URL: https://issues.apache.org/jira/browse/SPARK-7179 Project: Spark Issue Type: New Feature Components: SQL Reporter: baishuo Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8291) Add parse functionality to LabeledPoint in PySpark
[ https://issues.apache.org/jira/browse/SPARK-8291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8291: --- Assignee: (was: Apache Spark) Add parse functionality to LabeledPoint in PySpark -- Key: SPARK-8291 URL: https://issues.apache.org/jira/browse/SPARK-8291 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Reporter: Manoj Kumar Priority: Minor It is useful to have functionality that can parse a string into a LabeledPoint while loading files, etc -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8291) Add parse functionality to LabeledPoint in PySpark
[ https://issues.apache.org/jira/browse/SPARK-8291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580546#comment-14580546 ] Apache Spark commented on SPARK-8291: - User 'MechCoder' has created a pull request for this issue: https://github.com/apache/spark/pull/6746 Add parse functionality to LabeledPoint in PySpark -- Key: SPARK-8291 URL: https://issues.apache.org/jira/browse/SPARK-8291 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Reporter: Manoj Kumar Priority: Minor It is useful to have functionality that can parse a string into a LabeledPoint while loading files, etc -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
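For reference, the Scala side of MLlib already exposes this; the PySpark version would mirror something like the following sketch (the string format shown is the one produced by {{LabeledPoint.toString}}):
{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val lp = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
val s = lp.toString                    // "(1.0,[1.0,0.0,3.0])"
val roundTrip = LabeledPoint.parse(s)  // parses the string back into a LabeledPoint
{code}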
[jira] [Updated] (SPARK-8292) ShortestPaths run with error result
[ https://issues.apache.org/jira/browse/SPARK-8292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8292: - Priority: Minor (was: Major) Target Version/s: (was: 1.3.1) Fix Version/s: (was: 1.3.1) [~hangc0276] Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark This is not how you propose a patch and you've set some fields that shouldn't be here. You should also explain the fix. ShortestPaths run with error result --- Key: SPARK-8292 URL: https://issues.apache.org/jira/browse/SPARK-8292 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.3.1 Environment: Ubuntu 64bit Reporter: Bruce Chen Priority: Minor Labels: patch Attachments: ShortestPaths.patch In graphx/lib/ShortestPaths, I run an example with input data: 0\t2 0\t4 2\t3 3\t6 4\t2 4\t5 5\t3 5\t6 Then I write a function that sets point '0' as the source point and calculates the shortest paths from point 0 to the other points; the code looks like this: val source: Seq[VertexId] = Seq(0) val ss = ShortestPaths.run(graph, source) Then I get the shortest-path result for every vertex: (4,Map()) (0,Map(0 -> 0)) (6,Map()) (3,Map()) (5,Map()) (2,Map()) but the right result should be: (4,Map(0 -> 1)) (0,Map(0 -> 0)) (6,Map(0 -> 3)) (3,Map(0 -> 2)) (5,Map(0 -> 2)) (2,Map(0 -> 1)) So I checked the source code of spark/graphx/src/main/scala/org/apache/spark/graphx/lib/ShortestPaths.scala and found a bug. The patch is attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7179) Add pattern after show tables to filter desire tablename
[ https://issues.apache.org/jira/browse/SPARK-7179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7179: --- Assignee: Apache Spark Add pattern after show tables to filter desire tablename -- Key: SPARK-7179 URL: https://issues.apache.org/jira/browse/SPARK-7179 Project: Spark Issue Type: New Feature Components: SQL Reporter: baishuo Assignee: Apache Spark Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8292) ShortestPaths run with error result
[ https://issues.apache.org/jira/browse/SPARK-8292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Chen updated SPARK-8292: -- Attachment: ShortestPaths.patch ShortestPaths run with error result --- Key: SPARK-8292 URL: https://issues.apache.org/jira/browse/SPARK-8292 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.3.1 Environment: Ubuntu 64bit Reporter: Bruce Chen Labels: patch Fix For: 1.3.1 Attachments: ShortestPaths.patch In graphx/lib/ShortestPaths, I run an example with input data: 0\t2 0\t4 2\t3 3\t6 4\t2 4\t5 5\t3 5\t6 Then I write a function that sets point '0' as the source point and calculates the shortest paths from point 0 to the other points; the code looks like this: val source: Seq[VertexId] = Seq(0) val ss = ShortestPaths.run(graph, source) Then I get the shortest-path result for every vertex: (4,Map()) (0,Map(0 -> 0)) (6,Map()) (3,Map()) (5,Map()) (2,Map()) but the right result should be: (4,Map(0 -> 1)) (0,Map(0 -> 0)) (6,Map(0 -> 3)) (3,Map(0 -> 2)) (5,Map(0 -> 2)) (2,Map(0 -> 1)) So I checked the source code of spark/graphx/src/main/scala/org/apache/spark/graphx/lib/ShortestPaths.scala and found a bug. The patch is attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8272) BigDecimal in parquet not working
[ https://issues.apache.org/jira/browse/SPARK-8272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580583#comment-14580583 ] Liang-Chi Hsieh commented on SPARK-8272: Can you provide more information? Such as example codes, your data schema, etc. BigDecimal in parquet not working - Key: SPARK-8272 URL: https://issues.apache.org/jira/browse/SPARK-8272 Project: Spark Issue Type: Bug Components: Spark Core, SQL Affects Versions: 1.3.1 Environment: Ubuntu 14.0 LTS Reporter: Bipin Roshan Nag Labels: sparksql When trying to save a DDF to parquet file I get the following errror: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 311, localhost): java.lang.ClassCastException: scala.runtime.BoxedUnit cannot be cast to org.apache.spark.sql.types.Decimal at org.apache.spark.sql.parquet.RowWriteSupport.writePrimitive(ParquetTableSupport.scala:220) at org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:192) at org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:171) at org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:134) at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120) at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81) at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37) at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$writeShard$1(newParquet.scala:671) at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:689) at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:689) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) I cannot save the dataframe. Please help. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8284) Regularized Extreme Learning Machine for MLlib
[ https://issues.apache.org/jira/browse/SPARK-8284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580602#comment-14580602 ] Rakesh Chalasani commented on SPARK-8284: - In my experience, the performance and the ways of building ELMs are not yet well enough understood to include this in MLlib at this time. So it is better to put this in Spark packages. See: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-MLlib-specificContributionGuidelines Regularized Extreme Learning Machine for MLlib --- Key: SPARK-8284 URL: https://issues.apache.org/jira/browse/SPARK-8284 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.1 Reporter: 李力 Extreme Learning Machines can achieve better generalization performance at a much faster learning speed for regression and classification problems, but the growing volume of datasets makes ELM regression on very large-scale datasets a challenging task. By analyzing the mechanism of the ELM algorithm, an efficient parallel ELM for regression is designed and implemented on top of Spark. The experimental results demonstrate that the proposed parallel ELM for regression can efficiently handle very large datasets with good performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7961) Redesign SQLConf for better error message reporting
[ https://issues.apache.org/jira/browse/SPARK-7961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7961: --- Assignee: Apache Spark Redesign SQLConf for better error message reporting --- Key: SPARK-7961 URL: https://issues.apache.org/jira/browse/SPARK-7961 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Apache Spark Priority: Critical Right now, we don't validate config values and as a result will throw exceptions when queries or DataFrame operations are run. Imagine if one user sets the config variable spark.sql.retainGroupColumns (which requires "true" or "false") to "hello". The set action itself will complete fine. When another user runs a query, it will throw the following exception: {code} java.lang.IllegalArgumentException: For input string: "hello" at scala.collection.immutable.StringLike$class.parseBoolean(StringLike.scala:238) at scala.collection.immutable.StringLike$class.toBoolean(StringLike.scala:226) at scala.collection.immutable.StringOps.toBoolean(StringOps.scala:31) at org.apache.spark.sql.SQLConf.dataFrameRetainGroupColumns(SQLConf.scala:265) at org.apache.spark.sql.GroupedData.toDF(GroupedData.scala:74) at org.apache.spark.sql.GroupedData.agg(GroupedData.scala:227) {code} This is highly confusing. We should redesign SQLConf to validate data input at set time (during the setConf call). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7961) Redesign SQLConf for better error message reporting
[ https://issues.apache.org/jira/browse/SPARK-7961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7961: --- Assignee: (was: Apache Spark) Redesign SQLConf for better error message reporting --- Key: SPARK-7961 URL: https://issues.apache.org/jira/browse/SPARK-7961 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Priority: Critical Right now, we don't validate config values and as a result will throw exceptions when queries or DataFrame operations are run. Imagine if one user sets the config variable spark.sql.retainGroupColumns (which requires "true" or "false") to "hello". The set action itself will complete fine. When another user runs a query, it will throw the following exception: {code} java.lang.IllegalArgumentException: For input string: "hello" at scala.collection.immutable.StringLike$class.parseBoolean(StringLike.scala:238) at scala.collection.immutable.StringLike$class.toBoolean(StringLike.scala:226) at scala.collection.immutable.StringOps.toBoolean(StringOps.scala:31) at org.apache.spark.sql.SQLConf.dataFrameRetainGroupColumns(SQLConf.scala:265) at org.apache.spark.sql.GroupedData.toDF(GroupedData.scala:74) at org.apache.spark.sql.GroupedData.agg(GroupedData.scala:227) {code} This is highly confusing. We should redesign SQLConf to validate data input at set time (during the setConf call). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
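To illustrate the proposed direction (a minimal sketch, not Spark's actual SQLConf design), a typed conf entry can validate its value inside setConf so the failure happens immediately and names the offending key:
{code}
// Hypothetical sketch of set-time validation.
final case class ConfEntry[T](key: String, parse: String => T)

object ValidatingConf {
  private def booleanEntry(key: String): ConfEntry[Boolean] =
    ConfEntry[Boolean](key, (s: String) => s.toLowerCase match {
      case "true"  => true
      case "false" => false
      case other   => throw new IllegalArgumentException(
        s"$key should be boolean, but was set to '$other'")
    })

  private val entries: Map[String, ConfEntry[Boolean]] = Map(
    "spark.sql.retainGroupColumns" -> booleanEntry("spark.sql.retainGroupColumns"))

  private val settings = scala.collection.mutable.Map.empty[String, Any]

  def setConf(key: String, value: String): Unit = {
    // Fail here, at set time, rather than when a later query reads the value.
    settings(key) = entries.get(key).map(_.parse(value)).getOrElse(value)
  }
}
{code}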
[jira] [Updated] (SPARK-8273) Driver hangs up when yarn shutdown in client mode
[ https://issues.apache.org/jira/browse/SPARK-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-8273: --- Affects Version/s: 1.4.0 1.3.1 Driver hangs up when yarn shutdown in client mode - Key: SPARK-8273 URL: https://issues.apache.org/jira/browse/SPARK-8273 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.3.1, 1.4.0 Reporter: Tao Wang In client mode, if YARN is shut down while a Spark application is running, the application will hang after several retries (default: 30) because the exception thrown by YarnClientImpl cannot be caught at the upper level; we should exit so that the user can be aware of it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8018) KMeans should accept initial cluster centers as param
[ https://issues.apache.org/jira/browse/SPARK-8018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8018: --- Assignee: Apache Spark (was: Meethu Mathew) KMeans should accept initial cluster centers as param - Key: SPARK-8018 URL: https://issues.apache.org/jira/browse/SPARK-8018 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Assignee: Apache Spark KMeans should allow model initialization using an existing set of cluster centers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8018) KMeans should accept initial cluster centers as param
[ https://issues.apache.org/jira/browse/SPARK-8018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580128#comment-14580128 ] Apache Spark commented on SPARK-8018: - User 'FlytxtRnD' has created a pull request for this issue: https://github.com/apache/spark/pull/6737 KMeans should accept initial cluster centers as param - Key: SPARK-8018 URL: https://issues.apache.org/jira/browse/SPARK-8018 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Assignee: Meethu Mathew KMeans should allow model initialization using an existing set of cluster centers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8018) KMeans should accept initial cluster centers as param
[ https://issues.apache.org/jira/browse/SPARK-8018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8018: --- Assignee: Meethu Mathew (was: Apache Spark) KMeans should accept initial cluster centers as param - Key: SPARK-8018 URL: https://issues.apache.org/jira/browse/SPARK-8018 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Assignee: Meethu Mathew KMeans should allow model initialization using an existing set of cluster centers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
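The proposed usage might look like this sketch ({{setInitialModel}} is the API the ticket asks for, shown here as proposed rather than existing; the centers and data are made up):
{code}
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors

// Seed k-means with known centers instead of random or k-means|| initialization.
val initialCenters = Array(Vectors.dense(0.0, 0.0), Vectors.dense(5.0, 5.0))
val kmeans = new KMeans()
  .setK(2)
  .setInitialModel(new KMeansModel(initialCenters))  // proposed by this ticket
// val model = kmeans.run(data)  // data: RDD[Vector]
{code}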
[jira] [Updated] (SPARK-8290) spark class command builder need read SPARK_JAVA_OPTS and SPARK_DRIVER_MEMORY properly
[ https://issues.apache.org/jira/browse/SPARK-8290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Wang updated SPARK-8290: Description: SPARK_JAVA_OPTS was missed in reconstructing the launcher part; we should add it back so spark-class can read it. The missing part is here: https://github.com/apache/spark/blob/1c30afdf94b27e1ad65df0735575306e65d148a1/bin/spark-class#L97. was: SPARK_JAVA_OPTS was missed in reconstructing the launcher part; we should add it back so spark-class can read it. The missing part is here. spark class command builder need read SPARK_JAVA_OPTS and SPARK_DRIVER_MEMORY properly -- Key: SPARK-8290 URL: https://issues.apache.org/jira/browse/SPARK-8290 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Tao Wang Priority: Minor SPARK_JAVA_OPTS was missed in reconstructing the launcher part; we should add it back so spark-class can read it. The missing part is here: https://github.com/apache/spark/blob/1c30afdf94b27e1ad65df0735575306e65d148a1/bin/spark-class#L97. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8212) math function: e
[ https://issues.apache.org/jira/browse/SPARK-8212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-8212. Resolution: Fixed Fix Version/s: 1.5.0 math function: e Key: SPARK-8212 URL: https://issues.apache.org/jira/browse/SPARK-8212 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Adrian Wang Fix For: 1.5.0 e(): double Returns the value of e. We should make this foldable so it gets folded by the optimizer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
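Usage, for illustration (a sketch assuming a SQLContext or HiveContext named {{sqlContext}}):
{code}
// e() takes no arguments and always returns the same constant,
// so marking it foldable lets the optimizer replace the call with a literal.
sqlContext.sql("SELECT e()").show()  // expected value: 2.718281828459045
{code}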
[jira] [Commented] (SPARK-8142) Spark Job Fails with ResultTask ClassCastException
[ https://issues.apache.org/jira/browse/SPARK-8142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580695#comment-14580695 ] Marcelo Vanzin commented on SPARK-8142: --- bq. If I mark my spark core as provided right now, as we speak , my code compiles but when I run my application in my IDE using Spark local I get: NoClassFoundError Not that this necessarily helps you, but that sounds like either a bug in your IDE or in how you've set it up to run your app. Because provided dependencies show up in test scope too; so I'd expect that when you're running something from an IDE it would behave similarly. re: SPARK-1867, that's a very generic problem that has multiple root causes. e.g. using a version of hbase libraries on the client side that is not compatible with the server side could lead to the same bug. Spark Job Fails with ResultTask ClassCastException -- Key: SPARK-8142 URL: https://issues.apache.org/jira/browse/SPARK-8142 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.1 Reporter: Dev Lakhani When running a Spark Job, I get no failures in the application code whatsoever but a weird ResultTask Class exception. In my job, I create a RDD from HBase and for each partition do a REST call on an API, using a REST client. This has worked in IntelliJ but when I deploy to a cluster using spark-submit.sh I get : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, host): java.lang.ClassCastException: org.apache.spark.scheduler.ResultTask cannot be cast to org.apache.spark.scheduler.Task at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:185) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) These are the configs I set to override the spark classpath because I want to use my own glassfish jersey version: sparkConf.set(spark.driver.userClassPathFirst,true); sparkConf.set(spark.executor.userClassPathFirst,true); I see no other warnings or errors in any of the logs. Unfortunately I cannot post my code, but please ask me questions that will help debug the issue. Using spark 1.3.1 hadoop 2.6. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8290) spark class command builder need read SPARK_JAVA_OPTS and SPARK_DRIVER_MEMORY properly
[ https://issues.apache.org/jira/browse/SPARK-8290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Wang updated SPARK-8290: Summary: spark class command builder need read SPARK_JAVA_OPTS and SPARK_DRIVER_MEMORY properly (was: spark class command builder need read SPARK_JAVA_OPTS) spark class command builder need read SPARK_JAVA_OPTS and SPARK_DRIVER_MEMORY properly -- Key: SPARK-8290 URL: https://issues.apache.org/jira/browse/SPARK-8290 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Tao Wang Priority: Minor SPARK_JAVA_OPTS was missed in reconstructing the launcher part; we should add it back so spark-class can read it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8290) spark class command builder need read SPARK_JAVA_OPTS and SPARK_DRIVER_MEMORY properly
[ https://issues.apache.org/jira/browse/SPARK-8290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Wang updated SPARK-8290: Description: SPARK_JAVA_OPTS was missed in reconstructing the launcher part; we should add it back so spark-class can read it. The missing part is here. was: SPARK_JAVA_OPTS was missed in reconstructing the launcher part; we should add it back so spark-class can read it. spark class command builder need read SPARK_JAVA_OPTS and SPARK_DRIVER_MEMORY properly -- Key: SPARK-8290 URL: https://issues.apache.org/jira/browse/SPARK-8290 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Tao Wang Priority: Minor SPARK_JAVA_OPTS was missed in reconstructing the launcher part; we should add it back so spark-class can read it. The missing part is here. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8142) Spark Job Fails with ResultTask ClassCastException
[ https://issues.apache.org/jira/browse/SPARK-8142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580689#comment-14580689 ] Dev Lakhani commented on SPARK-8142: To clarify [~srowen] 1) I meant the other way around, if we choose to use Apache Spark, which provides Apache Hadoop libs and we then choose a Cloudera Hadoop distribution on our (the rest of our) cluster and use Cloudera Hadoop clients in the application code. Spark will provide Apache Hadoop libs whereas our cluster will be cdh5. Is there any issue in doing this? We choose to use Apache Spark because the CDH is a version behind the official Spark release and we don't want to wait for say Dataframes support. 2) If I mark my spark core as provided right now, as we speak , my code compiles but when I run my application in my IDE using Spark local I get: NoClassFoundError org/apache/spark/api/java/function/Function this is why I am suggesting whether we need maven profiles, one for local testing and one for deployment? So getting back to the issue raised in this JIRA, which we seem to be ignoring, even when Hadoop and Spark is provided and Hbase client/protocol/server is packaged we run into SPARK-1867 which at latest comment suggests a dependency is missing and this results in the obscure exception. Whether this is on the Hadoop side or Spark side is not known but as the JIRA suggests it was caused by a missing dependency. I cannot see this missing class/dependency exception anywhere in the spark logs. This suggests that if anyone using Spark sets any of the userClasspath* misses out a primary, secondary or tertiary dependency they will encounter SPARK-1867. Therefore we are stuck, any suggestions are welcome to overcome this. Either there is a need make ChildFirstURLClassLoader ignore Spark and Hadoop libs or help spark log what's causing SPARK-1867. Spark Job Fails with ResultTask ClassCastException -- Key: SPARK-8142 URL: https://issues.apache.org/jira/browse/SPARK-8142 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.1 Reporter: Dev Lakhani When running a Spark Job, I get no failures in the application code whatsoever but a weird ResultTask Class exception. In my job, I create a RDD from HBase and for each partition do a REST call on an API, using a REST client. This has worked in IntelliJ but when I deploy to a cluster using spark-submit.sh I get : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, host): java.lang.ClassCastException: org.apache.spark.scheduler.ResultTask cannot be cast to org.apache.spark.scheduler.Task at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:185) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) These are the configs I set to override the spark classpath because I want to use my own glassfish jersey version: sparkConf.set(spark.driver.userClassPathFirst,true); sparkConf.set(spark.executor.userClassPathFirst,true); I see no other warnings or errors in any of the logs. Unfortunately I cannot post my code, but please ask me questions that will help debug the issue. Using spark 1.3.1 hadoop 2.6. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
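The provided-scope problem described above (the app compiles, but local runs hit NoClassDefFoundError) is commonly handled at the build level. A minimal sketch in sbt, assuming sbt is an acceptable stand-in for the Maven profiles being discussed; the coordinates and version match this thread, the rest is illustrative:

{code}
// build.sbt sketch: spark-core stays provided for cluster deploys...
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.1" % "provided"

// ...but `sbt run` puts provided deps back on the classpath, so local/IDE runs
// don't fail with NoClassDefFoundError for Spark classes.
Compile / run := Defaults.runTask(
  Compile / fullClasspath,
  Compile / run / mainClass,
  Compile / run / runner
).evaluated
{code}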
[jira] [Commented] (SPARK-8142) Spark Job Fails with ResultTask ClassCastException
[ https://issues.apache.org/jira/browse/SPARK-8142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580719#comment-14580719 ] Sean Owen commented on SPARK-8142: -- (FWIW CDH ships the current latest release. You can always run your own build on CDH when you like, as it's just a YARN-based app. You can also use anything in Spark, like DataFrames, in the provided build. It's not different.) In this scenario Spark doesn't provide anything Hadoop related; it's just Spark. Spark uses Hadoop code; CDH is Hadoop. I don't see what problem you're expecting. You can't run your app outside of Spark if it's provided, since the Spark deployment model is to run an app using spark-submit. Yes, I can see why your test-scope dependencies would be, well, test scope and not provided for Spark, since you want to run everything in one JVM. That's easy though. Spark Job Fails with ResultTask ClassCastException -- Key: SPARK-8142 URL: https://issues.apache.org/jira/browse/SPARK-8142 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.1 Reporter: Dev Lakhani When running a Spark job, I get no failures in the application code whatsoever but a weird ResultTask ClassCastException. In my job, I create an RDD from HBase and for each partition do a REST call on an API, using a REST client. This has worked in IntelliJ but when I deploy to a cluster using spark-submit.sh I get: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, host): java.lang.ClassCastException: org.apache.spark.scheduler.ResultTask cannot be cast to org.apache.spark.scheduler.Task at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:185) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) These are the configs I set to override the Spark classpath because I want to use my own Glassfish Jersey version: sparkConf.set("spark.driver.userClassPathFirst", "true"); sparkConf.set("spark.executor.userClassPathFirst", "true"); I see no other warnings or errors in any of the logs. Unfortunately I cannot post my code, but please ask me questions that will help debug the issue. Using Spark 1.3.1, Hadoop 2.6. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8142) Spark Job Fails with ResultTask ClassCastException
[ https://issues.apache.org/jira/browse/SPARK-8142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580670#comment-14580670 ] Marcelo Vanzin commented on SPARK-8142: --- So, using {{userClassPathFirst}} is tricky exactly because of these issues. You have to be super careful when you have classes in your app that cross the class loader boundaries. Two general comments: - if you want to use the Glassfish Jersey version, you shouldn't need to do this, right? Spark depends on the old one that is under com.sun.*, IIRC. - marking all dependencies (including HBase) as provided and using {{spark.{driver,executor}.extraClassPath}} might be the easiest way out if you really need to use {{userClassPathFirst}}. Basically the class cast exceptions you're getting are because you have the same class in both Spark's class loader and your app's class loader, and those classes need to cross that boundary. So if you make sure in your app's build that these conflicts do not occur, then using {{userClassPathFirst}} should work. Sorry I don't have a better suggestion; it's just not that trivial of a problem. :-/ We could make the child-first class loader configurable, so that you can set a sort of blacklist of packages where it should look at the parent first, but that would still require people to fiddle with configurations to make things work. Spark Job Fails with ResultTask ClassCastException -- Key: SPARK-8142 URL: https://issues.apache.org/jira/browse/SPARK-8142 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.1 Reporter: Dev Lakhani When running a Spark job, I get no failures in the application code whatsoever but a weird ResultTask ClassCastException. In my job, I create an RDD from HBase and for each partition do a REST call on an API, using a REST client. This has worked in IntelliJ but when I deploy to a cluster using spark-submit.sh I get: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, host): java.lang.ClassCastException: org.apache.spark.scheduler.ResultTask cannot be cast to org.apache.spark.scheduler.Task at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:185) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) These are the configs I set to override the Spark classpath because I want to use my own Glassfish Jersey version: sparkConf.set("spark.driver.userClassPathFirst", "true"); sparkConf.set("spark.executor.userClassPathFirst", "true"); I see no other warnings or errors in any of the logs. Unfortunately I cannot post my code, but please ask me questions that will help debug the issue. Using Spark 1.3.1, Hadoop 2.6. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
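To make the proposed "blacklist of packages" concrete: a minimal sketch of a child-first class loader that still delegates selected prefixes to the parent, so scheduler classes like ResultTask never cross loader boundaries. The class name, parameter names, and prefix list are illustrative, not Spark's actual ChildFirstURLClassLoader:

{code}
import java.net.{URL, URLClassLoader}

// Child-first loader with a parent-first prefix list (illustrative sketch).
class BlacklistingChildFirstLoader(
    urls: Array[URL],
    parent: ClassLoader,
    parentFirstPrefixes: Seq[String] =
      Seq("java.", "scala.", "org.apache.spark.", "org.apache.hadoop."))
  extends URLClassLoader(urls, parent) {

  override protected def loadClass(name: String, resolve: Boolean): Class[_] = {
    val c =
      if (parentFirstPrefixes.exists(p => name.startsWith(p))) {
        // Spark/Hadoop/JDK classes always come from the parent loader, so a
        // ResultTask loaded here is the same class the executor expects.
        super.loadClass(name, resolve)
      } else {
        // Everything else is child-first: user jars win over Spark's copies.
        Option(findLoadedClass(name)).getOrElse {
          try findClass(name)
          catch { case _: ClassNotFoundException => super.loadClass(name, resolve) }
        }
      }
    if (resolve) resolveClass(c)
    c
  }
}
{code}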
[jira] [Commented] (SPARK-5575) Artificial neural networks for MLlib deep learning
[ https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580661#comment-14580661 ] Janani Mukundan commented on SPARK-5575: Hi Alexander, I forked your latest version of https://github.com/avulanov/spark/tree/ann-interface-gemm. I would like to contribute to MLlib by adding an implementation of a DBN. I have a Scala implementation working right now. I am going to try to merge it with your ANN models. Thanks Janani Artificial neural networks for MLlib deep learning -- Key: SPARK-5575 URL: https://issues.apache.org/jira/browse/SPARK-5575 Project: Spark Issue Type: Umbrella Components: MLlib Affects Versions: 1.2.0 Reporter: Alexander Ulanov Goal: Implement various types of artificial neural networks Motivation: deep learning trend Requirements: 1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward and Backpropagation etc. should be implemented as traits or interfaces, so they can be easily extended or reused 2) Implement complex abstractions, such as feed forward and recurrent networks 3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), autoencoder (sparse and denoising), stacked autoencoder, restricted Boltzmann machines (RBM), deep belief networks (DBN) etc. 4) Implement or reuse supporting constructs, such as classifiers, normalizers, poolers, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
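Requirement 1 in miniature: if a layer is a small trait, an MLP, autoencoder, or DBN can be composed from the same pieces. A toy sketch only, not the ann-interface-gemm API:

{code}
// Toy sketch of composable layer abstractions; purely illustrative.
trait Layer {
  def forward(x: Array[Double]): Array[Double]
  def backward(x: Array[Double], gradOut: Array[Double]): Array[Double]
}

// A network is just a stack of layers; forward propagation folds over them.
class FeedForward(layers: Seq[Layer]) {
  def predict(x: Array[Double]): Array[Double] =
    layers.foldLeft(x)((acc, layer) => layer.forward(acc))
}
{code}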
[jira] [Commented] (SPARK-8142) Spark Job Fails with ResultTask ClassCastException
[ https://issues.apache.org/jira/browse/SPARK-8142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580730#comment-14580730 ] Dev Lakhani commented on SPARK-8142: Hi [~vanzin] bq. if you want to use the glassfish jersey version, you shouldn't need to do this, right? Spark depends on the old one that is under com.sun.*, IIRC. Yes, I need to make use of Glassfish 2.x in my application and not the sun.* one provided, but this could apply to any other dependency that needs to supersede Spark's provided one, etc. bq. marking all dependencies (including hbase) as provided and using {{spark. {driver,executor}.extraClassPath}} might be the easiest way out if you really need to use userClassPathFirst. This is an option but might be a challenge to scale if we have different folder layouts for the extraClassPath across clusters/nodes for HBase and Hadoop installs. This can be (and usually is) the case when new servers are added to existing ones, for example. If one node had /disk4/path/to/hbase/libs and another has /disk3/another/path/to/hbase/libs and so on, then the extraClassPath will need to include both of these and grow significantly, and the spark-submit args along with it. Also, when we update HBase, we then have to change this classpath each time. Maybe the ideal way is to have, as you suggest, a blacklist which would contain Spark and Hadoop libs. Then we could put whatever we wanted into one uber/fat jar, and it wouldn't matter where HBase and Hadoop are installed or what's provided and compiled; we let Spark work it out. These are just my thoughts; I'm sure others will have different preferences and/or better approaches. Thanks anyway for your input on this JIRA. Spark Job Fails with ResultTask ClassCastException -- Key: SPARK-8142 URL: https://issues.apache.org/jira/browse/SPARK-8142 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.1 Reporter: Dev Lakhani When running a Spark job, I get no failures in the application code whatsoever but a weird ResultTask ClassCastException. In my job, I create an RDD from HBase and for each partition do a REST call on an API, using a REST client. This has worked in IntelliJ but when I deploy to a cluster using spark-submit.sh I get: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, host): java.lang.ClassCastException: org.apache.spark.scheduler.ResultTask cannot be cast to org.apache.spark.scheduler.Task at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:185) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) These are the configs I set to override the Spark classpath because I want to use my own Glassfish Jersey version: sparkConf.set("spark.driver.userClassPathFirst", "true"); sparkConf.set("spark.executor.userClassPathFirst", "true"); I see no other warnings or errors in any of the logs. Unfortunately I cannot post my code, but please ask me questions that will help debug the issue. Using Spark 1.3.1, Hadoop 2.6. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8212) math function: e
[ https://issues.apache.org/jira/browse/SPARK-8212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-8212: --- Labels: missing-python (was: ) math function: e Key: SPARK-8212 URL: https://issues.apache.org/jira/browse/SPARK-8212 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Adrian Wang Labels: missing-python Fix For: 1.5.0 e(): double Returns the value of e. We should make this foldable so it gets folded by the optimizer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
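Why foldable matters here, as a self-contained toy that mirrors the idea of constant folding; this is not Spark's actual Catalyst Expression API:

{code}
// Toy Catalyst-like interface: a foldable expression is constant-valued, so
// the optimizer can replace it with a literal once instead of evaluating it per row.
sealed trait Expr { def foldable: Boolean; def eval(): Any }
case class Literal(value: Any) extends Expr { val foldable = true; def eval(): Any = value }
case object E  extends Expr { val foldable = true; def eval(): Any = math.E }  // e(): double
case object Pi extends Expr { val foldable = true; def eval(): Any = math.Pi } // pi(): double

// The constant-folding rule, in spirit:
def constantFold(expr: Expr): Expr = if (expr.foldable) Literal(expr.eval()) else expr
// constantFold(E) == Literal(2.718281828459045)
{code}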
[jira] [Commented] (SPARK-6284) Support framework authentication and role in Mesos framework
[ https://issues.apache.org/jira/browse/SPARK-6284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580780#comment-14580780 ] DarinJ commented on SPARK-6284: --- +1 we're currently patching spark with similar code to make it play nice on our cluster. Would be great to have the ability out of the box. Support framework authentication and role in Mesos framework Key: SPARK-6284 URL: https://issues.apache.org/jira/browse/SPARK-6284 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Timothy Chen Support framework authentication and role in both Coarse grain and fine grain mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8215) math function: pi
[ https://issues.apache.org/jira/browse/SPARK-8215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-8215. Resolution: Fixed Fix Version/s: 1.5.0 math function: pi - Key: SPARK-8215 URL: https://issues.apache.org/jira/browse/SPARK-8215 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Adrian Wang Fix For: 1.5.0 pi(): double Returns the value of pi. We should make sure foldable = true so it gets folded by the optimizer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8215) math function: pi
[ https://issues.apache.org/jira/browse/SPARK-8215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-8215: --- Labels: missing-python (was: ) math function: pi - Key: SPARK-8215 URL: https://issues.apache.org/jira/browse/SPARK-8215 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Adrian Wang Labels: missing-python Fix For: 1.5.0 pi(): double Returns the value of pi. We should make sure foldable = true so it gets folded by the optimizer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8295) SparkContext shut down in Spark Project Streaming test suite
[ https://issues.apache.org/jira/browse/SPARK-8295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8295: - Priority: Minor (was: Blocker) Definitely not a blocker, but worth tracking down if it can be reproduced elsewhere. The thing is, these tests appear to be passing elsewhere right now on Jenkins, including for Hadoop 2.4. SparkContext shut down in Spark Project Streaming test suite Key: SPARK-8295 URL: https://issues.apache.org/jira/browse/SPARK-8295 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 1.3.1 Environment: * Ubuntu 12.04.05 LTS (GNU/Linux 3.13.0-53-generic x86_64) running on VM * java version 1.8.0_45 * Spark 1.3.1, branch 1.3 Reporter: Francois-Xavier Lemire Priority: Minor *Command to build Spark:* build/mvn -Pyarn -Phadoop-2.4 -Pspark-ganglia-lgpl -Dhadoop.version=2.4.0 -DskipTests clean package *Command to test Spark:* build/mvn -Pyarn -Phadoop-2.4 -Pspark-ganglia-lgpl -Dhadoop.version=2.4.0 test *Error:* Tests fail at 'Spark Project Streaming'. *Log:* [...] - awaitTermination - awaitTermination after stop - awaitTermination with error in task - awaitTermination with error in job generation - awaitTerminationOrTimeout Exception in thread "Thread-1053" org.apache.spark.SparkException: Job cancelled because SparkContext was shut down at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:699) at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:698) at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) at org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:698) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:1411) at org.apache.spark.util.EventLoop.stop(EventLoop.scala:84) at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1346) at org.apache.spark.SparkContext.stop(SparkContext.scala:1398) at org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:580) at org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:555) at org.apache.spark.streaming.testPackage.package$.test(StreamingContextSuite.scala:437) at org.apache.spark.streaming.StreamingContextSuite$$anonfun$25.apply$mcV$sp(StreamingContextSuite.scala:332) at org.apache.spark.streaming.StreamingContextSuite$$anonfun$25.apply(StreamingContextSuite.scala:332) at org.apache.spark.streaming.StreamingContextSuite$$anonfun$25.apply(StreamingContextSuite.scala:332) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:42) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.apache.spark.streaming.StreamingContextSuite.org$scalatest$BeforeAndAfter$$super$runTest(StreamingContextSuite.scala:35) at
org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200) at org.apache.spark.streaming.StreamingContextSuite.runTest(StreamingContextSuite.scala:35) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) at scala.collection.immutable.List.foreach(List.scala:318) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) at
[jira] [Created] (SPARK-8297) Scheduler backend is not notified in case node fails in YARN
Mridul Muralidharan created SPARK-8297: -- Summary: Scheduler backend is not notified in case node fails in YARN Key: SPARK-8297 URL: https://issues.apache.org/jira/browse/SPARK-8297 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.3.1, 1.2.2, 1.4.1 Environment: Spark on yarn - both client and cluster mode. Reporter: Mridul Muralidharan Assignee: Mridul Muralidharan Priority: Critical When a node crashes, yarn detects the failure and notifies spark - but this information is not propagated to scheduler backend (unlike in mesos mode, for example). It results in repeated re-execution of stages (due to FetchFailedException on shuffle side), resulting finally in application failure. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-8282) Make number of threads used in RBackend configurable
[ https://issues.apache.org/jira/browse/SPARK-8282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-8282. Resolution: Fixed Fix Version/s: 1.5.0 Assignee: Hossein Falaki Target Version/s: 1.4.1, 1.5.0 (was: 1.4.1) Make number of threads used in RBackend configurable Key: SPARK-8282 URL: https://issues.apache.org/jira/browse/SPARK-8282 Project: Spark Issue Type: Improvement Components: SparkR Affects Versions: 1.4.0 Reporter: Hossein Falaki Assignee: Hossein Falaki Fix For: 1.4.1, 1.5.0 RBackend starts a netty server which uses two threads. The number of threads is hardcoded. It is useful to have it configurable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
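A minimal sketch of the requested change; the config key name below is an assumption for illustration, and 2 is the previously hardcoded value:

{code}
import io.netty.channel.nio.NioEventLoopGroup
import org.apache.spark.SparkConf

// Read the netty thread count from SparkConf instead of hardcoding it.
// ("spark.r.numRBackendThreads" is an assumed key name for this sketch.)
val conf = new SparkConf()
val numRBackendThreads = conf.getInt("spark.r.numRBackendThreads", 2)
val eventLoopGroup = new NioEventLoopGroup(numRBackendThreads)
{code}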
[jira] [Created] (SPARK-8296) Not able to load Dataframe using Python throws py4j.protocol.Py4JJavaError
ABHISHEK CHOUDHARY created SPARK-8296: - Summary: Not able to load Dataframe using Python throws py4j.protocol.Py4JJavaError Key: SPARK-8296 URL: https://issues.apache.org/jira/browse/SPARK-8296 Project: Spark Issue Type: Bug Affects Versions: 1.3.1 Reporter: ABHISHEK CHOUDHARY While trying to load a json file using SQLContext in the prebuilt spark-1.3.1-bin-hadoop2.4 version, it throws py4j.protocol.Py4JJavaError from pyspark.sql import SQLContext from pyspark import SparkContext sc = SparkContext() sqlContext = SQLContext(sc) # Create the DataFrame df = sqlContext.jsonFile("changes.json") # Show the content of the DataFrame df.show() Error thrown - File "/Users/abhishekchoudhary/Work/python/evolveML/kaggle/avirto/test.py", line 11, in <module> df = sqlContext.jsonFile("changes.json") File "/Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/pyspark/sql/context.py", line 377, in jsonFile df = self._ssql_ctx.jsonFile(path, samplingRatio) File "/Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__ File "/Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value py4j.protocol.Py4JJavaError -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8297) Scheduler backend is not notified in case node fails in YARN
[ https://issues.apache.org/jira/browse/SPARK-8297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan updated SPARK-8297: --- Assignee: (was: Mridul Muralidharan) Scheduler backend is not notified in case node fails in YARN Key: SPARK-8297 URL: https://issues.apache.org/jira/browse/SPARK-8297 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.2, 1.3.1, 1.4.1 Environment: Spark on yarn - both client and cluster mode. Reporter: Mridul Muralidharan Priority: Critical When a node crashes, yarn detects the failure and notifies spark - but this information is not propagated to scheduler backend (unlike in mesos mode, for example). It results in repeated re-execution of stages (due to FetchFailedException on shuffle side), resulting finally in application failure. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8064) Upgrade Hive to 1.2
[ https://issues.apache.org/jira/browse/SPARK-8064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580886#comment-14580886 ] Apache Spark commented on SPARK-8064: - User 'steveloughran' has created a pull request for this issue: https://github.com/apache/spark/pull/6748 Upgrade Hive to 1.2 --- Key: SPARK-8064 URL: https://issues.apache.org/jira/browse/SPARK-8064 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Steve Loughran Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6419) GenerateOrdering does not support BinaryType and complex types.
[ https://issues.apache.org/jira/browse/SPARK-6419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580900#comment-14580900 ] Davies Liu commented on SPARK-6419: --- Fixed by SPARK-7956 GenerateOrdering does not support BinaryType and complex types. --- Key: SPARK-6419 URL: https://issues.apache.org/jira/browse/SPARK-6419 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Assignee: Davies Liu When a user wants to order by binary columns or columns with complex types and code gen is enabled, there will be a MatchError ([see here|https://github.com/apache/spark/blob/v1.3.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateOrdering.scala#L45]). We can either add support for these types or have a function to check if we can safely call GenerateOrdering (like canBeCodeGened for the HashAggregation strategy). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
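Supporting BinaryType in GenerateOrdering amounts to generating a lexicographic, unsigned byte-by-byte comparison, since binary columns are Array[Byte]. A hand-written sketch of that comparison, for reference:

{code}
// Lexicographic comparison of byte arrays: unsigned per-byte compare, with the
// shorter array sorting first on a common prefix.
def compareBinary(a: Array[Byte], b: Array[Byte]): Int = {
  val minLen = math.min(a.length, b.length)
  var i = 0
  while (i < minLen) {
    val cmp = (a(i) & 0xff) - (b(i) & 0xff)
    if (cmp != 0) return cmp
    i += 1
  }
  a.length - b.length
}
{code}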
[jira] [Resolved] (SPARK-7996) Deprecate the developer api SparkEnv.actorSystem
[ https://issues.apache.org/jira/browse/SPARK-7996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-7996. Resolution: Fixed Fix Version/s: 1.5.0 Assignee: Ilya Ganelin Deprecate the developer api SparkEnv.actorSystem Key: SPARK-7996 URL: https://issues.apache.org/jira/browse/SPARK-7996 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Shixiong Zhu Assignee: Ilya Ganelin Fix For: 1.5.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7996) Deprecate the developer api SparkEnv.actorSystem
[ https://issues.apache.org/jira/browse/SPARK-7996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580912#comment-14580912 ] Reynold Xin commented on SPARK-7996: [~ilganeli] that one is for later versions. Deprecate the developer api SparkEnv.actorSystem Key: SPARK-7996 URL: https://issues.apache.org/jira/browse/SPARK-7996 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Shixiong Zhu Assignee: Ilya Ganelin Fix For: 1.5.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8103) DAGScheduler should not launch multiple concurrent attempts for one stage on fetch failures
[ https://issues.apache.org/jira/browse/SPARK-8103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8103: --- Assignee: Apache Spark (was: Imran Rashid) DAGScheduler should not launch multiple concurrent attempts for one stage on fetch failures --- Key: SPARK-8103 URL: https://issues.apache.org/jira/browse/SPARK-8103 Project: Spark Issue Type: Bug Components: Scheduler, Spark Core Affects Versions: 1.4.0 Reporter: Imran Rashid Assignee: Apache Spark When there is a fetch failure, {{DAGScheduler}} is supposed to fail the stage, retry the necessary portions of the preceding shuffle stage which generated the shuffle data, and eventually rerun the stage. We generally expect to get multiple fetch failures together, but only want to re-start the stage once. The code already makes an attempt to address this https://github.com/apache/spark/blob/10ba1880878d0babcdc5c9b688df5458ea131531/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1108 . {code} // It is likely that we receive multiple FetchFailed for a single stage (because we have // multiple tasks running concurrently on different executors). In that case, it is possible // the fetch failure has already been handled by the scheduler. if (runningStages.contains(failedStage)) { {code} However, this logic is flawed because the stage may have been **resubmitted** by the time we get these fetch failures. In that case, {{runningStages.contains(failedStage)}} will be true, but we've already handled these failures. This results in multiple concurrent non-zombie attempts for one stage. In addition to being very confusing, and a waste of resources, this also can lead to later stages being submitted before the previous stage has registered its map output. This happens because (a) when one attempt finishes all its tasks, it may not register its map output because the stage still has pending tasks, from other attempts https://github.com/apache/spark/blob/10ba1880878d0babcdc5c9b688df5458ea131531/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1046 {code} if (runningStages.contains(shuffleStage) && shuffleStage.pendingTasks.isEmpty) { {code} and (b) {{submitStage}} thinks the following stage is ready to go, because {{getMissingParentStages}} thinks the stage is complete as long as it has all of its map outputs: https://github.com/apache/spark/blob/10ba1880878d0babcdc5c9b688df5458ea131531/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L397 {code} if (!mapStage.isAvailable) { missing += mapStage } {code} So the following stage is submitted repeatedly, but it is doomed to fail because its shuffle output has never been registered with the map output tracker. Here's an example failure in this case: {noformat} WARN TaskSetManager: Lost task 5.0 in stage 3.2 (TID 294, 192.168.1.104): FetchFailed(null, shuffleId=0, mapId=-1, reduceId=5, message= org.apache.spark.shuffle.MetadataFetchFailedException: Missing output locations for shuffle ... {noformat} Note that this is a subset of the problems originally described in SPARK-7308, limited to just the issues affecting the DAGScheduler -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
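The flaw described above is that "stage is running" cannot distinguish the original attempt from a resubmission. A sketch of the stronger guard the description argues for, with illustrative names rather than the actual DAGScheduler fields:

{code}
// Track which attempt a FetchFailed came from, and ignore failures from
// attempts older than the stage's current one; runningStages alone is fooled
// once the stage has been resubmitted.
case class FetchFailed(stageId: Int, stageAttemptId: Int)

def shouldHandleFailure(
    event: FetchFailed,
    currentAttemptId: Int,
    isRunning: Boolean): Boolean = {
  isRunning && event.stageAttemptId == currentAttemptId
}
{code}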
[jira] [Closed] (SPARK-7261) Change default log level to WARN in the REPL
[ https://issues.apache.org/jira/browse/SPARK-7261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-7261. Resolution: Fixed Fix Version/s: 1.5.0 Change default log level to WARN in the REPL Key: SPARK-7261 URL: https://issues.apache.org/jira/browse/SPARK-7261 Project: Spark Issue Type: Improvement Components: Spark Shell Reporter: Patrick Wendell Assignee: Shixiong Zhu Priority: Blocker Labels: starter Fix For: 1.5.0 We should add a log4j properties file for the repl (log4j-defaults-repl.properties) with log level WARN. The main reason for doing this is that we now display nice progress bars in the REPL, so the need for task-level INFO messages is much less. The best way to accomplish this is the following: 1. Add a second logging defaults file called log4j-defaults-repl.properties that has log level WARN. https://github.com/apache/spark/blob/branch-1.4/core/src/main/resources/org/apache/spark/log4j-defaults.properties 2. When logging is initialized, check whether you are inside the REPL. If so, then use that one: https://github.com/apache/spark/blob/branch-1.4/core/src/main/scala/org/apache/spark/Logging.scala#L124 3. The printed message should say something like: Using Spark's repl log4j profile: org/apache/spark/log4j-defaults-repl.properties To adjust logging level use sc.setLogLevel("INFO") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
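Steps 1-3 sketched together; the REPL-detection predicate below is a placeholder assumption, while the resource names come from the description:

{code}
import org.apache.log4j.PropertyConfigurator

// Pick the repl-specific defaults file when initializing logging in the REPL.
val inRepl = sys.props.contains("spark.repl.class.uri") // placeholder check, illustrative only
val defaults =
  if (inRepl) "org/apache/spark/log4j-defaults-repl.properties"
  else "org/apache/spark/log4j-defaults.properties"
Option(getClass.getClassLoader.getResource(defaults)).foreach { url =>
  PropertyConfigurator.configure(url)
  println(s"Using Spark's ${if (inRepl) "repl " else ""}log4j profile: $defaults")
  if (inRepl) println("To adjust logging level use sc.setLogLevel(\"INFO\")")
}
{code}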
[jira] [Closed] (SPARK-7527) Wrong detection of REPL mode in ClosureCleaner
[ https://issues.apache.org/jira/browse/SPARK-7527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-7527. Resolution: Fixed Fix Version/s: 1.5.0 Assignee: Shixiong Zhu Target Version/s: 1.5.0 Wrong detection of REPL mode in ClosureCleaner -- Key: SPARK-7527 URL: https://issues.apache.org/jira/browse/SPARK-7527 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.1 Reporter: Oleksii Kostyliev Assignee: Shixiong Zhu Priority: Minor Fix For: 1.5.0 If the REPL class is not present on the classpath, the {{inInterpreter}} boolean switch shall be {{false}}, not {{true}}, at: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/ClosureCleaner.scala#L247 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-5162) Python yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-5162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-5162. Resolution: Fixed Fix Version/s: 1.4.0 Assignee: Lianhui Wang Target Version/s: 1.4.0 Python yarn-cluster mode Key: SPARK-5162 URL: https://issues.apache.org/jira/browse/SPARK-5162 Project: Spark Issue Type: New Feature Components: PySpark, YARN Reporter: Dana Klassen Assignee: Lianhui Wang Labels: cluster, python, yarn Fix For: 1.4.0 Running pyspark in yarn is currently limited to 'yarn-client' mode. It would be great to be able to submit python applications to the cluster and (just like java classes) have the resource manager set up an AM on any node in the cluster. Does anyone know the issues blocking this feature? I was snooping around with enabling python apps: Removing the logic stopping python and yarn-cluster from SparkSubmit.scala ... // The following modes are not supported or applicable (clusterManager, deployMode) match { ... case (_, CLUSTER) if args.isPython => printErrorAndExit("Cluster deploy mode is currently not supported for python applications.") ... } ... and submitting the application via: HADOOP_CONF_DIR={{insert conf dir}} ./bin/spark-submit --master yarn-cluster --num-executors 2 --py-files {{insert location of egg here}} --executor-cores 1 ../tools/canary.py Everything looks to run alright, pythonRunner is picked up as main class, resources get set up, yarn client gets launched but falls flat on its face: 2015-01-08 18:48:03,444 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: DEBUG: FAILED { {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py, 1420742868009, FILE, null }, Resource {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py changed on src filesystem (expected 1420742868009, was 1420742869284 and 2015-01-08 18:48:03,446 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py(-/data/4/yarn/nm/usercache/klassen/filecache/11/canary.py) transitioned from DOWNLOADING to FAILED Tracked this down to the Apache Hadoop code (FSDownload.java, line 249) related to container localization of files upon downloading. At this point I thought it would be best to raise the issue here and get input. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-5479) PySpark on yarn mode need to support non-local python files
[ https://issues.apache.org/jira/browse/SPARK-5479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-5479. Resolution: Fixed Fix Version/s: 1.5.0 Assignee: Marcelo Vanzin Target Version/s: 1.5.0 PySpark on yarn mode need to support non-local python files --- Key: SPARK-5479 URL: https://issues.apache.org/jira/browse/SPARK-5479 Project: Spark Issue Type: Bug Components: PySpark, YARN Affects Versions: 1.4.0 Reporter: Lianhui Wang Assignee: Marcelo Vanzin Fix For: 1.5.0 In SPARK-5162 [~vgrigor] reports this: Now the following code cannot work: aws emr add-steps --cluster-id j-XYWIXMD234 \ --steps Name=SparkPi,Jar=s3://eu-west-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[/home/hadoop/spark/bin/spark-submit,--deploy-mode,cluster,--master,yarn-cluster,--py-files,s3://mybucketat.amazonaws.com/tasks/main.py,main.py,param1],ActionOnFailure=CONTINUE So we need to support non-local python files in yarn client and cluster mode. Before submitting the application to YARN, we need to download non-local files to a local or HDFS path, or spark.yarn.dist.files needs to support other non-local files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-8290) spark class command builder need read SPARK_JAVA_OPTS and SPARK_DRIVER_MEMORY properly
[ https://issues.apache.org/jira/browse/SPARK-8290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-8290. Resolution: Fixed Fix Version/s: 1.5.0 Assignee: Tao Wang Target Version/s: 1.5.0 spark class command builder need read SPARK_JAVA_OPTS and SPARK_DRIVER_MEMORY properly -- Key: SPARK-8290 URL: https://issues.apache.org/jira/browse/SPARK-8290 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Tao Wang Assignee: Tao Wang Priority: Minor Fix For: 1.5.0 SPARK_JAVA_OPTS was missed in reconstructing the launcher part; we should add it back so spark-class can read it. The missing part is here: https://github.com/apache/spark/blob/1c30afdf94b27e1ad65df0735575306e65d148a1/bin/spark-class#L97. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8290) spark class command builder need read SPARK_JAVA_OPTS and SPARK_DRIVER_MEMORY properly
[ https://issues.apache.org/jira/browse/SPARK-8290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-8290: - Affects Version/s: 1.3.0 spark class command builder need read SPARK_JAVA_OPTS and SPARK_DRIVER_MEMORY properly -- Key: SPARK-8290 URL: https://issues.apache.org/jira/browse/SPARK-8290 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Tao Wang Priority: Minor Fix For: 1.5.0 SPARK_JAVA_OPTS was missed in reconstructing the launcher part; we should add it back so spark-class can read it. The missing part is here: https://github.com/apache/spark/blob/1c30afdf94b27e1ad65df0735575306e65d148a1/bin/spark-class#L97. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8296) Not able to load Dataframe using Python throws py4j.protocol.Py4JJavaError
[ https://issues.apache.org/jira/browse/SPARK-8296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ABHISHEK CHOUDHARY updated SPARK-8296: -- Description: While trying to load a json file using SQLContext in the prebuilt spark-1.3.1-bin-hadoop2.4 version, it throws py4j.protocol.Py4JJavaError from pyspark.sql import SQLContext from pyspark import SparkContext sc = SparkContext() sqlContext = SQLContext(sc) # Create the DataFrame df = sqlContext.jsonFile("changes.json") # Show the content of the DataFrame df.show() Error thrown - File "/Users/abhishekchoudhary/Work/python/evolveML/kaggle/avirto/test.py", line 11, in <module> df = sqlContext.jsonFile("changes.json") File "/Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/pyspark/sql/context.py", line 377, in jsonFile df = self._ssql_ctx.jsonFile(path, samplingRatio) File "/Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__ File "/Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value py4j.protocol.Py4JJavaError On checking through the source code, I found that 'gateway_client' is not valid. was: While trying to load a json file using SQLContext in the prebuilt spark-1.3.1-bin-hadoop2.4 version, it throws py4j.protocol.Py4JJavaError from pyspark.sql import SQLContext from pyspark import SparkContext sc = SparkContext() sqlContext = SQLContext(sc) # Create the DataFrame df = sqlContext.jsonFile("changes.json") # Show the content of the DataFrame df.show() Error thrown - File "/Users/abhishekchoudhary/Work/python/evolveML/kaggle/avirto/test.py", line 11, in <module> df = sqlContext.jsonFile("changes.json") File "/Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/pyspark/sql/context.py", line 377, in jsonFile df = self._ssql_ctx.jsonFile(path, samplingRatio) File "/Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__ File "/Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value py4j.protocol.Py4JJavaError Not able to load Dataframe using Python throws py4j.protocol.Py4JJavaError -- Key: SPARK-8296 URL: https://issues.apache.org/jira/browse/SPARK-8296 Project: Spark Issue Type: Bug Affects Versions: 1.3.1 Reporter: ABHISHEK CHOUDHARY Labels: test While trying to load a json file using SQLContext in the prebuilt spark-1.3.1-bin-hadoop2.4 version, it throws py4j.protocol.Py4JJavaError from pyspark.sql import SQLContext from pyspark import SparkContext sc = SparkContext() sqlContext = SQLContext(sc) # Create the DataFrame df = sqlContext.jsonFile("changes.json") # Show the content of the DataFrame df.show() Error thrown - File "/Users/abhishekchoudhary/Work/python/evolveML/kaggle/avirto/test.py", line 11, in <module> df = sqlContext.jsonFile("changes.json") File "/Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/pyspark/sql/context.py", line 377, in jsonFile df = self._ssql_ctx.jsonFile(path, samplingRatio) File "/Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__ File "/Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value py4j.protocol.Py4JJavaError On checking through the source code, I found that 'gateway_client' is not valid.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
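One way to narrow down the truncated Py4JJavaError above is to attempt the same load from Scala: if it fails there too, the root cause is on the JVM side rather than in py4j or the Python wrapper. A sketch using the same Spark 1.3.x API as the report (jsonFile was later superseded by read.json):

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Mirror of the reporter's Python snippet, in Scala (Spark 1.3.x API).
val sc = new SparkContext(new SparkConf().setAppName("json-check").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)
val df = sqlContext.jsonFile("changes.json")
df.show()
{code}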
[jira] [Assigned] (SPARK-7186) Decouple internal Row from external Row
[ https://issues.apache.org/jira/browse/SPARK-7186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-7186: - Assignee: Davies Liu Decouple internal Row from external Row --- Key: SPARK-7186 URL: https://issues.apache.org/jira/browse/SPARK-7186 Project: Spark Issue Type: New Feature Components: SQL Reporter: Reynold Xin Assignee: Davies Liu Priority: Blocker Currently, we use o.a.s.sql.Row both internally and externally. The external interface is wider than what the internals need because it is designed to facilitate end-user programming. This design has proven to be very error-prone and cumbersome for internal Row implementations. As a first step, we should just create an InternalRow interface in the catalyst module, which is identical to the current Row interface. And we should switch all internal operators/expressions to use this InternalRow instead. When we need to expose Row, we convert the InternalRow implementation into Row for users. After this, we can start removing methods that don't make sense for InternalRow (in a separate ticket). This is probably one of the most important refactorings in Spark 1.5. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
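The shape of the proposed refactoring in miniature; the names mirror the description, while the real interfaces carry many more methods:

{code}
// Operators and expressions depend only on the narrow internal interface...
trait InternalRow {
  def numFields: Int
  def get(ordinal: Int): Any
}

// ...and rows are converted to the public Row only at the user-facing boundary.
def toPublicRow(row: InternalRow): org.apache.spark.sql.Row =
  org.apache.spark.sql.Row.fromSeq((0 until row.numFields).map(row.get))
{code}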
[jira] [Commented] (SPARK-8103) DAGScheduler should not launch multiple concurrent attempts for one stage on fetch failures
[ https://issues.apache.org/jira/browse/SPARK-8103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14581019#comment-14581019 ] Apache Spark commented on SPARK-8103: - User 'squito' has created a pull request for this issue: https://github.com/apache/spark/pull/6750 DAGScheduler should not launch multiple concurrent attempts for one stage on fetch failures --- Key: SPARK-8103 URL: https://issues.apache.org/jira/browse/SPARK-8103 Project: Spark Issue Type: Bug Components: Scheduler, Spark Core Affects Versions: 1.4.0 Reporter: Imran Rashid Assignee: Imran Rashid When there is a fetch failure, {{DAGScheduler}} is supposed to fail the stage, retry the necessary portions of the preceding shuffle stage which generated the shuffle data, and eventually rerun the stage. We generally expect to get multiple fetch failures together, but only want to re-start the stage once. The code already makes an attempt to address this https://github.com/apache/spark/blob/10ba1880878d0babcdc5c9b688df5458ea131531/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1108 . {code} // It is likely that we receive multiple FetchFailed for a single stage (because we have // multiple tasks running concurrently on different executors). In that case, it is possible // the fetch failure has already been handled by the scheduler. if (runningStages.contains(failedStage)) { {code} However, this logic is flawed because the stage may have been **resubmitted** by the time we get these fetch failures. In that case, {{runningStages.contains(failedStage)}} will be true, but we've already handled these failures. This results in multiple concurrent non-zombie attempts for one stage. In addition to being very confusing, and a waste of resources, this also can lead to later stages being submitted before the previous stage has registered its map output. This happens because (a) when one attempt finishes all its tasks, it may not register its map output because the stage still has pending tasks, from other attempts https://github.com/apache/spark/blob/10ba1880878d0babcdc5c9b688df5458ea131531/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1046 {code} if (runningStages.contains(shuffleStage) && shuffleStage.pendingTasks.isEmpty) { {code} and (b) {{submitStage}} thinks the following stage is ready to go, because {{getMissingParentStages}} thinks the stage is complete as long as it has all of its map outputs: https://github.com/apache/spark/blob/10ba1880878d0babcdc5c9b688df5458ea131531/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L397 {code} if (!mapStage.isAvailable) { missing += mapStage } {code} So the following stage is submitted repeatedly, but it is doomed to fail because its shuffle output has never been registered with the map output tracker. Here's an example failure in this case: {noformat} WARN TaskSetManager: Lost task 5.0 in stage 3.2 (TID 294, 192.168.1.104): FetchFailed(null, shuffleId=0, mapId=-1, reduceId=5, message= org.apache.spark.shuffle.MetadataFetchFailedException: Missing output locations for shuffle ... {noformat} Note that this is a subset of the problems originally described in SPARK-7308, limited to just the issues affecting the DAGScheduler -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8103) DAGScheduler should not launch multiple concurrent attempts for one stage on fetch failures
[ https://issues.apache.org/jira/browse/SPARK-8103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8103: --- Assignee: Imran Rashid (was: Apache Spark) DAGScheduler should not launch multiple concurrent attempts for one stage on fetch failures --- Key: SPARK-8103 URL: https://issues.apache.org/jira/browse/SPARK-8103 Project: Spark Issue Type: Bug Components: Scheduler, Spark Core Affects Versions: 1.4.0 Reporter: Imran Rashid Assignee: Imran Rashid When there is a fetch failure, {{DAGScheduler}} is supposed to fail the stage, retry the necessary portions of the preceding shuffle stage which generated the shuffle data, and eventually rerun the stage. We generally expect to get multiple fetch failures together, but only want to re-start the stage once. The code already makes an attempt to address this https://github.com/apache/spark/blob/10ba1880878d0babcdc5c9b688df5458ea131531/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1108 . {code} // It is likely that we receive multiple FetchFailed for a single stage (because we have // multiple tasks running concurrently on different executors). In that case, it is possible // the fetch failure has already been handled by the scheduler. if (runningStages.contains(failedStage)) { {code} However, this logic is flawed because the stage may have been **resubmitted** by the time we get these fetch failures. In that case, {{runningStages.contains(failedStage)}} will be true, but we've already handled these failures. This results in multiple concurrent non-zombie attempts for one stage. In addition to being very confusing, and a waste of resources, this also can lead to later stages being submitted before the previous stage has registered its map output. This happens because (a) when one attempt finishes all its tasks, it may not register its map output because the stage still has pending tasks, from other attempts https://github.com/apache/spark/blob/10ba1880878d0babcdc5c9b688df5458ea131531/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1046 {code} if (runningStages.contains(shuffleStage) && shuffleStage.pendingTasks.isEmpty) { {code} and (b) {{submitStage}} thinks the following stage is ready to go, because {{getMissingParentStages}} thinks the stage is complete as long as it has all of its map outputs: https://github.com/apache/spark/blob/10ba1880878d0babcdc5c9b688df5458ea131531/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L397 {code} if (!mapStage.isAvailable) { missing += mapStage } {code} So the following stage is submitted repeatedly, but it is doomed to fail because its shuffle output has never been registered with the map output tracker. Here's an example failure in this case: {noformat} WARN TaskSetManager: Lost task 5.0 in stage 3.2 (TID 294, 192.168.1.104): FetchFailed(null, shuffleId=0, mapId=-1, reduceId=5, message= org.apache.spark.shuffle.MetadataFetchFailedException: Missing output locations for shuffle ... {noformat} Note that this is a subset of the problems originally described in SPARK-7308, limited to just the issues affecting the DAGScheduler -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8297) Scheduler backend is not notified in case node fails in YARN
[ https://issues.apache.org/jira/browse/SPARK-8297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14581038#comment-14581038 ] Mridul Muralidharan commented on SPARK-8297: Spark on mesos handles this situation by calling removeExecutor() on the scheduler backend - the yarn module does not. I have this fixed locally, but unfortunately, I do not have the bandwidth to shepherd a patch. The fix is simple - replicate something similar to what is done in CoarseMesosSchedulerBackend.slaveLost(). Essentially: a) maintain a mapping from container-id to executor-id in YarnAllocator (consistent with, and the inverse of, executorIdToContainer) b) propagate the scheduler backend to YarnAllocator when YarnClusterScheduler.postCommitHook is called, c) in processCompletedContainers, if the container is not in releasedContainers, invoke backend.removeExecutor(executorId, msg) to notify the backend that the executor did not exit gracefully/as expected, d) remove the mapping from containerIdToExecutorId and executorIdToContainer in processCompletedContainers (the latter also fixes a memory leak in YarnAllocator, btw). In case no one picks this one up, I can fix it later in the 1.5 release cycle. Scheduler backend is not notified in case node fails in YARN Key: SPARK-8297 URL: https://issues.apache.org/jira/browse/SPARK-8297 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.2, 1.3.1, 1.4.1 Environment: Spark on yarn - both client and cluster mode. Reporter: Mridul Muralidharan Priority: Critical When a node crashes, yarn detects the failure and notifies spark - but this information is not propagated to scheduler backend (unlike in mesos mode, for example). It results in repeated re-execution of stages (due to FetchFailedException on shuffle side), resulting finally in application failure. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
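Steps (a) through (d) sketched with paraphrased names; this shows the shape of the fix, not the actual YarnAllocator code:

{code}
import scala.collection.mutable

// Minimal stand-in for the scheduler backend method the comment refers to.
trait SchedulerBackendLike {
  def removeExecutor(executorId: String, reason: String): Unit
}

// (b) the backend is propagated in via the constructor.
class YarnAllocatorSketch(backend: SchedulerBackendLike) {
  // (a) container-id -> executor-id, inverse of executorIdToContainer.
  private val containerIdToExecutorId = mutable.HashMap.empty[String, String]
  private val releasedContainers = mutable.HashSet.empty[String]

  // (c) + (d): on container completion, notify the backend of unexpected exits
  // and drop the mapping (also plugging the memory leak mentioned above).
  def processCompletedContainer(containerId: String, exitStatus: Int): Unit = {
    if (!releasedContainers.remove(containerId)) {
      containerIdToExecutorId.get(containerId).foreach { execId =>
        backend.removeExecutor(execId,
          s"Container $containerId exited unexpectedly with status $exitStatus")
      }
    }
    containerIdToExecutorId.remove(containerId)
  }
}
{code}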
[jira] [Updated] (SPARK-8297) Scheduler backend is not notified in case node fails in YARN
[ https://issues.apache.org/jira/browse/SPARK-8297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan updated SPARK-8297: --- Affects Version/s: 1.5.0 Scheduler backend is not notified in case node fails in YARN Key: SPARK-8297 URL: https://issues.apache.org/jira/browse/SPARK-8297 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.2, 1.3.1, 1.4.1, 1.5.0 Environment: Spark on yarn - both client and cluster mode. Reporter: Mridul Muralidharan Priority: Critical When a node crashes, yarn detects the failure and notifies spark - but this information is not propagated to scheduler backend (unlike in mesos mode, for example). It results in repeated re-execution of stages (due to FetchFailedException on shuffle side), resulting finally in application failure. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6419) GenerateOrdering does not support BinaryType and complex types.
[ https://issues.apache.org/jira/browse/SPARK-6419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-6419. --- Resolution: Fixed Fix Version/s: 1.5.0 GenerateOrdering does not support BinaryType and complex types. --- Key: SPARK-6419 URL: https://issues.apache.org/jira/browse/SPARK-6419 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Assignee: Davies Liu Fix For: 1.5.0 When a user wants to order by binary columns or columns with complex types and code gen is enabled, there will be a MatchError ([see here|https://github.com/apache/spark/blob/v1.3.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateOrdering.scala#L45]). We can either add support for these types or have a function to check if we can safely call GenerateOrdering (like canBeCodeGened for the HashAggregation strategy). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8293) Add high-level java docs to important YARN classes
Andrew Or created SPARK-8293: Summary: Add high-level java docs to important YARN classes Key: SPARK-8293 URL: https://issues.apache.org/jira/browse/SPARK-8293 Project: Spark Issue Type: Sub-task Components: Documentation, YARN Affects Versions: 1.4.0 Reporter: Andrew Or The YARN integration code has been with Spark for many releases now. However, important classes like `Client`, `ApplicationMaster` and `ExecutorRunnable` don't even have detailed javadocs yet. This makes it difficult to follow the code without digging into too much detail. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8294) Break down large methods in YARN code
Andrew Or created SPARK-8294: - Summary: Break down large methods in YARN code Key: SPARK-8294 URL: https://issues.apache.org/jira/browse/SPARK-8294 Project: Spark Issue Type: Sub-task Components: YARN Affects Versions: 1.4.0 Reporter: Andrew Or What large methods am I talking about? Client#prepareLocalResources ~ 170 lines ExecutorRunnable#prepareCommand ~ 100 lines Client#setupLaunchEnv ~ 100 lines ... many others that hover around 80 - 90 lines There are several things wrong with this. First, it's difficult to follow / review the code. Second, it's difficult to test it at a fine granularity. In the past we as a community have been reluctant to add new regression tests for YARN changes. This stems from the fact that it is difficult to write tests, and the cost is that we can't really ensure the correctness of the code easily. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org