[jira] [Created] (SPARK-16003) SerializationDebugger runs into an infinite loop
Davies Liu created SPARK-16003: -- Summary: SerializationDebugger runs into an infinite loop Key: SPARK-16003 URL: https://issues.apache.org/jira/browse/SPARK-16003 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Davies Liu Priority: Critical This was observed while debugging https://issues.apache.org/jira/browse/SPARK-15811. We should fix it or disable it by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
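For illustration only: the ticket has no repro attached, but the classic way a serialization-graph walker like SerializationDebugger fails to terminate is a cyclic object graph. A minimal sketch of identity-based cycle protection (hypothetical GraphWalk helper, not the actual SerializationDebugger code):
{code}
import java.util.{Collections, IdentityHashMap}

object GraphWalk {
  // Track visited objects by reference identity: without this, a cyclic
  // object graph sends a naive traversal into an infinite loop.
  private val visited =
    Collections.newSetFromMap(new IdentityHashMap[AnyRef, java.lang.Boolean]())

  def visit(obj: AnyRef)(children: AnyRef => Seq[AnyRef]): Unit = {
    if (obj != null && visited.add(obj)) { // add() returns false if already seen
      children(obj).foreach(child => visit(child)(children))
    }
  }
}
{code}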
[jira] [Assigned] (SPARK-15966) Fix markdown for Spark Monitoring
[ https://issues.apache.org/jira/browse/SPARK-15966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15966: Assignee: Apache Spark > Fix markdown for Spark Monitoring > - > > Key: SPARK-15966 > URL: https://issues.apache.org/jira/browse/SPARK-15966 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.0.0 >Reporter: Dhruve Ashar >Assignee: Apache Spark >Priority: Trivial > > The markdown for Spark monitoring needs to be fixed. > http://spark.apache.org/docs/2.0.0-preview/monitoring.html > The closing tag is missing for `spark.ui.view.acls.groups`, which is causing > the markdown to render incorrectly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15966) Fix markdown for Spark Monitoring
[ https://issues.apache.org/jira/browse/SPARK-15966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15966: Assignee: (was: Apache Spark) > Fix markdown for Spark Monitoring > - > > Key: SPARK-15966 > URL: https://issues.apache.org/jira/browse/SPARK-15966 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.0.0 >Reporter: Dhruve Ashar >Priority: Trivial > > The markdown for Spark monitoring needs to be fixed. > http://spark.apache.org/docs/2.0.0-preview/monitoring.html > The closing tag is missing for `spark.ui.view.acls.groups`, which is causing > the markdown to render incorrectly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15966) Fix markdown for Spark Monitoring
[ https://issues.apache.org/jira/browse/SPARK-15966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15334820#comment-15334820 ] Apache Spark commented on SPARK-15966: -- User 'dhruve' has created a pull request for this issue: https://github.com/apache/spark/pull/13719 > Fix markdown for Spark Monitoring > - > > Key: SPARK-15966 > URL: https://issues.apache.org/jira/browse/SPARK-15966 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.0.0 >Reporter: Dhruve Ashar >Priority: Trivial > > The markdown for Spark monitoring needs to be fixed. > http://spark.apache.org/docs/2.0.0-preview/monitoring.html > The closing tag is missing for `spark.ui.view.acls.groups`, which is causing > the markdown to render incorrectly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16002) Sleep when no new data arrives to avoid 100% CPU usage
[ https://issues.apache.org/jira/browse/SPARK-16002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16002: Assignee: Apache Spark (was: Shixiong Zhu) > Sleep when no new data arrives to avoid 100% CPU usage > -- > > Key: SPARK-16002 > URL: https://issues.apache.org/jira/browse/SPARK-16002 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Shixiong Zhu >Assignee: Apache Spark > > Right now, if the trigger is ProcessTrigger(0), StreamExecution will keep > polling for new data even when there is none, so the CPU usage will be 100%. > We should add a minimum polling delay when no new data arrives. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16002) Sleep when no new data arrives to avoid 100% CPU usage
[ https://issues.apache.org/jira/browse/SPARK-16002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16002: Assignee: Shixiong Zhu (was: Apache Spark) > Sleep when no new data arrives to avoid 100% CPU usage > -- > > Key: SPARK-16002 > URL: https://issues.apache.org/jira/browse/SPARK-16002 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > Right now, if the trigger is ProcessTrigger(0), StreamExecution will keep > polling for new data even when there is none, so the CPU usage will be 100%. > We should add a minimum polling delay when no new data arrives. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16002) Sleep when no new data arrives to avoid 100% CPU usage
[ https://issues.apache.org/jira/browse/SPARK-16002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15334802#comment-15334802 ] Apache Spark commented on SPARK-16002: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/13718 > Sleep when no new data arrives to avoid 100% CPU usage > -- > > Key: SPARK-16002 > URL: https://issues.apache.org/jira/browse/SPARK-16002 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > Right now, if the trigger is ProcessTrigger(0), StreamExecution will keep > polling for new data even when there is none, so the CPU usage will be 100%. > We should add a minimum polling delay when no new data arrives. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15966) Fix markdown for Spark Monitoring
[ https://issues.apache.org/jira/browse/SPARK-15966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dhruve Ashar updated SPARK-15966: - Description: The markdown for Spark monitoring needs to be fixed. http://spark.apache.org/docs/2.0.0-preview/monitoring.html The closing tag is missing for `spark.ui.view.acls.groups`, which is causing the markdown to render incorrectly. was: The markdown for Spark monitoring needs to be fixed. http://spark.apache.org/docs/2.0.0-preview/monitoring.html > Fix markdown for Spark Monitoring > - > > Key: SPARK-15966 > URL: https://issues.apache.org/jira/browse/SPARK-15966 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.0.0 >Reporter: Dhruve Ashar >Priority: Trivial > > The markdown for Spark monitoring needs to be fixed. > http://spark.apache.org/docs/2.0.0-preview/monitoring.html > The closing tag is missing for `spark.ui.view.acls.groups`, which is causing > the markdown to render incorrectly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16002) Sleep when no new data arrives to avoid 100% CPU usage
Shixiong Zhu created SPARK-16002: Summary: Sleep when no new data arrives to avoid 100% CPU usage Key: SPARK-16002 URL: https://issues.apache.org/jira/browse/SPARK-16002 Project: Spark Issue Type: Improvement Components: SQL Reporter: Shixiong Zhu Assignee: Shixiong Zhu Right now, if the trigger is ProcessTrigger(0), StreamExecution will keep polling for new data even when there is none, so the CPU usage will be 100%. We should add a minimum polling delay when no new data arrives. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
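For illustration, a minimal sketch of the proposed behavior (hypothetical loop and names; the real change belongs in StreamExecution's trigger handling):
{code}
// Back off when a batch finds no new data, so an always-ready trigger such
// as ProcessTrigger(0) no longer spins at 100% CPU.
val minPollingDelayMs = 10L // assumed minimum delay; the fix will pick the real value

def runBatches(hasNewData: () => Boolean, runBatch: () => Unit): Unit = {
  while (true) {
    if (hasNewData()) {
      runBatch()
    } else {
      Thread.sleep(minPollingDelayMs) // sleep instead of immediately re-polling
    }
  }
}
{code}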
[jira] [Commented] (SPARK-16000) Make model loading backward compatible with saved models using old vector columns
[ https://issues.apache.org/jira/browse/SPARK-16000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15334763#comment-15334763 ] Yanbo Liang commented on SPARK-16000: - We should only add this for models that support save/load in Spark 1.6. Since we do not currently have a backward-compatibility test framework for save/load, we can only test offline right now. If this makes sense, I can work on this issue. > Make model loading backward compatible with saved models using old vector > columns > - > > Key: SPARK-16000 > URL: https://issues.apache.org/jira/browse/SPARK-16000 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Xiangrui Meng > > To help users migrate from Spark 1.6 to 2.0, we should make model loading > backward compatible with models saved in 1.6. The main incompatibility is the > vector column type change. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15501) ML 2.0 QA: Scala APIs audit for recommendation
[ https://issues.apache.org/jira/browse/SPARK-15501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15334754#comment-15334754 ] Joseph K. Bradley commented on SPARK-15501: --- [~mlnick] Is this audit done, or are there checks remaining? > ML 2.0 QA: Scala APIs audit for recommendation > -- > > Key: SPARK-15501 > URL: https://issues.apache.org/jira/browse/SPARK-15501 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML >Reporter: Nick Pentreath >Assignee: Nick Pentreath >Priority: Blocker > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15786) joinWith bytecode generation calling ByteBuffer.wrap with InternalRow
[ https://issues.apache.org/jira/browse/SPARK-15786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15334743#comment-15334743 ] Sean Zhong commented on SPARK-15786: [~yhuai] Sure, we definitely can improve it. > joinWith bytecode generation calling ByteBuffer.wrap with InternalRow > - > > Key: SPARK-15786 > URL: https://issues.apache.org/jira/browse/SPARK-15786 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 >Reporter: Richard Marscher >Assignee: Sean Zhong > Fix For: 2.0.0 > > > {code}java.lang.RuntimeException: Error while decoding: > java.util.concurrent.ExecutionException: java.lang.Exception: failed to > compile: org.codehaus.commons.compiler.CompileException: File > 'generated.java', Line 36, Column 107: No applicable constructor/method found > for actual parameters "org.apache.spark.sql.catalyst.InternalRow"; candidates > are: "public static java.nio.ByteBuffer java.nio.ByteBuffer.wrap(byte[])", > "public static java.nio.ByteBuffer java.nio.ByteBuffer.wrap(byte[], int, > int)"{code} > I have been trying to use joinWith along with Option data types to > approximate the RDD semantics for outer joins with Dataset, in order to get a > nicer API for Scala. However, using the Dataset.as[] syntax leads to bytecode > generation trying to pass an InternalRow object into the ByteBuffer.wrap > function, which expects byte[] with or without a couple of int qualifiers. > I have a notebook reproducing this against 2.0 preview in Databricks > Community Edition: > https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/160347920874755/1039589581260901/673639177603143/latest.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
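For reference, a hedged sketch of the reported pattern (hypothetical case classes; spark is a 2.0-preview SparkSession, e.g. in spark-shell):
{code}
import org.apache.spark.sql.SparkSession

case class A(id: Long, v: String)
case class B(id: Long, w: String)

val spark = SparkSession.builder().master("local[2]").appName("repro").getOrCreate()
import spark.implicits._

val left = Seq(A(1L, "a")).toDS()
val right = Seq(B(2L, "b")).toDS()

// Per the report: re-encoding the joinWith result with .as[] to get
// RDD-style Option semantics for outer joins triggers the codegen error.
val joined = left.joinWith(right, left("id") === right("id"), "full_outer")
  .as[(Option[A], Option[B])]
joined.collect()
{code}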
[jira] [Resolved] (SPARK-15749) Make the error message more meaningful
[ https://issues.apache.org/jira/browse/SPARK-15749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15749. --- Resolution: Fixed Assignee: Huaxin Gao Fix Version/s: 2.0.0 Target Version/s: 2.0.0 > Make the error message more meaningful > -- > > Key: SPARK-15749 > URL: https://issues.apache.org/jira/browse/SPARK-15749 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Trivial > Fix For: 2.0.0 > > > For table test1 (C1 varchar (10), C2 varchar (10)), when I insert a row using > sqlContext.sql("insert into test1 values ('abc', 'def', 1)") > I get the error message > Exception in thread "main" java.lang.RuntimeException: Relation[C1#0,C2#1] > JDBCRelation(test1) > requires that the query in the SELECT clause of the INSERT INTO/OVERWRITE > statement generates the same number of columns as its schema. > The error message is a little confusing: my simple insert statement doesn't > have a SELECT clause. > I will change the error message to a more general one: > Exception in thread "main" java.lang.RuntimeException: Relation[C1#0,C2#1] > JDBCRelation(test1) > requires that the data to be inserted have the same number of columns as the > target table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15643) ML 2.0 QA: migration guide update
[ https://issues.apache.org/jira/browse/SPARK-15643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15334735#comment-15334735 ] Yanbo Liang commented on SPARK-15643: - Sure > ML 2.0 QA: migration guide update > - > > Key: SPARK-15643 > URL: https://issues.apache.org/jira/browse/SPARK-15643 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML, MLlib >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Blocker > > Update spark.ml and spark.mllib migration guide from 1.6 to 2.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15868) Executors table in Executors tab should sort Executor IDs in numerical order (not alphabetical order)
[ https://issues.apache.org/jira/browse/SPARK-15868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15868. --- Resolution: Fixed Assignee: Alex Bozarth Fix Version/s: 2.0.0 Target Version/s: 2.0.0 > Executors table in Executors tab should sort Executor IDs in numerical order > (not alphabetical order) > - > > Key: SPARK-15868 > URL: https://issues.apache.org/jira/browse/SPARK-15868 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.0.0 >Reporter: Jacek Laskowski >Assignee: Alex Bozarth >Priority: Minor > Fix For: 2.0.0 > > Attachments: spark-webui-executors-sorting-2.png, > spark-webui-executors-sorting.png > > > It _appears_ that the Executors table in the Executors tab sorts Executor IDs in > alphabetical order while it should sort them in numerical order. It does sort in a > more "friendly" way, yet the driver executor appears between 0 and 1? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
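A sketch of the ordering the ticket implies (illustrative only, not the actual Web UI patch): the driver's non-numeric ID sorts first and numeric IDs sort numerically.
{code}
// Executor IDs are strings ("driver", "0", "1", ..., "10"), so a plain string
// sort puts "10" before "2"; sorting on a (rank, numeric value) key avoids that.
def executorIdOrder(id: String): (Int, Long) = id match {
  case "driver" => (0, 0L) // driver always first
  case n if n.nonEmpty && n.forall(_.isDigit) => (1, n.toLong) // numeric order
  case _ => (2, 0L) // anything else last
}

println(Seq("10", "driver", "2", "1").sortBy(executorIdOrder))
// List(driver, 1, 2, 10)
{code}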
[jira] [Updated] (SPARK-15947) Make pipeline components backward compatible with old vector columns in Scala/Java
[ https://issues.apache.org/jira/browse/SPARK-15947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-15947: -- Summary: Make pipeline components backward compatible with old vector columns in Scala/Java (was: Make pipeline components backward compatible with old vector columns) > Make pipeline components backward compatible with old vector columns in > Scala/Java > -- > > Key: SPARK-15947 > URL: https://issues.apache.org/jira/browse/SPARK-15947 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > After SPARK-15945, we should make ALL pipeline components accept old vector > columns as input and do the conversion automatically (probably with a warning > message), in order to smooth the migration to 2.0. > --Note that this includes loading old saved models.-- SPARK-16000 handles > backward compatibility in model loading. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-15948) Make pipeline components backward compatible with old vector columns in Python
[ https://issues.apache.org/jira/browse/SPARK-15948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng closed SPARK-15948. - Resolution: Won't Fix > Make pipeline components backward compatible with old vector columns in Python > -- > > Key: SPARK-15948 > URL: https://issues.apache.org/jira/browse/SPARK-15948 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Xiangrui Meng > > Same as SPARK-15947 but for Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15947) Make pipeline components backward compatible with old vector columns in Scala/Java
[ https://issues.apache.org/jira/browse/SPARK-15947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-15947. --- Resolution: Won't Fix > Make pipeline components backward compatible with old vector columns in > Scala/Java > -- > > Key: SPARK-15947 > URL: https://issues.apache.org/jira/browse/SPARK-15947 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > After SPARK-15945, we should make ALL pipeline components accept old vector > columns as input and do the conversion automatically (probably with a warning > message), in order to smooth the migration to 2.0. > --Note that this includes loading old saved models.-- SPARK-16000 handles > backward compatibility in model loading. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15948) Make pipeline components backward compatible with old vector columns in Python
[ https://issues.apache.org/jira/browse/SPARK-15948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15334732#comment-15334732 ] Xiangrui Meng commented on SPARK-15948: --- Marked this as "Won't Do". See SPARK-15947 for reasons. > Make pipeline components backward compatible with old vector columns in Python > -- > > Key: SPARK-15948 > URL: https://issues.apache.org/jira/browse/SPARK-15948 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Xiangrui Meng > > Same as SPARK-15947 but for Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15643) ML 2.0 QA: migration guide update
[ https://issues.apache.org/jira/browse/SPARK-15643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-15643: -- Assignee: Yanbo Liang > ML 2.0 QA: migration guide update > - > > Key: SPARK-15643 > URL: https://issues.apache.org/jira/browse/SPARK-15643 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML, MLlib >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Blocker > > Update spark.ml and spark.mllib migration guide from 1.6 to 2.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15643) ML 2.0 QA: migration guide update
[ https://issues.apache.org/jira/browse/SPARK-15643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15334729#comment-15334729 ] Xiangrui Meng commented on SPARK-15643: --- [~yanboliang] Please include a paragraph to help users convert vector columns. See https://issues.apache.org/jira/browse/SPARK-15947. > ML 2.0 QA: migration guide update > - > > Key: SPARK-15643 > URL: https://issues.apache.org/jira/browse/SPARK-15643 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML, MLlib >Reporter: Yanbo Liang >Priority: Blocker > > Update spark.ml and spark.mllib migration guide from 1.6 to 2.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-15947) Make pipeline components backward compatible with old vector columns
[ https://issues.apache.org/jira/browse/SPARK-15947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15334725#comment-15334725 ] Xiangrui Meng edited comment on SPARK-15947 at 6/16/16 9:30 PM: Had an offline discussion with [~josephkb]. There would be a lot of work to implement this feature and its tests. A simpler choice is to ask users to manually convert the DataFrames at the beginning of the pipeline with tools implemented in SPARK-15945. Then we can update the migration guide (SPARK-15643) to include the error message and put this workaround there, so users can search on Google and find the solution. I'm closing this ticket. was (Author: mengxr): Had an offline discussion with [~josephkb]. There would be a lot of work to implement this feature and its tests. A simpler choice is to ask users to manually convert the DataFrames at the beginning of the pipeline with tools implemented in SPARK-15945. Then we can update the migration guide to include the error message and put this workaround there, so users can search on Google and find the solution. I'm closing this ticket. > Make pipeline components backward compatible with old vector columns > > > Key: SPARK-15947 > URL: https://issues.apache.org/jira/browse/SPARK-15947 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > After SPARK-15945, we should make ALL pipeline components accept old vector > columns as input and do the conversion automatically (probably with a warning > message), in order to smooth the migration to 2.0. > --Note that this includes loading old saved models.-- SPARK-16000 handles > backward compatibility in model loading. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15947) Make pipeline components backward compatible with old vector columns
[ https://issues.apache.org/jira/browse/SPARK-15947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15334725#comment-15334725 ] Xiangrui Meng commented on SPARK-15947: --- Had an offline discussion with [~josephkb]. There would be a lot of work to implement this feature and its tests. A simpler choice is to ask users to manually convert the DataFrames at the beginning of the pipeline with tools implemented in SPARK-15945. Then we can update the migration guide to include the error message and put this workaround there, so users can search on Google and find the solution. I'm closing this ticket. > Make pipeline components backward compatible with old vector columns > > > Key: SPARK-15947 > URL: https://issues.apache.org/jira/browse/SPARK-15947 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > After SPARK-15945, we should make ALL pipeline components accept old vector > columns as input and do the conversion automatically (probably with a warning > message), in order to smooth the migration to 2.0. > --Note that this includes loading old saved models.-- SPARK-16000 handles > backward compatibility in model loading. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
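For reference, the manual conversion mentioned in the comment can sit at the head of a pipeline. A sketch, assuming the SPARK-15945 tooling is MLUtils.convertVectorColumnsToML and that df is a DataFrame whose "features" column still holds the pre-2.0 vector type:
{code}
import org.apache.spark.mllib.util.MLUtils

// Convert old mllib vector columns to the new ml vector type once, up front,
// instead of teaching every pipeline stage to accept both representations.
val converted = MLUtils.convertVectorColumnsToML(df, "features")
// `converted` can then be passed to 2.0 pipeline components as usual.
{code}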
[jira] [Updated] (SPARK-15343) NoClassDefFoundError when initializing Spark with YARN
[ https://issues.apache.org/jira/browse/SPARK-15343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-15343: Attachment: (was: jersey-client-2.22.2.jar) > NoClassDefFoundError when initializing Spark with YARN > -- > > Key: SPARK-15343 > URL: https://issues.apache.org/jira/browse/SPARK-15343 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Critical > > I'm trying to connect Spark 2.0 (compiled from branch-2.0) with Hadoop. > Spark compiled with: > {code} > ./dev/make-distribution.sh -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver > -Dhadoop.version=2.6.0 -DskipTests > {code} > I'm getting the following error > {code} > mbrynski@jupyter:~/spark$ bin/pyspark > Python 3.4.0 (default, Apr 11 2014, 13:05:11) > [GCC 4.8.2] on linux > Type "help", "copyright", "credits" or "license" for more information. > Warning: Master yarn-client is deprecated since 2.0. Please use master "yarn" > with specified deploy mode instead. > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). > 16/05/16 11:54:41 WARN SparkConf: The configuration key 'spark.yarn.jar' has > been deprecated as of Spark 2.0 and may be removed in the future. Please use > the new key 'spark.yarn.jars' instead. > 16/05/16 11:54:41 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > 16/05/16 11:54:42 WARN AbstractHandler: No Server set for > org.spark_project.jetty.server.handler.ErrorHandler@f7989f6 > 16/05/16 11:54:43 WARN DomainSocketFactory: The short-circuit local reads > feature cannot be used because libhadoop cannot be loaded. > Traceback (most recent call last): > File "/home/mbrynski/spark/python/pyspark/shell.py", line 38, in <module> > sc = SparkContext() > File "/home/mbrynski/spark/python/pyspark/context.py", line 115, in __init__ > conf, jsc, profiler_cls) > File "/home/mbrynski/spark/python/pyspark/context.py", line 172, in _do_init > self._jsc = jsc or self._initialize_context(self._conf._jconf) > File "/home/mbrynski/spark/python/pyspark/context.py", line 235, in > _initialize_context > return self._jvm.JavaSparkContext(jconf) > File > "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 1183, in __call__ > File > "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line > 312, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling > None.org.apache.spark.api.java.JavaSparkContext.
> : java.lang.NoClassDefFoundError: > com/sun/jersey/api/client/config/ClientConfig > at > org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:45) > at > org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:163) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:150) > at > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56) > at > org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:148) > at org.apache.spark.SparkContext.<init>(SparkContext.scala:502) > at > org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:422) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:240) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:236) > at > py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80) > at > py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69) > at py4j.GatewayConnection.run(GatewayConnection.java:211) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.ClassNotFoundException: > com.sun.jersey.api.client.config.ClientConfig > at java.net.URLClassLoader.findClass(URLClassLoader.java:381) > at java.lang.ClassLoader.loadClass(ClassLoader.java:424) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331) > at java.lang.ClassLoader.loadClass(ClassLoader.java:357) > ... 19 more > {code} > On 1.6 everything works fine.
[jira] [Created] (SPARK-16001) request that spark history server write a log entry whenever it (1) tries cleaning old event logs and (2) has found and deleted old event logs
Thanh created SPARK-16001: - Summary: request that spark history server write a log entry whenever it (1) tries cleaning old event logs and (2) has found and deleted old event logs Key: SPARK-16001 URL: https://issues.apache.org/jira/browse/SPARK-16001 Project: Spark Issue Type: Improvement Reporter: Thanh Request that the Spark history server write a log entry whenever it (1) tries cleaning old event logs and (2) has found and deleted old event logs. Currently, it doesn't log anything at all unless there is a failure inside cleanLogs(). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
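For illustration, a standalone sketch of the two requested log points (hypothetical helpers passed in as parameters; the real code would use Spark's Logging trait):
{code}
def cleanLogs(eventLogs: Seq[String], isExpired: String => Boolean,
              delete: String => Unit, logInfo: String => Unit): Unit = {
  logInfo("Checking for expired event logs") // (1) log every cleaning attempt
  val expired = eventLogs.filter(isExpired)
  expired.foreach { path =>
    delete(path)
    logInfo(s"Deleted expired event log $path") // (2) log each deletion
  }
  logInfo(s"Event log cleanup finished; deleted ${expired.size} file(s)")
}
{code}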
[jira] [Commented] (SPARK-15767) Decision Tree Regression wrapper in SparkR
[ https://issues.apache.org/jira/browse/SPARK-15767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15334721#comment-15334721 ] Joseph K. Bradley commented on SPARK-15767: --- [~vectorijk] Notes from sync: Can you please write more about the possible APIs? I'd like to do a comparison of: * the rpart API * the MLlib DecisionTreeClassifier and DecisionTreeRegressor APIs The comparison should list all parameters and their meaning. The idea is to figure out which of the following we can do: * Best option: Mimic rpart exactly so that R users can switch to spark.rpart easily * Worst option: Sort of mimic rpart, but not exactly because of a difference in functionality, such as new parameters from MLlib or differences in behavior. * Medium option: Avoid the rpart API, and instead offer APIs matching DecisionTreeClassifier and DecisionTreeRegressor in the Scala/Java/Python APIs > Decision Tree Regression wrapper in SparkR > -- > > Key: SPARK-15767 > URL: https://issues.apache.org/jira/browse/SPARK-15767 > Project: Spark > Issue Type: New Feature > Components: ML, SparkR >Reporter: Kai Jiang >Assignee: Kai Jiang > > Implement a wrapper in SparkR to support decision tree regression. R's native > Decision Tree Regression implementation comes from the rpart package, with signature > rpart(formula, dataframe, method="anova"). I propose we implement an API > like spark.decisionTreeRegression(dataframe, formula, ...). After having > implemented decision tree classification, we could refactor these two into an > API more like rpart() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15343) NoClassDefFoundError when initializing Spark with YARN
[ https://issues.apache.org/jira/browse/SPARK-15343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-15343: Attachment: jersey-client-2.22.2.jar > NoClassDefFoundError when initializing Spark with YARN > -- > > Key: SPARK-15343 > URL: https://issues.apache.org/jira/browse/SPARK-15343 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Critical > Attachments: jersey-client-2.22.2.jar > > > I'm trying to connect Spark 2.0 (compiled from branch-2.0) with Hadoop. > Spark compiled with: > {code} > ./dev/make-distribution.sh -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver > -Dhadoop.version=2.6.0 -DskipTests > {code} > I'm getting the following error > {code} > mbrynski@jupyter:~/spark$ bin/pyspark > Python 3.4.0 (default, Apr 11 2014, 13:05:11) > [GCC 4.8.2] on linux > Type "help", "copyright", "credits" or "license" for more information. > Warning: Master yarn-client is deprecated since 2.0. Please use master "yarn" > with specified deploy mode instead. > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). > 16/05/16 11:54:41 WARN SparkConf: The configuration key 'spark.yarn.jar' has > been deprecated as of Spark 2.0 and may be removed in the future. Please use > the new key 'spark.yarn.jars' instead. > 16/05/16 11:54:41 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > 16/05/16 11:54:42 WARN AbstractHandler: No Server set for > org.spark_project.jetty.server.handler.ErrorHandler@f7989f6 > 16/05/16 11:54:43 WARN DomainSocketFactory: The short-circuit local reads > feature cannot be used because libhadoop cannot be loaded. > Traceback (most recent call last): > File "/home/mbrynski/spark/python/pyspark/shell.py", line 38, in <module> > sc = SparkContext() > File "/home/mbrynski/spark/python/pyspark/context.py", line 115, in __init__ > conf, jsc, profiler_cls) > File "/home/mbrynski/spark/python/pyspark/context.py", line 172, in _do_init > self._jsc = jsc or self._initialize_context(self._conf._jconf) > File "/home/mbrynski/spark/python/pyspark/context.py", line 235, in > _initialize_context > return self._jvm.JavaSparkContext(jconf) > File > "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 1183, in __call__ > File > "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line > 312, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling > None.org.apache.spark.api.java.JavaSparkContext.
> : java.lang.NoClassDefFoundError: > com/sun/jersey/api/client/config/ClientConfig > at > org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:45) > at > org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:163) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:150) > at > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56) > at > org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:148) > at org.apache.spark.SparkContext.<init>(SparkContext.scala:502) > at > org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:422) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:240) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:236) > at > py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80) > at > py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69) > at py4j.GatewayConnection.run(GatewayConnection.java:211) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.ClassNotFoundException: > com.sun.jersey.api.client.config.ClientConfig > at java.net.URLClassLoader.findClass(URLClassLoader.java:381) > at java.lang.ClassLoader.loadClass(ClassLoader.java:424) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331) > at java.lang.ClassLoader.loadClass(ClassLoader.java:357) > ... 19 more >
[jira] [Commented] (SPARK-15343) NoClassDefFoundError when initializing Spark with YARN
[ https://issues.apache.org/jira/browse/SPARK-15343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15334719#comment-15334719 ] Saisai Shao commented on SPARK-15343: - The class ClientConfig still exists, but the package name has changed to org.glassfish.xx. > NoClassDefFoundError when initializing Spark with YARN > -- > > Key: SPARK-15343 > URL: https://issues.apache.org/jira/browse/SPARK-15343 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Critical > Attachments: jersey-client-2.22.2.jar > > > I'm trying to connect Spark 2.0 (compiled from branch-2.0) with Hadoop. > Spark compiled with: > {code} > ./dev/make-distribution.sh -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver > -Dhadoop.version=2.6.0 -DskipTests > {code} > I'm getting the following error > {code} > mbrynski@jupyter:~/spark$ bin/pyspark > Python 3.4.0 (default, Apr 11 2014, 13:05:11) > [GCC 4.8.2] on linux > Type "help", "copyright", "credits" or "license" for more information. > Warning: Master yarn-client is deprecated since 2.0. Please use master "yarn" > with specified deploy mode instead. > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). > 16/05/16 11:54:41 WARN SparkConf: The configuration key 'spark.yarn.jar' has > been deprecated as of Spark 2.0 and may be removed in the future. Please use > the new key 'spark.yarn.jars' instead. > 16/05/16 11:54:41 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > 16/05/16 11:54:42 WARN AbstractHandler: No Server set for > org.spark_project.jetty.server.handler.ErrorHandler@f7989f6 > 16/05/16 11:54:43 WARN DomainSocketFactory: The short-circuit local reads > feature cannot be used because libhadoop cannot be loaded. > Traceback (most recent call last): > File "/home/mbrynski/spark/python/pyspark/shell.py", line 38, in <module> > sc = SparkContext() > File "/home/mbrynski/spark/python/pyspark/context.py", line 115, in __init__ > conf, jsc, profiler_cls) > File "/home/mbrynski/spark/python/pyspark/context.py", line 172, in _do_init > self._jsc = jsc or self._initialize_context(self._conf._jconf) > File "/home/mbrynski/spark/python/pyspark/context.py", line 235, in > _initialize_context > return self._jvm.JavaSparkContext(jconf) > File > "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 1183, in __call__ > File > "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line > 312, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling > None.org.apache.spark.api.java.JavaSparkContext.
> : java.lang.NoClassDefFoundError: > com/sun/jersey/api/client/config/ClientConfig > at > org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:45) > at > org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:163) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:150) > at > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56) > at > org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:148) > at org.apache.spark.SparkContext.<init>(SparkContext.scala:502) > at > org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:422) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:240) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:236) > at > py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80) > at > py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69) > at py4j.GatewayConnection.run(GatewayConnection.java:211) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.ClassNotFoundException: > com.sun.jersey.api.client.config.ClientConfig > at java.net.URLClassLoader.findClass(URLClassLoader.java:381) > at java.lang.ClassLoader.loadClass(ClassLoader.java:424) > at sun.misc.Launcher$AppClassLoader.loadClass
[jira] [Resolved] (SPARK-15998) Verification of SQLConf HIVE_METASTORE_PARTITION_PRUNING
[ https://issues.apache.org/jira/browse/SPARK-15998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15998. --- Resolution: Fixed Fix Version/s: 2.0.0 > Verification of SQLConf HIVE_METASTORE_PARTITION_PRUNING > > > Key: SPARK-15998 > URL: https://issues.apache.org/jira/browse/SPARK-15998 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > Fix For: 2.0.0 > > > HIVE_METASTORE_PARTITION_PRUNING is a public SQLConf. When true, some > predicates will be pushed down into the Hive metastore so that non-matching > partitions can be eliminated earlier. The current default value is false. > So far, the code base does not have a test case to verify whether this > SQLConf works properly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15998) Verification of SQLConf HIVE_METASTORE_PARTITION_PRUNING
[ https://issues.apache.org/jira/browse/SPARK-15998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15998: -- Assignee: Xiao Li > Verification of SQLConf HIVE_METASTORE_PARTITION_PRUNING > > > Key: SPARK-15998 > URL: https://issues.apache.org/jira/browse/SPARK-15998 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li >Assignee: Xiao Li > Fix For: 2.0.0 > > > HIVE_METASTORE_PARTITION_PRUNING is a public SQLConf. When true, some > predicates will be pushed down into the Hive metastore so that non-matching > partitions can be eliminated earlier. The current default value is false. > So far, the code base does not have a test case to verify whether this > SQLConf works properly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
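For context, a sketch of exercising the conf (the key is assumed from the constant's name to be spark.sql.hive.metastorePartitionPruning; spark is a SparkSession, and the table/column names are placeholders):
{code}
// Turn the flag on so partition predicates are pushed to the Hive metastore.
spark.conf.set("spark.sql.hive.metastorePartitionPruning", "true")

// With pruning enabled, a query like this should fetch only the matching
// partitions from the metastore instead of listing all of them:
spark.sql("SELECT * FROM partitioned_tbl WHERE part_col = 'x'").show()
{code}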
[jira] [Commented] (SPARK-15343) NoClassDefFoundError when initializing Spark with YARN
[ https://issues.apache.org/jira/browse/SPARK-15343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15334712#comment-15334712 ] Marcelo Vanzin commented on SPARK-15343: Fair point. But I think the right thing then is to just not enable that setting. We can't just stick to really old libraries that cause other problems just because YARN has decided not to move on. Jersey 1.9 causes too many problems when it's in the classpath, making it really hard for people to use newer versions when they need to. Since vanilla Spark has no ATS support, disabling that setting should be ok. Also, it's kinda weird that YARN is even instantiating that client automatically when Spark has no need for it, but I assume there's a good reason for that. > NoClassDefFoundError when initializing Spark with YARN > -- > > Key: SPARK-15343 > URL: https://issues.apache.org/jira/browse/SPARK-15343 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Critical > > I'm trying to connect Spark 2.0 (compiled from branch-2.0) with Hadoop. > Spark compiled with: > {code} > ./dev/make-distribution.sh -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver > -Dhadoop.version=2.6.0 -DskipTests > {code} > I'm getting the following error > {code} > mbrynski@jupyter:~/spark$ bin/pyspark > Python 3.4.0 (default, Apr 11 2014, 13:05:11) > [GCC 4.8.2] on linux > Type "help", "copyright", "credits" or "license" for more information. > Warning: Master yarn-client is deprecated since 2.0. Please use master "yarn" > with specified deploy mode instead. > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). > 16/05/16 11:54:41 WARN SparkConf: The configuration key 'spark.yarn.jar' has > been deprecated as of Spark 2.0 and may be removed in the future. Please use > the new key 'spark.yarn.jars' instead. > 16/05/16 11:54:41 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > 16/05/16 11:54:42 WARN AbstractHandler: No Server set for > org.spark_project.jetty.server.handler.ErrorHandler@f7989f6 > 16/05/16 11:54:43 WARN DomainSocketFactory: The short-circuit local reads > feature cannot be used because libhadoop cannot be loaded. > Traceback (most recent call last): > File "/home/mbrynski/spark/python/pyspark/shell.py", line 38, in <module> > sc = SparkContext() > File "/home/mbrynski/spark/python/pyspark/context.py", line 115, in __init__ > conf, jsc, profiler_cls) > File "/home/mbrynski/spark/python/pyspark/context.py", line 172, in _do_init > self._jsc = jsc or self._initialize_context(self._conf._jconf) > File "/home/mbrynski/spark/python/pyspark/context.py", line 235, in > _initialize_context > return self._jvm.JavaSparkContext(jconf) > File > "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 1183, in __call__ > File > "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line > 312, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling > None.org.apache.spark.api.java.JavaSparkContext.
> : java.lang.NoClassDefFoundError: > com/sun/jersey/api/client/config/ClientConfig > at > org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:45) > at > org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:163) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:150) > at > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56) > at > org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:148) > at org.apache.spark.SparkContext.<init>(SparkContext.scala:502) > at > org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:422) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:240) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:236) > at > py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80) > at > py4j.commands.ConstructorCommand.execut
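Given that suggestion, a hedged sketch of the workaround (relying on the fact that spark.hadoop.* keys are copied into the Hadoop Configuration; yarn.timeline-service.enabled is the YARN-side key):
{code}
import org.apache.spark.SparkConf

// Keep the YARN timeline service disabled for this application so that
// YarnClientImpl never tries to create the (Jersey 1.x based) timeline client.
val conf = new SparkConf()
  .setAppName("my-app") // placeholder name
  .set("spark.hadoop.yarn.timeline-service.enabled", "false")
{code}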
[jira] [Resolved] (SPARK-15975) Improper Popen.wait() return code handling in dev/run-tests
[ https://issues.apache.org/jira/browse/SPARK-15975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15975. --- Resolution: Fixed Fix Version/s: 2.0.0 1.6.2 1.5.3 > Improper Popen.wait() return code handling in dev/run-tests > --- > > Key: SPARK-15975 > URL: https://issues.apache.org/jira/browse/SPARK-15975 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 1.6.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 1.5.3, 1.6.2, 2.0.0 > > > In dev/run-tests.py there's a line where we effectively do > {code} > retcode = some_popen_instance.wait() > if retcode > 0: > err > # else do nothing > {code} > but this code is subtly wrong because Popen's return code will be negative > if the child process was terminated by a signal: > https://docs.python.org/2/library/subprocess.html#subprocess.Popen.returncode > We should change this to {{retcode != 0}} so that we properly error out and > exit due to termination by signal. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15978) Some improvement of "Show Tables"
[ https://issues.apache.org/jira/browse/SPARK-15978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15978. --- Resolution: Fixed Assignee: Bo Meng (was: Apache Spark) Fix Version/s: 2.0.0 Target Version/s: 2.0.0 > Some improvement of "Show Tables" > - > > Key: SPARK-15978 > URL: https://issues.apache.org/jira/browse/SPARK-15978 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Bo Meng >Assignee: Bo Meng >Priority: Minor > Fix For: 2.0.0 > > > I've found some minor issues in the "show tables" command: > 1. In SessionCatalog.scala, the listTables(db: String) method calls > listTables(formatDatabaseName(db), "*") to list all the tables for a certain > db, but in the method listTables(db: String, pattern: String), this db name > is formatted once more. So I think we should remove formatDatabaseName() from > the caller. > 2. I suggest adding a sort to listTables(db: String) in InMemoryCatalog.scala, > just like listDatabases(). > I will make a PR shortly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-15343) NoClassDefFoundError when initializing Spark with YARN
[ https://issues.apache.org/jira/browse/SPARK-15343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15334683#comment-15334683 ] Saisai Shao edited comment on SPARK-15343 at 6/16/16 9:06 PM: -- [~vanzin] [~srowen], I don't think this is vendor-specific code; look at the stack trace: it is thrown from {{YarnClientImpl}}. If we enable {{hadoop.yarn.timeline-service.enabled}}, we will always hit this problem, no matter whether on Hadoop 2.6 or 2.7 (Apache Hadoop or the HDP one). was (Author: jerryshao): [~vanzin] [~srowen], I don't think this is vendor-specific code; look at the stack trace: it is thrown from {{YarnClientImpl}}. If we enable {{hadoop.yarn.timeline-service.enabled}}, we will always hit this problem, no matter whether on Hadoop 2.6 or 2.7. > NoClassDefFoundError when initializing Spark with YARN > -- > > Key: SPARK-15343 > URL: https://issues.apache.org/jira/browse/SPARK-15343 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Critical > > I'm trying to connect Spark 2.0 (compiled from branch-2.0) with Hadoop. > Spark compiled with: > {code} > ./dev/make-distribution.sh -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver > -Dhadoop.version=2.6.0 -DskipTests > {code} > I'm getting the following error > {code} > mbrynski@jupyter:~/spark$ bin/pyspark > Python 3.4.0 (default, Apr 11 2014, 13:05:11) > [GCC 4.8.2] on linux > Type "help", "copyright", "credits" or "license" for more information. > Warning: Master yarn-client is deprecated since 2.0. Please use master "yarn" > with specified deploy mode instead. > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). > 16/05/16 11:54:41 WARN SparkConf: The configuration key 'spark.yarn.jar' has > been deprecated as of Spark 2.0 and may be removed in the future. Please use > the new key 'spark.yarn.jars' instead. > 16/05/16 11:54:41 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > 16/05/16 11:54:42 WARN AbstractHandler: No Server set for > org.spark_project.jetty.server.handler.ErrorHandler@f7989f6 > 16/05/16 11:54:43 WARN DomainSocketFactory: The short-circuit local reads > feature cannot be used because libhadoop cannot be loaded. > Traceback (most recent call last): > File "/home/mbrynski/spark/python/pyspark/shell.py", line 38, in <module> > sc = SparkContext() > File "/home/mbrynski/spark/python/pyspark/context.py", line 115, in __init__ > conf, jsc, profiler_cls) > File "/home/mbrynski/spark/python/pyspark/context.py", line 172, in _do_init > self._jsc = jsc or self._initialize_context(self._conf._jconf) > File "/home/mbrynski/spark/python/pyspark/context.py", line 235, in > _initialize_context > return self._jvm.JavaSparkContext(jconf) > File > "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 1183, in __call__ > File > "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line > 312, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling > None.org.apache.spark.api.java.JavaSparkContext.
> : java.lang.NoClassDefFoundError: > com/sun/jersey/api/client/config/ClientConfig > at > org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:45) > at > org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:163) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:150) > at > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56) > at > org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:148) > at org.apache.spark.SparkContext.<init>(SparkContext.scala:502) > at > org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:422) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:240) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:236) > at > py4j.commands.Constr
[jira] [Commented] (SPARK-15343) NoClassDefFoundError when initializing Spark with YARN
[ https://issues.apache.org/jira/browse/SPARK-15343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15334683#comment-15334683 ] Saisai Shao commented on SPARK-15343: - [~vanzin] [~srowen], I don't think this is vendor-specific code; look at the stack trace: it is thrown from {{YarnClientImpl}}. If we enable {{hadoop.yarn.timeline-service.enabled}}, we will always hit this problem, no matter whether on Hadoop 2.6 or 2.7. > NoClassDefFoundError when initializing Spark with YARN > -- > > Key: SPARK-15343 > URL: https://issues.apache.org/jira/browse/SPARK-15343 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Critical > > I'm trying to connect Spark 2.0 (compiled from branch-2.0) with Hadoop. > Spark compiled with: > {code} > ./dev/make-distribution.sh -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver > -Dhadoop.version=2.6.0 -DskipTests > {code} > I'm getting the following error > {code} > mbrynski@jupyter:~/spark$ bin/pyspark > Python 3.4.0 (default, Apr 11 2014, 13:05:11) > [GCC 4.8.2] on linux > Type "help", "copyright", "credits" or "license" for more information. > Warning: Master yarn-client is deprecated since 2.0. Please use master "yarn" > with specified deploy mode instead. > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). > 16/05/16 11:54:41 WARN SparkConf: The configuration key 'spark.yarn.jar' has > been deprecated as of Spark 2.0 and may be removed in the future. Please use > the new key 'spark.yarn.jars' instead. > 16/05/16 11:54:41 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > 16/05/16 11:54:42 WARN AbstractHandler: No Server set for > org.spark_project.jetty.server.handler.ErrorHandler@f7989f6 > 16/05/16 11:54:43 WARN DomainSocketFactory: The short-circuit local reads > feature cannot be used because libhadoop cannot be loaded. > Traceback (most recent call last): > File "/home/mbrynski/spark/python/pyspark/shell.py", line 38, in <module> > sc = SparkContext() > File "/home/mbrynski/spark/python/pyspark/context.py", line 115, in __init__ > conf, jsc, profiler_cls) > File "/home/mbrynski/spark/python/pyspark/context.py", line 172, in _do_init > self._jsc = jsc or self._initialize_context(self._conf._jconf) > File "/home/mbrynski/spark/python/pyspark/context.py", line 235, in > _initialize_context > return self._jvm.JavaSparkContext(jconf) > File > "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 1183, in __call__ > File > "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line > 312, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling > None.org.apache.spark.api.java.JavaSparkContext.
> : java.lang.NoClassDefFoundError: > com/sun/jersey/api/client/config/ClientConfig > at > org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:45) > at > org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:163) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:150) > at > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56) > at > org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:148) > at org.apache.spark.SparkContext.(SparkContext.scala:502) > at > org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:58) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:422) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:240) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:236) > at > py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80) > at > py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69) > at py4j.GatewayConnection.run(GatewayConnection.java:211) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.ClassNotFoundException: > com.sun.jersey.api.client.config.ClientConfig > at java.net.URLClassLoader.findClass(URLClassLoader.java:381) > at jav
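This trace appears whenever {{YarnClientImpl}} initializes its timeline client without the Jersey 1.x jars on the classpath, so one client-side mitigation is to switch the timeline client off entirely. A minimal PySpark sketch, assuming the job does not actually need the timeline service (the app name is hypothetical):

{code}
# Sketch: disable the YARN timeline client so YarnClientImpl.serviceInit
# never tries to load com.sun.jersey.api.client.config.ClientConfig.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("timeline-disabled-demo")  # hypothetical app name
        # "spark.hadoop.*" keys are copied into the Hadoop Configuration,
        # so this sets yarn.timeline-service.enabled=false for the client.
        .set("spark.hadoop.yarn.timeline-service.enabled", "false"))

sc = SparkContext(master="yarn", conf=conf)
{code}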
[jira] [Resolved] (SPARK-15796) Reduce spark.memory.fraction default to avoid overrunning old gen in JVM default config
[ https://issues.apache.org/jira/browse/SPARK-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-15796. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13618 [https://github.com/apache/spark/pull/13618] > Reduce spark.memory.fraction default to avoid overrunning old gen in JVM > default config > --- > > Key: SPARK-15796 > URL: https://issues.apache.org/jira/browse/SPARK-15796 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.6.0, 1.6.1 >Reporter: Gabor Feher >Priority: Blocker > Fix For: 2.0.0 > > Attachments: baseline.txt, memfrac06.txt, memfrac063.txt, > memfrac066.txt > > > While debugging performance issues in a Spark program, I've found a simple > way to slow down Spark 1.6 significantly by filling the RDD memory cache. > This seems to be a regression, because setting > "spark.memory.useLegacyMode=true" fixes the problem. Here is a repro that is > just a simple program that fills the memory cache of Spark using a > MEMORY_ONLY cached RDD (but of course this comes up in more complex > situations, too): > {code} > import org.apache.spark.SparkContext > import org.apache.spark.SparkConf > import org.apache.spark.storage.StorageLevel > object CacheDemoApp { > def main(args: Array[String]) { > val conf = new SparkConf().setAppName("Cache Demo Application") > > val sc = new SparkContext(conf) > val startTime = System.currentTimeMillis() > > > val cacheFiller = sc.parallelize(1 to 5, 1000) > > .mapPartitionsWithIndex { > case (ix, it) => > println(s"CREATE DATA PARTITION ${ix}") > > val r = new scala.util.Random(ix) > it.map(x => (r.nextLong, r.nextLong)) > } > cacheFiller.persist(StorageLevel.MEMORY_ONLY) > cacheFiller.foreach(identity) > val finishTime = System.currentTimeMillis() > val elapsedTime = (finishTime - startTime) / 1000 > println(s"TIME= $elapsedTime s") > } > } > {code} > If I call it the following way, it completes in around 5 minutes on my > Laptop, while often stopping for slow Full GC cycles. I can also see with > jvisualvm (Visual GC plugin) that the old generation of JVM is 96.8% filled. > {code} > sbt package > ~/spark-1.6.0/bin/spark-submit \ > --class "CacheDemoApp" \ > --master "local[2]" \ > --driver-memory 3g \ > --driver-java-options "-XX:+PrintGCDetails" \ > target/scala-2.10/simple-project_2.10-1.0.jar > {code} > If I add any one of the below flags, then the run-time drops to around 40-50 > seconds and the difference is coming from the drop in GC times: > --conf "spark.memory.fraction=0.6" > OR > --conf "spark.memory.useLegacyMode=true" > OR > --driver-java-options "-XX:NewRatio=3" > All the other cache types except for DISK_ONLY produce similar symptoms. It > looks like that the problem is that the amount of data Spark wants to store > long-term ends up being larger than the old generation size in the JVM and > this triggers Full GC repeatedly. > I did some research: > * In Spark 1.6, spark.memory.fraction is the upper limit on cache size. It > defaults to 0.75. > * In Spark 1.5, spark.storage.memoryFraction is the upper limit in cache > size. It defaults to 0.6 and... > * http://spark.apache.org/docs/1.5.2/configuration.html even says that it > shouldn't be bigger than the size of the old generation. > * On the other hand, OpenJDK's default NewRatio is 2, which means an old > generation size of 66%. Hence the default value in Spark 1.6 contradicts this > advice. 
> http://spark.apache.org/docs/1.6.1/tuning.html recommends that if the old > generation is running close to full, then setting > spark.memory.storageFraction to a lower value should help. I have tried with > spark.memory.storageFraction=0.1, but it still doesn't fix the issue. This is > not a surprise: http://spark.apache.org/docs/1.6.1/configuration.html > explains that storageFraction is not an upper-limit but a lower limit-like > thing on the size of Spark's cache. The real upper limit is > spark.memory.fraction. > To sum up my questions/issues: > * At least http://spark.apache.org/docs/1.6.1/tuning.html should be fixed. > Maybe the old generation size should also be mentioned in configuration.html > near spark.memory.fraction. > * Is it a goal for Spark to support heavy caching with default parameters and > without GC breakdown? If so, then better default values are needed. -- This message was sent by A
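For readers hitting the same Full GC pattern, the reporter's first mitigation is a one-line configuration change. A minimal sketch, assuming the default OpenJDK NewRatio=2 (old generation roughly 66% of the heap):

{code}
# Sketch: cap Spark's unified memory region (execution + storage) at 60%
# of the heap so long-lived cached blocks fit inside the old generation.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("cache-demo")               # hypothetical app name
        .set("spark.memory.fraction", "0.6"))   # the 1.6 default was 0.75

sc = SparkContext(conf=conf)
{code}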
[jira] [Assigned] (SPARK-15796) Reduce spark.memory.fraction default to avoid overrunning old gen in JVM default config
[ https://issues.apache.org/jira/browse/SPARK-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-15796: - Assignee: Sean Owen > Reduce spark.memory.fraction default to avoid overrunning old gen in JVM > default config > --- > > Key: SPARK-15796 > URL: https://issues.apache.org/jira/browse/SPARK-15796 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.6.0, 1.6.1 >Reporter: Gabor Feher >Assignee: Sean Owen >Priority: Blocker > Fix For: 2.0.0 > > Attachments: baseline.txt, memfrac06.txt, memfrac063.txt, > memfrac066.txt > > > While debugging performance issues in a Spark program, I've found a simple > way to slow down Spark 1.6 significantly by filling the RDD memory cache. > This seems to be a regression, because setting > "spark.memory.useLegacyMode=true" fixes the problem. Here is a repro that is > just a simple program that fills the memory cache of Spark using a > MEMORY_ONLY cached RDD (but of course this comes up in more complex > situations, too): > {code} > import org.apache.spark.SparkContext > import org.apache.spark.SparkConf > import org.apache.spark.storage.StorageLevel > object CacheDemoApp { > def main(args: Array[String]) { > val conf = new SparkConf().setAppName("Cache Demo Application") > > val sc = new SparkContext(conf) > val startTime = System.currentTimeMillis() > > > val cacheFiller = sc.parallelize(1 to 5, 1000) > > .mapPartitionsWithIndex { > case (ix, it) => > println(s"CREATE DATA PARTITION ${ix}") > > val r = new scala.util.Random(ix) > it.map(x => (r.nextLong, r.nextLong)) > } > cacheFiller.persist(StorageLevel.MEMORY_ONLY) > cacheFiller.foreach(identity) > val finishTime = System.currentTimeMillis() > val elapsedTime = (finishTime - startTime) / 1000 > println(s"TIME= $elapsedTime s") > } > } > {code} > If I call it the following way, it completes in around 5 minutes on my > Laptop, while often stopping for slow Full GC cycles. I can also see with > jvisualvm (Visual GC plugin) that the old generation of JVM is 96.8% filled. > {code} > sbt package > ~/spark-1.6.0/bin/spark-submit \ > --class "CacheDemoApp" \ > --master "local[2]" \ > --driver-memory 3g \ > --driver-java-options "-XX:+PrintGCDetails" \ > target/scala-2.10/simple-project_2.10-1.0.jar > {code} > If I add any one of the below flags, then the run-time drops to around 40-50 > seconds and the difference is coming from the drop in GC times: > --conf "spark.memory.fraction=0.6" > OR > --conf "spark.memory.useLegacyMode=true" > OR > --driver-java-options "-XX:NewRatio=3" > All the other cache types except for DISK_ONLY produce similar symptoms. It > looks like that the problem is that the amount of data Spark wants to store > long-term ends up being larger than the old generation size in the JVM and > this triggers Full GC repeatedly. > I did some research: > * In Spark 1.6, spark.memory.fraction is the upper limit on cache size. It > defaults to 0.75. > * In Spark 1.5, spark.storage.memoryFraction is the upper limit in cache > size. It defaults to 0.6 and... > * http://spark.apache.org/docs/1.5.2/configuration.html even says that it > shouldn't be bigger than the size of the old generation. > * On the other hand, OpenJDK's default NewRatio is 2, which means an old > generation size of 66%. Hence the default value in Spark 1.6 contradicts this > advice. > http://spark.apache.org/docs/1.6.1/tuning.html recommends that if the old > generation is running close to full, then setting > spark.memory.storageFraction to a lower value should help. 
I have tried with > spark.memory.storageFraction=0.1, but it still doesn't fix the issue. This is > not a surprise: http://spark.apache.org/docs/1.6.1/configuration.html > explains that storageFraction is not an upper-limit but a lower limit-like > thing on the size of Spark's cache. The real upper limit is > spark.memory.fraction. > To sum up my questions/issues: > * At least http://spark.apache.org/docs/1.6.1/tuning.html should be fixed. > Maybe the old generation size should also be mentioned in configuration.html > near spark.memory.fraction. > * Is it a goal for Spark to support heavy caching with default parameters and > without GC breakdown? If so, then better default values are needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --
[jira] [Updated] (SPARK-16000) Make model loading backward compatible with saved models using old vector columns
[ https://issues.apache.org/jira/browse/SPARK-16000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16000: -- Description: To help users migrate from Spark 1.6 to 2.0, we should make model loading backward compatible with models saved in 1.6. The main incompatibility is the vector column type change. > Make model loading backward compatible with saved models using old vector > columns > - > > Key: SPARK-16000 > URL: https://issues.apache.org/jira/browse/SPARK-16000 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Xiangrui Meng > > To help users migrate from Spark 1.6 to 2.0, we should make model loading > backward compatible with models saved in 1.6. The main incompatibility is the > vector column type change. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15947) Make pipeline components backward compatible with old vector columns
[ https://issues.apache.org/jira/browse/SPARK-15947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-15947: -- Description: After SPARK-15945, we should make ALL pipeline components accept old vector columns as input and do the conversion automatically (probably with a warning message), in order to smooth the migration to 2.0. --Note that this includes loading old saved models.-- SPARK-16000 handles backward compatibility in model loading. was: After SPARK-15945, we should make ALL pipeline components accept old vector columns as input and do the conversion automatically (probably with a warning message), in order to smooth the migration to 2.0. --Note that this includes loading old saved models.-- SPARK-15948 handles backward compatibility in model loading. > Make pipeline components backward compatible with old vector columns > > > Key: SPARK-15947 > URL: https://issues.apache.org/jira/browse/SPARK-15947 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > After SPARK-15945, we should make ALL pipeline components accept old vector > columns as input and do the conversion automatically (probably with a warning > message), in order to smooth the migration to 2.0. > --Note that this includes loading old saved models.-- SPARK-16000 handles > backward compatibility in model loading. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
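Until the automatic conversion lands in all pipeline components, old columns can be migrated explicitly. A sketch using the {{MLUtils.convertVectorColumnsToML}} helper from the 2.0 API; the DataFrame {{df}} and its "features" column are hypothetical:

{code}
# Sketch: convert old pyspark.mllib vector columns to the new pyspark.ml
# vector type before feeding the DataFrame to 2.0 pipeline stages.
from pyspark.mllib.util import MLUtils

# `df` is a hypothetical DataFrame with an old-style "features" column.
converted = MLUtils.convertVectorColumnsToML(df, "features")
converted.printSchema()
{code}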
[jira] [Resolved] (SPARK-15922) BlockMatrix to IndexedRowMatrix throws an error
[ https://issues.apache.org/jira/browse/SPARK-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-15922. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13643 [https://github.com/apache/spark/pull/13643] > BlockMatrix to IndexedRowMatrix throws an error > --- > > Key: SPARK-15922 > URL: https://issues.apache.org/jira/browse/SPARK-15922 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.0.0 >Reporter: Charlie Evans > Fix For: 2.0.0 > > > {code} > import org.apache.spark.mllib.linalg.distributed._ > import org.apache.spark.mllib.linalg._ > val rows = IndexedRow(0L, new DenseVector(Array(1,2,3))) :: IndexedRow(1L, > new DenseVector(Array(1,2,3))):: IndexedRow(2L, new > DenseVector(Array(1,2,3))):: Nil > val rdd = sc.parallelize(rows) > val matrix = new IndexedRowMatrix(rdd, 3, 3) > val bmat = matrix.toBlockMatrix > val imat = bmat.toIndexedRowMatrix > imat.rows.collect // this throws an error - Caused by: > java.lang.IllegalArgumentException: requirement failed: Vectors must be the > same length! > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
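An untested interim workaround (a sketch only, in PySpark) is to route the conversion through {{CoordinateMatrix}}, which carries the full column count and should therefore rebuild rows of uniform length, rather than calling {{toIndexedRowMatrix}} on the {{BlockMatrix}} directly; {{sc}} is assumed from the shell:

{code}
# Untested workaround sketch: the repro in PySpark, converting through
# CoordinateMatrix instead of the direct BlockMatrix.toIndexedRowMatrix
# path that raises "Vectors must be the same length!".
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

rows = sc.parallelize([IndexedRow(i, [1.0, 2.0, 3.0]) for i in range(3)])
bmat = IndexedRowMatrix(rows, 3, 3).toBlockMatrix()
imat = bmat.toCoordinateMatrix().toIndexedRowMatrix()
print(imat.rows.collect())
{code}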
[jira] [Updated] (SPARK-15922) BlockMatrix to IndexedRowMatrix throws an error
[ https://issues.apache.org/jira/browse/SPARK-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-15922: -- Assignee: Dongjoon Hyun > BlockMatrix to IndexedRowMatrix throws an error > --- > > Key: SPARK-15922 > URL: https://issues.apache.org/jira/browse/SPARK-15922 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.0.0 >Reporter: Charlie Evans >Assignee: Dongjoon Hyun > Fix For: 2.0.0 > > > {code} > import org.apache.spark.mllib.linalg.distributed._ > import org.apache.spark.mllib.linalg._ > val rows = IndexedRow(0L, new DenseVector(Array(1,2,3))) :: IndexedRow(1L, > new DenseVector(Array(1,2,3))):: IndexedRow(2L, new > DenseVector(Array(1,2,3))):: Nil > val rdd = sc.parallelize(rows) > val matrix = new IndexedRowMatrix(rdd, 3, 3) > val bmat = matrix.toBlockMatrix > val imat = bmat.toIndexedRowMatrix > imat.rows.collect // this throws an error - Caused by: > java.lang.IllegalArgumentException: requirement failed: Vectors must be the > same length! > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16000) Make model loading backward compatible with saved models using old vector columns
[ https://issues.apache.org/jira/browse/SPARK-16000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16000: -- Summary: Make model loading backward compatible with saved models using old vector columns (was: Make model loading backward compatible with saved models using old vector columns in Scala/Java) > Make model loading backward compatible with saved models using old vector > columns > - > > Key: SPARK-16000 > URL: https://issues.apache.org/jira/browse/SPARK-16000 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Xiangrui Meng > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15947) Make pipeline components backward compatible with old vector columns
[ https://issues.apache.org/jira/browse/SPARK-15947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-15947: -- Description: After SPARK-15945, we should make ALL pipeline components accept old vector columns as input and do the conversion automatically (probably with a warning message), in order to smooth the migration to 2.0. --Note that this includes loading old saved models.-- SPARK-15948 handles backward compatibility in model loading. was:After SPARK-15945, we should make ALL pipeline components accept old vector columns as input and do the conversion automatically (probably with a warning message), in order to smooth the migration to 2.0. Note that this includes loading old saved models. > Make pipeline components backward compatible with old vector columns > > > Key: SPARK-15947 > URL: https://issues.apache.org/jira/browse/SPARK-15947 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > After SPARK-15945, we should make ALL pipeline components accept old vector > columns as input and do the conversion automatically (probably with a warning > message), in order to smooth the migration to 2.0. > --Note that this includes loading old saved models.-- SPARK-15948 handles > backward compatibility in model loading. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15947) Make pipeline components backward compatible with old vector columns
[ https://issues.apache.org/jira/browse/SPARK-15947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-15947: -- Summary: Make pipeline components backward compatible with old vector columns (was: Make pipeline components backward compatible with old vector columns in Scala/Java) > Make pipeline components backward compatible with old vector columns > > > Key: SPARK-15947 > URL: https://issues.apache.org/jira/browse/SPARK-15947 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > After SPARK-15945, we should make ALL pipeline components accept old vector > columns as input and do the conversion automatically (probably with a warning > message), in order to smooth the migration to 2.0. Note that this includes > loading old saved models. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16000) Make model loading backward compatible with saved models using old vector columns
Xiangrui Meng created SPARK-16000: - Summary: Make model loading backward compatible with saved models using old vector columns Key: SPARK-16000 URL: https://issues.apache.org/jira/browse/SPARK-16000 Project: Spark Issue Type: Sub-task Components: ML, MLlib Reporter: Xiangrui Meng -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16000) Make model loading backward compatible with saved models using old vector columns in Scala/Java
[ https://issues.apache.org/jira/browse/SPARK-16000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16000: -- Summary: Make model loading backward compatible with saved models using old vector columns in Scala/Java (was: Make model loading backward compatible with saved models using old vector columns) > Make model loading backward compatible with saved models using old vector > columns in Scala/Java > --- > > Key: SPARK-16000 > URL: https://issues.apache.org/jira/browse/SPARK-16000 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Xiangrui Meng > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15731) orc writer directory permissions
[ https://issues.apache.org/jira/browse/SPARK-15731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-15731. --- Resolution: Cannot Reproduce > orc writer directory permissions > > > Key: SPARK-15731 > URL: https://issues.apache.org/jira/browse/SPARK-15731 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.1 >Reporter: Ran Haim > > When saving orc files with partitions, the partition directories created do > not have x permission (even tough umask is 002), then no other users can get > inside those directories to read the orc file. > When writing parquet files there is no such issue. > code example: > datafrmae.write.format("orc").mode("append").partitionBy("date").save("/path") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14048) Aggregation operations on structs fail when the structs have fields with special characters
[ https://issues.apache.org/jira/browse/SPARK-14048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15334657#comment-15334657 ] Sean Zhong commented on SPARK-14048: [~simeons] I can now reproduce this on Databricks community edition by changing above notebook script to: {code} val rdd = sc.makeRDD( """{"st": {"x.y": 1}, "age": 10}""" :: """{"st": {"x.y": 2}, "age": 10}""" :: """{"st": {"x.y": 2}, "age": 20}""" :: Nil) sqlContext.read.json(rdd).registerTempTable("test") %sql select first(st) as st from test group by age {code} Thanks! I will post the updates later. > Aggregation operations on structs fail when the structs have fields with > special characters > --- > > Key: SPARK-14048 > URL: https://issues.apache.org/jira/browse/SPARK-14048 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: Databricks w/ 1.6.0 >Reporter: Simeon Simeonov > Labels: sql > Attachments: bug_structs_with_backticks.html > > > Consider a schema where a struct has field names with special characters, > e.g., > {code} > |-- st: struct (nullable = true) > ||-- x.y: long (nullable = true) > {code} > Schema such as these are frequently generated by the JSON schema generator, > which seems to never want to map JSON data to {{MapType}} always preferring > to use {{StructType}}. > In SparkSQL, referring to these fields requires backticks, e.g., > {{st.`x.y`}}. There is no problem manipulating these structs unless one is > using an aggregation function. It seems that, under the covers, the code is > not escaping fields with special characters correctly. > For example, > {code} > select first(st) as st from tbl group by something > {code} > generates > {code} > org.apache.spark.sql.catalyst.util.DataTypeException: Unsupported dataType: > struct. If you have a struct and a field name of it has any > special characters, please use backticks (`) to quote that field name, e.g. > `x+y`. Please note that backtick itself is not supported in a field name. 
> at > org.apache.spark.sql.catalyst.util.DataTypeParser$class.toDataType(DataTypeParser.scala:100) > at > org.apache.spark.sql.catalyst.util.DataTypeParser$$anon$1.toDataType(DataTypeParser.scala:112) > at > org.apache.spark.sql.catalyst.util.DataTypeParser$.parse(DataTypeParser.scala:116) > at > org.apache.spark.sql.hive.HiveMetastoreTypes$.toDataType(HiveMetastoreCatalog.scala:884) > at > com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:395) > at > com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:394) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > com.databricks.backend.daemon.driver.OutputAggregator$.toJsonSchema(OutputAggregator.scala:394) > at > com.databricks.backend.daemon.driver.OutputAggregator$.maybeApplyOutputAggregation(OutputAggregator.scala:122) > at > com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation0(OutputAggregator.scala:82) > at > com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation(OutputAggregator.scala:42) > at > com.databricks.backend.daemon.driver.DriverLocal.executeSql(DriverLocal.scala:306) > at > com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:161) > at > com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467) > at > com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467) > at scala.util.Try$.apply(Try.scala:161) > at > com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:464) > at > com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:365) > at > com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:196) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
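Until the escaping is fixed, one way around the failure (a hypothetical workaround sketch, in PySpark rather than the notebook's Scala) is to flatten the struct and rename the dotted field before aggregating, so the generated code never has to re-parse a field name containing a dot:

{code}
# Workaround sketch: pull "x.y" out of the struct (backticks quote the
# dotted field name) and aggregate on the renamed flat column instead.
from pyspark.sql import functions as F

rdd = sc.parallelize([
    '{"st": {"x.y": 1}, "age": 10}',
    '{"st": {"x.y": 2}, "age": 10}',
    '{"st": {"x.y": 2}, "age": 20}',
])
df = sqlContext.read.json(rdd)
flat = df.select("age", F.col("st.`x.y`").alias("x_y"))
flat.groupBy("age").agg(F.first("x_y").alias("x_y")).show()
{code}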
[jira] [Resolved] (SPARK-15977) TRUNCATE TABLE does not work with Datasource tables outside of Hive
[ https://issues.apache.org/jira/browse/SPARK-15977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-15977. --- Resolution: Resolved > TRUNCATE TABLE does not work with Datasource tables outside of Hive > --- > > Key: SPARK-15977 > URL: https://issues.apache.org/jira/browse/SPARK-15977 > Project: Spark > Issue Type: Bug >Reporter: Herman van Hovell >Assignee: Herman van Hovell > > The {{TRUNCATE TABLE}} command does not work with datasource tables without > Hive support. For example the following doesn't work: > {noformat} > DROP TABLE IF EXISTS test > CREATE TABLE test(a INT, b STRING) USING JSON > INSERT INTO test VALUES (1, 'a'), (2, 'b'), (3, 'c') > SELECT * FROM test > TRUNCATE TABLE test > SELECT * FROM test > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14048) Aggregation operations on structs fail when the structs have fields with special characters
[ https://issues.apache.org/jira/browse/SPARK-14048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15334644#comment-15334644 ] Sean Zhong commented on SPARK-14048: [~simeons] Can you share a complete notebook which we can run and reproduce the problem you saw? For example, the file {{include/init_scala}} is missed in your notebook. > Aggregation operations on structs fail when the structs have fields with > special characters > --- > > Key: SPARK-14048 > URL: https://issues.apache.org/jira/browse/SPARK-14048 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: Databricks w/ 1.6.0 >Reporter: Simeon Simeonov > Labels: sql > Attachments: bug_structs_with_backticks.html > > > Consider a schema where a struct has field names with special characters, > e.g., > {code} > |-- st: struct (nullable = true) > ||-- x.y: long (nullable = true) > {code} > Schema such as these are frequently generated by the JSON schema generator, > which seems to never want to map JSON data to {{MapType}} always preferring > to use {{StructType}}. > In SparkSQL, referring to these fields requires backticks, e.g., > {{st.`x.y`}}. There is no problem manipulating these structs unless one is > using an aggregation function. It seems that, under the covers, the code is > not escaping fields with special characters correctly. > For example, > {code} > select first(st) as st from tbl group by something > {code} > generates > {code} > org.apache.spark.sql.catalyst.util.DataTypeException: Unsupported dataType: > struct. If you have a struct and a field name of it has any > special characters, please use backticks (`) to quote that field name, e.g. > `x+y`. Please note that backtick itself is not supported in a field name. 
> at > org.apache.spark.sql.catalyst.util.DataTypeParser$class.toDataType(DataTypeParser.scala:100) > at > org.apache.spark.sql.catalyst.util.DataTypeParser$$anon$1.toDataType(DataTypeParser.scala:112) > at > org.apache.spark.sql.catalyst.util.DataTypeParser$.parse(DataTypeParser.scala:116) > at > org.apache.spark.sql.hive.HiveMetastoreTypes$.toDataType(HiveMetastoreCatalog.scala:884) > at > com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:395) > at > com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:394) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > com.databricks.backend.daemon.driver.OutputAggregator$.toJsonSchema(OutputAggregator.scala:394) > at > com.databricks.backend.daemon.driver.OutputAggregator$.maybeApplyOutputAggregation(OutputAggregator.scala:122) > at > com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation0(OutputAggregator.scala:82) > at > com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation(OutputAggregator.scala:42) > at > com.databricks.backend.daemon.driver.DriverLocal.executeSql(DriverLocal.scala:306) > at > com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:161) > at > com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467) > at > com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467) > at scala.util.Try$.apply(Try.scala:161) > at > com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:464) > at > com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:365) > at > com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:196) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15069) GSoC 2016: Exposing more R and Python APIs for MLlib
[ https://issues.apache.org/jira/browse/SPARK-15069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15334622#comment-15334622 ] Joseph K. Bradley commented on SPARK-15069: --- h4. 6/16/2016 - Week 4 To-do items * Continuation of doc items: [SPARK-15672] * Decision tree API [SPARK-15767] -> I'll add notes to this JIRA * If there is time, begin work on forests or boosting. > GSoC 2016: Exposing more R and Python APIs for MLlib > > > Key: SPARK-15069 > URL: https://issues.apache.org/jira/browse/SPARK-15069 > Project: Spark > Issue Type: Umbrella > Components: ML, PySpark, SparkR >Reporter: Joseph K. Bradley >Assignee: Kai Jiang > Labels: gsoc2016, mentor > Attachments: 1458791046_[GSoC2016]ApacheSpark_KaiJiang_Proposal.pdf > > > This issue is for tracking the Google Summer of Code 2016 project for Kai > Jiang: "Apache Spark: Exposing more R and Python APIs for MLlib" > See attached proposal for details. Note that the tasks listed in the > proposal are tentative and can adapt as the community works on these various > parts of MLlib. > This umbrella will contain links for tasks included in this project, to be > added as each task begins. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15981) Fix bug in python DataStreamReader
[ https://issues.apache.org/jira/browse/SPARK-15981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-15981. -- Resolution: Fixed Fix Version/s: 2.0.0 > Fix bug in python DataStreamReader > -- > > Key: SPARK-15981 > URL: https://issues.apache.org/jira/browse/SPARK-15981 > Project: Spark > Issue Type: Sub-task > Components: SQL, Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Blocker > Fix For: 2.0.0 > > > A bug in the Python DataStreamReader API made it unusable. Because a single path > was being converted to an array before calling the Java DataStreamReader method > (which takes a string only), it gave the following error. > {code} > File "/Users/tdas/Projects/Spark/spark/python/pyspark/sql/readwriter.py", > line 947, in pyspark.sql.readwriter.DataStreamReader.json > Failed example: > json_sdf = spark.readStream.json(os.path.join(tempfile.mkdtemp(), > 'data'), schema = sdf_schema) > Exception raised: > Traceback (most recent call last): > File > "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/doctest.py", > line 1253, in __run > compileflags, 1) in test.globs > File "", line > 1, in > json_sdf = spark.readStream.json(os.path.join(tempfile.mkdtemp(), > 'data'), schema = sdf_schema) > File > "/Users/tdas/Projects/Spark/spark/python/pyspark/sql/readwriter.py", line > 963, in json > return self._df(self._jreader.json(path)) > File > "/Users/tdas/Projects/Spark/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 933, in __call__ > answer, self.gateway_client, self.target_id, self.name) > File "/Users/tdas/Projects/Spark/spark/python/pyspark/sql/utils.py", > line 63, in deco > return f(*a, **kw) > File > "/Users/tdas/Projects/Spark/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", > line 316, in get_return_value > format(target_id, ".", name, value)) > Py4JError: An error occurred while calling o121.json. Trace: > py4j.Py4JException: Method json([class java.util.ArrayList]) does not > exist > at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318) > at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326) > at py4j.Gateway.invoke(Gateway.java:272) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:211) > at java.lang.Thread.run(Thread.java:744) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
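The mismatch makes the shape of the fix clear: the JVM-side streaming reader exposes {{json(String)}} only, so the Python wrapper must forward the single string instead of wrapping it in a list as the batch {{DataFrameReader}} does. A simplified sketch of the corrected wrapper (not the exact patch):

{code}
# Simplified sketch: forward one string path straight to the JVM
# DataStreamReader.json(String) instead of converting it to a list.
def json(self, path, schema=None):
    if schema is not None:
        self.schema(schema)
    if isinstance(path, str):
        return self._df(self._jreader.json(path))
    raise TypeError("path can be only a single string")
{code}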
[jira] [Resolved] (SPARK-15999) Wrong/Missing information for Spark UI/REST port
[ https://issues.apache.org/jira/browse/SPARK-15999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-15999. --- Resolution: Not A Problem You're referring to an old version -- normally we try to report JIRAs vs master. But this aspect hasn't changed, and I don't think it's confusing. The Spark application UI tries to bind to 4040, then 4041, etc. if 4040 is not available. This is true for streaming jobs too. You haven't specified what error you encounter in trying to access the REST service, but presumably it's not port related. This has enough problems that I think it should be closed. Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark first. > Wrong/Missing information for Spark UI/REST port > > > Key: SPARK-15999 > URL: https://issues.apache.org/jira/browse/SPARK-15999 > Project: Spark > Issue Type: Bug > Components: Documentation, Streaming >Affects Versions: 1.5.0 > Environment: CDH5.5.2, Spark 1.5.0 >Reporter: Faisal >Priority: Minor > > *Spark Monitoring documentation* > https://spark.apache.org/docs/1.5.0/monitoring.html > {quote} > You can access this interface by simply opening http://<driver-node>:4040 in > a web browser. If multiple SparkContexts are running on the same host, they > will bind to successive ports beginning with 4040 (4041, 4042, etc). > {quote} > This statement is very confusing and doesn't apply at all to spark streaming > jobs (unless I am missing something). > The same is the case with REST API calls. > {quote} > REST API > In addition to viewing the metrics in the UI, they are also available as > JSON. This gives developers an easy way to create new visualizations and > monitoring tools for Spark. The JSON is available for both running > applications, and in the history server. The endpoints are mounted at > /api/v1. Eg., for the history server, they would typically be accessible at > http://<server-url>:18080/api/v1, and for a running application, at > http://localhost:4040/api/v1. > {quote} > I am running a spark streaming job on CDH-5.5.2 with Spark version 1.5.0, > and on neither the driver node nor the executor nodes of the running/live > application am I able to call the REST service. > My spark streaming jobs run in yarn cluster mode: > --master yarn-cluster > However, for the historyServer > I am able to call the REST service and pull up JSON messages > using the URL > http://historyServer:18088/api/v1/applications > {code} > [ { > "id" : "application_1463099418950_11465", > "name" : "PySparkShell", > "attempts" : [ { > "startTime" : "2016-06-15T15:28:32.460GMT", > "endTime" : "2016-06-15T19:01:39.100GMT", > "sparkUser" : "abc", > "completed" : true > } ] > }, { > "id" : "application_1463099418950_11635", > "name" : "DataProcessor-ETL.ETIME", > "attempts" : [ { > "attemptId" : "1", > "startTime" : "2016-06-15T18:56:04.413GMT", > "endTime" : "2016-06-15T18:58:00.022GMT", > "sparkUser" : "abc", > "completed" : true > } ] > }, > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15999) Wrong/Missing information for Spark UI/REST port
Faisal created SPARK-15999: -- Summary: Wrong/Missing information for Spark UI/REST port Key: SPARK-15999 URL: https://issues.apache.org/jira/browse/SPARK-15999 Project: Spark Issue Type: Bug Components: Documentation, Streaming Affects Versions: 1.5.0 Environment: CDH5.5.2, Spark 1.5.0 Reporter: Faisal Priority: Minor *Spark Monitoring documentation* https://spark.apache.org/docs/1.5.0/monitoring.html {quote} You can access this interface by simply opening http://<driver-node>:4040 in a web browser. If multiple SparkContexts are running on the same host, they will bind to successive ports beginning with 4040 (4041, 4042, etc). {quote} This statement is very confusing and doesn't apply at all to spark streaming jobs (unless I am missing something). The same is the case with REST API calls. {quote} REST API In addition to viewing the metrics in the UI, they are also available as JSON. This gives developers an easy way to create new visualizations and monitoring tools for Spark. The JSON is available for both running applications, and in the history server. The endpoints are mounted at /api/v1. Eg., for the history server, they would typically be accessible at http://<server-url>:18080/api/v1, and for a running application, at http://localhost:4040/api/v1. {quote} I am running a spark streaming job on CDH-5.5.2 with Spark version 1.5.0, and on neither the driver node nor the executor nodes of the running/live application am I able to call the REST service. My spark streaming jobs run in yarn cluster mode: --master yarn-cluster However, for the historyServer I am able to call the REST service and pull up JSON messages using the URL http://historyServer:18088/api/v1/applications {code} [ { "id" : "application_1463099418950_11465", "name" : "PySparkShell", "attempts" : [ { "startTime" : "2016-06-15T15:28:32.460GMT", "endTime" : "2016-06-15T19:01:39.100GMT", "sparkUser" : "abc", "completed" : true } ] }, { "id" : "application_1463099418950_11635", "name" : "DataProcessor-ETL.ETIME", "attempts" : [ { "attemptId" : "1", "startTime" : "2016-06-15T18:56:04.413GMT", "endTime" : "2016-06-15T18:58:00.022GMT", "sparkUser" : "abc", "completed" : true } ] }, {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
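For reference, the documented endpoints can be exercised with a few lines of standard-library Python. In yarn-cluster mode the driver UI is proxied through the YARN ResourceManager, which is why hitting port 4040 on an arbitrary node fails; the history server endpoint below stays at a fixed address (the host name is hypothetical):

{code}
# Sketch: list completed applications via the history server's REST API.
import json
from urllib.request import urlopen

with urlopen("http://historyServer:18088/api/v1/applications") as resp:
    for app in json.load(resp):
        print(app["id"], app["name"])
{code}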
[jira] [Updated] (SPARK-12114) ColumnPruning rule fails in case of "Project <- Filter <- Join"
[ https://issues.apache.org/jira/browse/SPARK-12114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12114: -- Assignee: Min Qiu > ColumnPruning rule fails in case of "Project <- Filter <- Join" > --- > > Key: SPARK-12114 > URL: https://issues.apache.org/jira/browse/SPARK-12114 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Min Qiu >Assignee: Min Qiu > Fix For: 2.0.0 > > > For the query > {code} > SELECT c_name, c_custkey, o_orderkey, o_orderdate, >o_totalprice, sum(l_quantity) > FROM customer join orders join lineitem > on c_custkey = o_custkey AND o_orderkey = l_orderkey > left outer join (SELECT l_orderkey tmp_orderkey > FROM lineitem > GROUP BY l_orderkey > HAVING sum(l_quantity) > 300) tmp > on o_orderkey = tmp_orderkey > WHERE tmp_orderkey IS NOT NULL > GROUP BY c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice > ORDER BY o_totalprice DESC, o_orderdate > {code} > The optimizedPlan is > {code} > Sort \[o_totalprice#48 DESC,o_orderdate#49 ASC] > > Aggregate > \[c_name#38,c_custkey#37,o_orderkey#45,o_orderdate#49,o_totalprice#48], > \[c_name#38,c_custkey#37,o_orderkey#45, > o_orderdate#49,o_totalprice#48,SUM(l_quantity#58) AS _c5#36] > {color: green}Project > \[c_name#38,o_orderdate#49,c_custkey#37,o_orderkey#45,o_totalprice#48,l_quantity#58] >Filter IS NOT NULL tmp_orderkey#35 > Join LeftOuter, Some((o_orderkey#45 = tmp_orderkey#35)){color} > Join Inner, Some((c_custkey#37 = o_custkey#46)) > MetastoreRelation default, customer, None > Join Inner, Some((o_orderkey#45 = l_orderkey#54)) >MetastoreRelation default, orders, None >MetastoreRelation default, lineitem, None > Project \[tmp_orderkey#35] > Filter havingCondition#86 >Aggregate \[l_orderkey#70], \[(SUM(l_quantity#74) > 300.0) AS > havingCondition#86,l_orderkey#70 AS tmp_orderkey#35] > Project \[l_orderkey#70,l_quantity#74] > MetastoreRelation default, lineitem, None > {code} > Due to the pattern highlighted in green that the ColumnPruning rule fails to > deal with, all columns of lineitem and orders tables are scanned. The > unneeded columns are also involved in the data Shuffling. The performance is > extremely bad if any one of the two tables is big. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9689) Cache doesn't refresh for HadoopFsRelation based table
[ https://issues.apache.org/jira/browse/SPARK-9689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-9689: - Assignee: Cheng Hao > Cache doesn't refresh for HadoopFsRelation based table > -- > > Key: SPARK-9689 > URL: https://issues.apache.org/jira/browse/SPARK-9689 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1, 1.5.0 >Reporter: Cheng Hao >Assignee: Cheng Hao > Fix For: 2.0.0 > > > {code:title=example|borderStyle=solid} > // create a HadoopFsRelation based table > sql(s""" > |CREATE TEMPORARY TABLE jsonTable (a int, b string) > |USING org.apache.spark.sql.json.DefaultSource > |OPTIONS ( > | path '${path.toString}' > |)""".stripMargin) > > // give the value from table jt > sql( > s""" > |INSERT OVERWRITE TABLE jsonTable SELECT a, b FROM jt > """.stripMargin) > // cache the HadoopFsRelation Table > sqlContext.cacheTable("jsonTable") > > // update the HadoopFsRelation Table > sql( > s""" > |INSERT OVERWRITE TABLE jsonTable SELECT a * 2, b FROM jt > """.stripMargin) > // Even this will fail > sql("SELECT a, b FROM jsonTable").collect() > // This will fail, as the cache doesn't refresh > checkAnswer( > sql("SELECT a, b FROM jsonTable"), > sql("SELECT a * 2, b FROM jt").collect()) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
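On affected versions the cache can be refreshed by hand after each overwrite. A workaround sketch using the 2.0-style catalog API (the {{spark}} session variable is assumed; on 1.x, {{sqlContext.uncacheTable}}/{{cacheTable}} play the same role):

{code}
# Workaround sketch: manually invalidate and rebuild the cache entry
# after INSERT OVERWRITE rewrites the files under the table's path.
spark.catalog.refreshTable("jsonTable")   # drop the stale file listing
spark.catalog.uncacheTable("jsonTable")
spark.catalog.cacheTable("jsonTable")     # re-cache from the new files
spark.sql("SELECT a, b FROM jsonTable").show()
{code}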
[jira] [Resolved] (SPARK-11882) Allow for running Spark applications against a custom coarse grained scheduler
[ https://issues.apache.org/jira/browse/SPARK-11882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-11882. --- Resolution: Duplicate Fix Version/s: (was: 2.0.0) > Allow for running Spark applications against a custom coarse grained scheduler > -- > > Key: SPARK-11882 > URL: https://issues.apache.org/jira/browse/SPARK-11882 > Project: Spark > Issue Type: Wish > Components: Spark Core, Spark Submit >Reporter: Jacek Lewandowski >Priority: Minor > > SparkContext decides which scheduler to use according to the Master > URI. How about running applications against a custom scheduler? Such a custom > scheduler would just extend {{CoarseGrainedSchedulerBackend}}. > The custom scheduler would be created by a provided factory. Factories would > be defined in the configuration like > {{spark.scheduler.factory.<name>=<factory class>}}, where {{name}} is the > scheduler name. {{SparkContext}}, once it learns that the master address is not > for standalone, Yarn, Mesos, local or any other predefined scheduler, it > would resolve the scheme from the provided master URI and look for the scheduler > factory with the name equal to the resolved scheme. > For example: > {{spark.scheduler.factory.custom=org.a.b.c.CustomSchedulerFactory}} > then the Master address would be {{custom://192.168.1.1}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-11882) Allow for running Spark applications against a custom coarse grained scheduler
[ https://issues.apache.org/jira/browse/SPARK-11882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reopened SPARK-11882: --- Just resolving as duplicate then > Allow for running Spark applications against a custom coarse grained scheduler > -- > > Key: SPARK-11882 > URL: https://issues.apache.org/jira/browse/SPARK-11882 > Project: Spark > Issue Type: Wish > Components: Spark Core, Spark Submit >Reporter: Jacek Lewandowski >Priority: Minor > > SparkContext decides which scheduler to use according to the Master > URI. How about running applications against a custom scheduler? Such a custom > scheduler would just extend {{CoarseGrainedSchedulerBackend}}. > The custom scheduler would be created by a provided factory. Factories would > be defined in the configuration like > {{spark.scheduler.factory.<name>=<factory class>}}, where {{name}} is the > scheduler name. {{SparkContext}}, once it learns that the master address is not > for standalone, Yarn, Mesos, local or any other predefined scheduler, it > would resolve the scheme from the provided master URI and look for the scheduler > factory with the name equal to the resolved scheme. > For example: > {{spark.scheduler.factory.custom=org.a.b.c.CustomSchedulerFactory}} > then the Master address would be {{custom://192.168.1.1}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-12248) Make Spark Coarse Mesos Scheduler obey limits on memory/cpu ratios
[ https://issues.apache.org/jira/browse/SPARK-12248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reopened SPARK-12248: --- > Make Spark Coarse Mesos Scheduler obey limits on memory/cpu ratios > -- > > Key: SPARK-12248 > URL: https://issues.apache.org/jira/browse/SPARK-12248 > Project: Spark > Issue Type: Improvement > Components: Mesos >Reporter: Charles Allen > Fix For: 2.0.0 > > > It is possible to have spark apps that work best with either more memory or > more CPU. > In a multi-tenant environment (such as Mesos) it can be very beneficial to be > able to limit the Coarse scheduler to guarantee an executor doesn't subscribe > to too many cpus or too much memory. > This ask is to add functionality to the Coarse Mesos Scheduler to have basic > limits to the ratio of memory to cpu, which default to the current behavior > of soaking up whatever resources it can. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15996) Fix R examples by removing deprecated functions
[ https://issues.apache.org/jira/browse/SPARK-15996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-15996: -- Assignee: Dongjoon Hyun > Fix R examples by removing deprecated functions > --- > > Key: SPARK-15996 > URL: https://issues.apache.org/jira/browse/SPARK-15996 > Project: Spark > Issue Type: Bug > Components: Examples >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 2.0.0 > > > Currently, R examples(dataframe.R and data-manipulation.R) fail like the > following. We had better update those before releasing 2.0 RC. This issue > updates them to use up-to-date APIs. > {code} > $ bin/spark-submit examples/src/main/r/dataframe.R > ... > Warning message: > 'createDataFrame(sqlContext...)' is deprecated. > Use 'createDataFrame(data, schema = NULL, samplingRatio = 1.0)' instead. > See help("Deprecated") > ... > Warning message: > 'read.json(sqlContext...)' is deprecated. > Use 'read.json(path)' instead. > See help("Deprecated") > ... > Error: could not find function "registerTempTable" > Execution halted > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12248) Make Spark Coarse Mesos Scheduler obey limits on memory/cpu ratios
[ https://issues.apache.org/jira/browse/SPARK-12248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-12248. --- Resolution: Not A Problem > Make Spark Coarse Mesos Scheduler obey limits on memory/cpu ratios > -- > > Key: SPARK-12248 > URL: https://issues.apache.org/jira/browse/SPARK-12248 > Project: Spark > Issue Type: Improvement > Components: Mesos >Reporter: Charles Allen > Fix For: 2.0.0 > > > It is possible to have spark apps that work best with either more memory or > more CPU. > In a multi-tenant environment (such as Mesos) it can be very beneficial to be > able to limit the Coarse scheduler to guarantee an executor doesn't subscribe > to too many cpus or too much memory. > This ask is to add functionality to the Coarse Mesos Scheduler to have basic > limits to the ratio of memory to cpu, which default to the current behavior > of soaking up whatever resources it can. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15934) Return binary mode in ThriftServer
[ https://issues.apache.org/jira/browse/SPARK-15934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-15934: -- Assignee: Egor Pakhomov > Return binary mode in ThriftServer > -- > > Key: SPARK-15934 > URL: https://issues.apache.org/jira/browse/SPARK-15934 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Egor Pahomov >Assignee: Egor Pakhomov >Priority: Critical > Fix For: 2.0.0 > > > In the spark-2.0.0 preview, binary mode was turned off (SPARK-15095). > This was a greatly irresponsible step, given that binary mode was the default > in 1.6.1 and was turned off in 2.0.0. > Just to describe the magnitude of harm that not fixing this bug would do in my > organization: > * Tableau works only through Thrift Server and only with the binary format. > Tableau would not work with spark-2.0.0 at all! > * I have a bunch of analysts in my organization with configured sql > clients (DataGrip and Squirrel). I would need to go one by one to change the > connection string for them (DataGrip). Squirrel simply does not work with http - > some jar hell in my case. > * Let me not mention all the other stuff which connects to our data > infrastructure through ThriftServer as a gateway. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10757) Java friendly constructor for distributed matrices
[ https://issues.apache.org/jira/browse/SPARK-10757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-10757. --- Resolution: Won't Fix > Java friendly constructor for distributed matrices > -- > > Key: SPARK-10757 > URL: https://issues.apache.org/jira/browse/SPARK-10757 > Project: Spark > Issue Type: Improvement > Components: Java API, MLlib >Reporter: Yanbo Liang >Priority: Minor > > Currently users can not construct > BlockMatrix/RowMatrix/IndexedRowMatrix/CoordinateMatrix at Java side because > that these classes did not provide java friendly constructors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15467) Getting stack overflow when attempting to query a wide Dataset (>200 fields)
[ https://issues.apache.org/jira/browse/SPARK-15467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15334541#comment-15334541 ] Herman van Hovell commented on SPARK-15467: --- [~kiszk] Shouldn't this be opened against the new repo? https://github.com/janino-compiler/janino > Getting stack overflow when attempting to query a wide Dataset (>200 fields) > > > Key: SPARK-15467 > URL: https://issues.apache.org/jira/browse/SPARK-15467 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Don Drake > > This can be duplicated in a spark-shell, I am running Spark 2.0.0-preview. > {code} > import spark.implicits._ > case class Wide( > val f0:String = "", > val f1:String = "", > val f2:String = "", > val f3:String = "", > val f4:String = "", > val f5:String = "", > val f6:String = "", > val f7:String = "", > val f8:String = "", > val f9:String = "", > val f10:String = "", > val f11:String = "", > val f12:String = "", > val f13:String = "", > val f14:String = "", > val f15:String = "", > val f16:String = "", > val f17:String = "", > val f18:String = "", > val f19:String = "", > val f20:String = "", > val f21:String = "", > val f22:String = "", > val f23:String = "", > val f24:String = "", > val f25:String = "", > val f26:String = "", > val f27:String = "", > val f28:String = "", > val f29:String = "", > val f30:String = "", > val f31:String = "", > val f32:String = "", > val f33:String = "", > val f34:String = "", > val f35:String = "", > val f36:String = "", > val f37:String = "", > val f38:String = "", > val f39:String = "", > val f40:String = "", > val f41:String = "", > val f42:String = "", > val f43:String = "", > val f44:String = "", > val f45:String = "", > val f46:String = "", > val f47:String = "", > val f48:String = "", > val f49:String = "", > val f50:String = "", > val f51:String = "", > val f52:String = "", > val f53:String = "", > val f54:String = "", > val f55:String = "", > val f56:String = "", > val f57:String = "", > val f58:String = "", > val f59:String = "", > val f60:String = "", > val f61:String = "", > val f62:String = "", > val f63:String = "", > val f64:String = "", > val f65:String = "", > val f66:String = "", > val f67:String = "", > val f68:String = "", > val f69:String = "", > val f70:String = "", > val f71:String = "", > val f72:String = "", > val f73:String = "", > val f74:String = "", > val f75:String = "", > val f76:String = "", > val f77:String = "", > val f78:String = "", > val f79:String = "", > val f80:String = "", > val f81:String = "", > val f82:String = "", > val f83:String = "", > val f84:String = "", > val f85:String = "", > val f86:String = "", > val f87:String = "", > val f88:String = "", > val f89:String = "", > val f90:String = "", > val f91:String = "", > val f92:String = "", > val f93:String = "", > val f94:String = "", > val f95:String = "", > val f96:String = "", > val f97:String = "", > val f98:String = "", > val f99:String = "", > val f100:String = "", > val f101:String = "", > val f102:String = "", > val f103:String = "", > val f104:String = "", > val f105:String = "", > val f106:String = "", > val f107:String = "", > val f108:String = "", > val f109:String = "", > val f110:String = "", > val f111:String = "", > val f112:String = "", > val f113:String = "", > val f114:String = "", > val f115:String = "", > val f116:String = "", > val f117:String = "", > val f118:String = "", > val f119:String = "", > val f120:String = "", > val f121:String = "", > val f122:String = "", > val 
f123:String = "", > val f124:String = "", > val f125:String = "", > val f126:String = "", > val f127:String = "", > val f128:String = "", > val f129:String = "", > val f130:String = "", > val f131:String = "", > val f132:String = "", > val f133:String = "", > val f134:String = "", > val f135:String = "", > val f136:String = "", > val f137:String = "", > val f138:String = "", > val f139:String = "", > val f140:String = "", > val f141:String = "", > val f142:String = "", > val f143:String = "", > val f144:String = "", > val f145:String = "", > val f146:String = "", > val f147:String = "", > val f148:String = "", > val f149:String = "", > val f150:String = "", > val f151:String = "", > val f152:String = "", > val f153:String = "", > val f154:String = "", > val f155:String = "", > val f156:String = "", > val f157:String = "", > val f158:String = "", > val f159:String = "", > val f160:String = "", > val f161:String = "", > val f162:String = "", > val f163:String = "", > val f164:String = "", > val f165:String = "", > val f166:String = "", > val f167:String = "", > val f168:String = "", > val f169:String = "", > val f170:String = "", > val f171:String = "", > val f172:String = "", > val f173:String = "", > val f174:String = "
[jira] [Resolved] (SPARK-15996) Fix R examples by removing deprecated functions
[ https://issues.apache.org/jira/browse/SPARK-15996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-15996. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13714 [https://github.com/apache/spark/pull/13714] > Fix R examples by removing deprecated functions > --- > > Key: SPARK-15996 > URL: https://issues.apache.org/jira/browse/SPARK-15996 > Project: Spark > Issue Type: Bug > Components: Examples >Reporter: Dongjoon Hyun >Priority: Minor > > Currently, R examples (dataframe.R and data-manipulation.R) fail like the > following. We had better update those before releasing 2.0 RC. This issue > updates them to use up-to-date APIs. > {code} > $ bin/spark-submit examples/src/main/r/dataframe.R > ... > Warning message: > 'createDataFrame(sqlContext...)' is deprecated. > Use 'createDataFrame(data, schema = NULL, samplingRatio = 1.0)' instead. > See help("Deprecated") > ... > Warning message: > 'read.json(sqlContext...)' is deprecated. > Use 'read.json(path)' instead. > See help("Deprecated") > ... > Error: could not find function "registerTempTable" > Execution halted > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15811) Python UDFs do not work in Spark 2.0-preview built with scala 2.10
[ https://issues.apache.org/jira/browse/SPARK-15811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15334474#comment-15334474 ] Apache Spark commented on SPARK-15811: -- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/13717 > Python UDFs do not work in Spark 2.0-preview built with scala 2.10 > -- > > Key: SPARK-15811 > URL: https://issues.apache.org/jira/browse/SPARK-15811 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Franklyn Dsouza >Assignee: Davies Liu >Priority: Blocker > > I've built spark-2.0-preview (8f5a04b) with scala-2.10 using the following > {code} > ./dev/change-version-to-2.10.sh > ./dev/make-distribution.sh -DskipTests -Dzookeeper.version=3.4.5 > -Dcurator.version=2.4.0 -Dscala-2.10 -Phadoop-2.6 -Pyarn -Phive > {code} > and then ran the following code in a pyspark shell > {code} > from pyspark.sql import SparkSession > from pyspark.sql.types import IntegerType, StructField, StructType > from pyspark.sql.functions import udf > from pyspark.sql.types import Row > spark = SparkSession.builder.master('local[4]').appName('2.0 > DF').getOrCreate() > add_one = udf(lambda x: x + 1, IntegerType()) > schema = StructType([StructField('a', IntegerType(), False)]) > df = sqlContext.createDataFrame([Row(a=1),Row(a=2)], schema) > df.select(add_one(df.a).alias('incremented')).collect() > {code} > This never returns with a result. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15811) Python UDFs do not work in Spark 2.0-preview built with scala 2.10
[ https://issues.apache.org/jira/browse/SPARK-15811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15811: Assignee: Davies Liu (was: Apache Spark) > Python UDFs do not work in Spark 2.0-preview built with scala 2.10 > -- > > Key: SPARK-15811 > URL: https://issues.apache.org/jira/browse/SPARK-15811 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Franklyn Dsouza >Assignee: Davies Liu >Priority: Blocker > > I've built spark-2.0-preview (8f5a04b) with scala-2.10 using the following > {code} > ./dev/change-version-to-2.10.sh > ./dev/make-distribution.sh -DskipTests -Dzookeeper.version=3.4.5 > -Dcurator.version=2.4.0 -Dscala-2.10 -Phadoop-2.6 -Pyarn -Phive > {code} > and then ran the following code in a pyspark shell > {code} > from pyspark.sql import SparkSession > from pyspark.sql.types import IntegerType, StructField, StructType > from pyspark.sql.functions import udf > from pyspark.sql.types import Row > spark = SparkSession.builder.master('local[4]').appName('2.0 > DF').getOrCreate() > add_one = udf(lambda x: x + 1, IntegerType()) > schema = StructType([StructField('a', IntegerType(), False)]) > df = sqlContext.createDataFrame([Row(a=1),Row(a=2)], schema) > df.select(add_one(df.a).alias('incremented')).collect() > {code} > This never returns with a result. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15811) Python UDFs do not work in Spark 2.0-preview built with scala 2.10
[ https://issues.apache.org/jira/browse/SPARK-15811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15811: Assignee: Apache Spark (was: Davies Liu) > Python UDFs do not work in Spark 2.0-preview built with scala 2.10 > -- > > Key: SPARK-15811 > URL: https://issues.apache.org/jira/browse/SPARK-15811 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Franklyn Dsouza >Assignee: Apache Spark >Priority: Blocker > > I've built spark-2.0-preview (8f5a04b) with scala-2.10 using the following > {code} > ./dev/change-version-to-2.10.sh > ./dev/make-distribution.sh -DskipTests -Dzookeeper.version=3.4.5 > -Dcurator.version=2.4.0 -Dscala-2.10 -Phadoop-2.6 -Pyarn -Phive > {code} > and then ran the following code in a pyspark shell > {code} > from pyspark.sql import SparkSession > from pyspark.sql.types import IntegerType, StructField, StructType > from pyspark.sql.functions import udf > from pyspark.sql.types import Row > spark = SparkSession.builder.master('local[4]').appName('2.0 > DF').getOrCreate() > add_one = udf(lambda x: x + 1, IntegerType()) > schema = StructType([StructField('a', IntegerType(), False)]) > df = sqlContext.createDataFrame([Row(a=1),Row(a=2)], schema) > df.select(add_one(df.a).alias('incremented')).collect() > {code} > This never returns with a result. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-14048) Aggregation operations on structs fail when the structs have fields with special characters
[ https://issues.apache.org/jira/browse/SPARK-14048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15334456#comment-15334456 ] Simeon Simeonov edited comment on SPARK-14048 at 6/16/16 7:02 PM: -- [~clockfly] The above code executes with no error on the same cluster where the example I shared fails. As I had speculated earlier, there must be something in the particular data structures we have that triggers the problem, which you can see in the attached notebook. was (Author: simeons): [~clockfly] The code executes with no error on the same cluster where the example I shared fails. As I had speculated earlier, there must be something in the particular data structures we have that triggers the problem, which you can see in the attached notebook. > Aggregation operations on structs fail when the structs have fields with > special characters > --- > > Key: SPARK-14048 > URL: https://issues.apache.org/jira/browse/SPARK-14048 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: Databricks w/ 1.6.0 >Reporter: Simeon Simeonov > Labels: sql > Attachments: bug_structs_with_backticks.html > > > Consider a schema where a struct has field names with special characters, > e.g., > {code} > |-- st: struct (nullable = true) > ||-- x.y: long (nullable = true) > {code} > Schemas such as these are frequently generated by the JSON schema generator, > which seems to never want to map JSON data to {{MapType}}, always preferring > to use {{StructType}}. > In SparkSQL, referring to these fields requires backticks, e.g., > {{st.`x.y`}}. There is no problem manipulating these structs unless one is > using an aggregation function. It seems that, under the covers, the code is > not escaping fields with special characters correctly. > For example, > {code} > select first(st) as st from tbl group by something > {code} > generates > {code} > org.apache.spark.sql.catalyst.util.DataTypeException: Unsupported dataType: > struct. If you have a struct and a field name of it has any > special characters, please use backticks (`) to quote that field name, e.g. > `x+y`. Please note that backtick itself is not supported in a field name.
> at > org.apache.spark.sql.catalyst.util.DataTypeParser$class.toDataType(DataTypeParser.scala:100) > at > org.apache.spark.sql.catalyst.util.DataTypeParser$$anon$1.toDataType(DataTypeParser.scala:112) > at > org.apache.spark.sql.catalyst.util.DataTypeParser$.parse(DataTypeParser.scala:116) > at > org.apache.spark.sql.hive.HiveMetastoreTypes$.toDataType(HiveMetastoreCatalog.scala:884) > at > com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:395) > at > com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:394) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > com.databricks.backend.daemon.driver.OutputAggregator$.toJsonSchema(OutputAggregator.scala:394) > at > com.databricks.backend.daemon.driver.OutputAggregator$.maybeApplyOutputAggregation(OutputAggregator.scala:122) > at > com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation0(OutputAggregator.scala:82) > at > com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation(OutputAggregator.scala:42) > at > com.databricks.backend.daemon.driver.DriverLocal.executeSql(DriverLocal.scala:306) > at > com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:161) > at > com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467) > at > com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467) > at scala.util.Try$.apply(Try.scala:161) > at > com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:464) > at > com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:365) > at > com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:196) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: is
[jira] [Commented] (SPARK-14048) Aggregation operations on structs fail when the structs have fields with special characters
[ https://issues.apache.org/jira/browse/SPARK-14048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15334456#comment-15334456 ] Simeon Simeonov commented on SPARK-14048: - [~clockfly] The code executes with no error on the same cluster where the example I shared fails. As I had speculated earlier, there must be something in the particular data structures we have that triggers the problem, which you can see in the attached notebook. > Aggregation operations on structs fail when the structs have fields with > special characters > --- > > Key: SPARK-14048 > URL: https://issues.apache.org/jira/browse/SPARK-14048 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: Databricks w/ 1.6.0 >Reporter: Simeon Simeonov > Labels: sql > Attachments: bug_structs_with_backticks.html > > > Consider a schema where a struct has field names with special characters, > e.g., > {code} > |-- st: struct (nullable = true) > ||-- x.y: long (nullable = true) > {code} > Schemas such as these are frequently generated by the JSON schema generator, > which seems to never want to map JSON data to {{MapType}}, always preferring > to use {{StructType}}. > In SparkSQL, referring to these fields requires backticks, e.g., > {{st.`x.y`}}. There is no problem manipulating these structs unless one is > using an aggregation function. It seems that, under the covers, the code is > not escaping fields with special characters correctly. > For example, > {code} > select first(st) as st from tbl group by something > {code} > generates > {code} > org.apache.spark.sql.catalyst.util.DataTypeException: Unsupported dataType: > struct. If you have a struct and a field name of it has any > special characters, please use backticks (`) to quote that field name, e.g. > `x+y`. Please note that backtick itself is not supported in a field name.
> at > org.apache.spark.sql.catalyst.util.DataTypeParser$class.toDataType(DataTypeParser.scala:100) > at > org.apache.spark.sql.catalyst.util.DataTypeParser$$anon$1.toDataType(DataTypeParser.scala:112) > at > org.apache.spark.sql.catalyst.util.DataTypeParser$.parse(DataTypeParser.scala:116) > at > org.apache.spark.sql.hive.HiveMetastoreTypes$.toDataType(HiveMetastoreCatalog.scala:884) > at > com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:395) > at > com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:394) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > com.databricks.backend.daemon.driver.OutputAggregator$.toJsonSchema(OutputAggregator.scala:394) > at > com.databricks.backend.daemon.driver.OutputAggregator$.maybeApplyOutputAggregation(OutputAggregator.scala:122) > at > com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation0(OutputAggregator.scala:82) > at > com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation(OutputAggregator.scala:42) > at > com.databricks.backend.daemon.driver.DriverLocal.executeSql(DriverLocal.scala:306) > at > com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:161) > at > com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467) > at > com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467) > at scala.util.Try$.apply(Try.scala:161) > at > com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:464) > at > com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:365) > at > com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:196) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
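For anyone reproducing the report, here is a minimal sketch of the schema shape involved. It assumes a Spark 2.0 shell (the report itself is against Databricks with 1.6, where registerTempTable would be used instead of createOrReplaceTempView), and the table and column names are made up:
{code}
// Build a struct whose field name contains a dot, the shape described above.
import spark.implicits._
import org.apache.spark.sql.functions.struct

val df = Seq((1, 10L), (1, 20L)).toDF("something", "v")
  .select($"something", struct($"v".as("x.y")).as("st"))
df.createOrReplaceTempView("tbl")

// Plain nested access works once the field is backtick-quoted:
spark.sql("select st.`x.y` from tbl").show()
// Aggregating the whole struct is the reported failing pattern:
spark.sql("select first(st) as st from tbl group by something").show()
{code}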
[jira] [Commented] (SPARK-15467) Getting stack overflow when attempting to query a wide Dataset (>200 fields)
[ https://issues.apache.org/jira/browse/SPARK-15467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15334443#comment-15334443 ] Kazuaki Ishizaki commented on SPARK-15467: -- We are waiting for author's review at https://github.com/aunkrig/janino/pull/7 > Getting stack overflow when attempting to query a wide Dataset (>200 fields) > > > Key: SPARK-15467 > URL: https://issues.apache.org/jira/browse/SPARK-15467 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Don Drake > > This can be duplicated in a spark-shell, I am running Spark 2.0.0-preview. > {code} > import spark.implicits._ > case class Wide( > val f0:String = "", > val f1:String = "", > val f2:String = "", > val f3:String = "", > val f4:String = "", > val f5:String = "", > val f6:String = "", > val f7:String = "", > val f8:String = "", > val f9:String = "", > val f10:String = "", > val f11:String = "", > val f12:String = "", > val f13:String = "", > val f14:String = "", > val f15:String = "", > val f16:String = "", > val f17:String = "", > val f18:String = "", > val f19:String = "", > val f20:String = "", > val f21:String = "", > val f22:String = "", > val f23:String = "", > val f24:String = "", > val f25:String = "", > val f26:String = "", > val f27:String = "", > val f28:String = "", > val f29:String = "", > val f30:String = "", > val f31:String = "", > val f32:String = "", > val f33:String = "", > val f34:String = "", > val f35:String = "", > val f36:String = "", > val f37:String = "", > val f38:String = "", > val f39:String = "", > val f40:String = "", > val f41:String = "", > val f42:String = "", > val f43:String = "", > val f44:String = "", > val f45:String = "", > val f46:String = "", > val f47:String = "", > val f48:String = "", > val f49:String = "", > val f50:String = "", > val f51:String = "", > val f52:String = "", > val f53:String = "", > val f54:String = "", > val f55:String = "", > val f56:String = "", > val f57:String = "", > val f58:String = "", > val f59:String = "", > val f60:String = "", > val f61:String = "", > val f62:String = "", > val f63:String = "", > val f64:String = "", > val f65:String = "", > val f66:String = "", > val f67:String = "", > val f68:String = "", > val f69:String = "", > val f70:String = "", > val f71:String = "", > val f72:String = "", > val f73:String = "", > val f74:String = "", > val f75:String = "", > val f76:String = "", > val f77:String = "", > val f78:String = "", > val f79:String = "", > val f80:String = "", > val f81:String = "", > val f82:String = "", > val f83:String = "", > val f84:String = "", > val f85:String = "", > val f86:String = "", > val f87:String = "", > val f88:String = "", > val f89:String = "", > val f90:String = "", > val f91:String = "", > val f92:String = "", > val f93:String = "", > val f94:String = "", > val f95:String = "", > val f96:String = "", > val f97:String = "", > val f98:String = "", > val f99:String = "", > val f100:String = "", > val f101:String = "", > val f102:String = "", > val f103:String = "", > val f104:String = "", > val f105:String = "", > val f106:String = "", > val f107:String = "", > val f108:String = "", > val f109:String = "", > val f110:String = "", > val f111:String = "", > val f112:String = "", > val f113:String = "", > val f114:String = "", > val f115:String = "", > val f116:String = "", > val f117:String = "", > val f118:String = "", > val f119:String = "", > val f120:String = "", > val f121:String = "", > val f122:String = "", > val f123:String = "", > 
val f124:String = "", > val f125:String = "", > val f126:String = "", > val f127:String = "", > val f128:String = "", > val f129:String = "", > val f130:String = "", > val f131:String = "", > val f132:String = "", > val f133:String = "", > val f134:String = "", > val f135:String = "", > val f136:String = "", > val f137:String = "", > val f138:String = "", > val f139:String = "", > val f140:String = "", > val f141:String = "", > val f142:String = "", > val f143:String = "", > val f144:String = "", > val f145:String = "", > val f146:String = "", > val f147:String = "", > val f148:String = "", > val f149:String = "", > val f150:String = "", > val f151:String = "", > val f152:String = "", > val f153:String = "", > val f154:String = "", > val f155:String = "", > val f156:String = "", > val f157:String = "", > val f158:String = "", > val f159:String = "", > val f160:String = "", > val f161:String = "", > val f162:String = "", > val f163:String = "", > val f164:String = "", > val f165:String = "", > val f166:String = "", > val f167:String = "", > val f168:String = "", > val f169:String = "", > val f170:String = "", > val f171:String = "", > val f172:String = "", > val f173:String = "", > val f174:String = "", > val f175:String =
[jira] [Updated] (SPARK-15811) Python UDFs do not work in Spark 2.0-preview built with scala 2.10
[ https://issues.apache.org/jira/browse/SPARK-15811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-15811: --- Description: I've built spark-2.0-preview (8f5a04b) with scala-2.10 using the following {code} ./dev/change-version-to-2.10.sh ./dev/make-distribution.sh -DskipTests -Dzookeeper.version=3.4.5 -Dcurator.version=2.4.0 -Dscala-2.10 -Phadoop-2.6 -Pyarn -Phive {code} and then ran the following code in a pyspark shell {code} from pyspark.sql import SparkSession from pyspark.sql.types import IntegerType, StructField, StructType from pyspark.sql.functions import udf from pyspark.sql.types import Row spark = SparkSession.builder.master('local[4]').appName('2.0 DF').getOrCreate() add_one = udf(lambda x: x + 1, IntegerType()) schema = StructType([StructField('a', IntegerType(), False)]) df = sqlContext.createDataFrame([Row(a=1),Row(a=2)], schema) df.select(add_one(df.a).alias('incremented')).collect() {code} This never returns with a result. was: I've built spark-2.0-preview (8f5a04b) with scala-2.10 using the following {code} ./dev/change-version-to-2.10.sh ./dev/make-distribution.sh -DskipTests -Dzookeeper.version=3.4.5 -Dcurator.version=2.4.0 -Dscala-2.10 -Phadoop-2.6 -Pyarn -Phive {code} and then ran the following code in a pyspark shell {code} from pyspark.sql.types import IntegerType, StructField, StructType from pyspark.sql.functions import udf from pyspark.sql.types import Row spark = SparkSession.builder.master('local[4]').appName('2.0 DF').getOrCreate() add_one = udf(lambda x: x + 1, IntegerType()) schema = StructType([StructField('a', IntegerType(), False)]) df = sqlContext.createDataFrame([Row(a=1),Row(a=2)], schema) df.select(add_one(df.a).alias('incremented')).collect() {code} This never returns with a result. > Python UDFs do not work in Spark 2.0-preview built with scala 2.10 > -- > > Key: SPARK-15811 > URL: https://issues.apache.org/jira/browse/SPARK-15811 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Franklyn Dsouza >Assignee: Davies Liu >Priority: Blocker > > I've built spark-2.0-preview (8f5a04b) with scala-2.10 using the following > {code} > ./dev/change-version-to-2.10.sh > ./dev/make-distribution.sh -DskipTests -Dzookeeper.version=3.4.5 > -Dcurator.version=2.4.0 -Dscala-2.10 -Phadoop-2.6 -Pyarn -Phive > {code} > and then ran the following code in a pyspark shell > {code} > from pyspark.sql import SparkSession > from pyspark.sql.types import IntegerType, StructField, StructType > from pyspark.sql.functions import udf > from pyspark.sql.types import Row > spark = SparkSession.builder.master('local[4]').appName('2.0 > DF').getOrCreate() > add_one = udf(lambda x: x + 1, IntegerType()) > schema = StructType([StructField('a', IntegerType(), False)]) > df = sqlContext.createDataFrame([Row(a=1),Row(a=2)], schema) > df.select(add_one(df.a).alias('incremented')).collect() > {code} > This never returns with a result. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15998) Verification of SQLConf HIVE_METASTORE_PARTITION_PRUNING
[ https://issues.apache.org/jira/browse/SPARK-15998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15998: Assignee: Apache Spark > Verification of SQLConf HIVE_METASTORE_PARTITION_PRUNING > > > Key: SPARK-15998 > URL: https://issues.apache.org/jira/browse/SPARK-15998 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li >Assignee: Apache Spark > > HIVE_METASTORE_PARTITION_PRUNING is a public SQLConf. When true, some > predicates will be pushed down into the Hive metastore so that non-matching > partitions can be eliminated earlier. The current default value is false. > So far, the code base does not have a test case to verify whether this > SQLConf works properly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15998) Verification of SQLConf HIVE_METASTORE_PARTITION_PRUNING
[ https://issues.apache.org/jira/browse/SPARK-15998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15334418#comment-15334418 ] Apache Spark commented on SPARK-15998: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/13716 > Verification of SQLConf HIVE_METASTORE_PARTITION_PRUNING > > > Key: SPARK-15998 > URL: https://issues.apache.org/jira/browse/SPARK-15998 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > HIVE_METASTORE_PARTITION_PRUNING is a public SQLConf. When true, some > predicates will be pushed down into the Hive metastore so that non-matching > partitions can be eliminated earlier. The current default value is false. > So far, the code base does not have a test case to verify whether this > SQLConf works properly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15998) Verification of SQLConf HIVE_METASTORE_PARTITION_PRUNING
[ https://issues.apache.org/jira/browse/SPARK-15998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15998: Assignee: (was: Apache Spark) > Verification of SQLConf HIVE_METASTORE_PARTITION_PRUNING > > > Key: SPARK-15998 > URL: https://issues.apache.org/jira/browse/SPARK-15998 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > HIVE_METASTORE_PARTITION_PRUNING is a public SQLConf. When true, some > predicates will be pushed down into the Hive metastore so that non-matching > partitions can be eliminated earlier. The current default value is false. > So far, the code base does not have a test case to verify whether this > SQLConf works properly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15998) Verification of SQLConf HIVE_METASTORE_PARTITION_PRUNING
Xiao Li created SPARK-15998: --- Summary: Verification of SQLConf HIVE_METASTORE_PARTITION_PRUNING Key: SPARK-15998 URL: https://issues.apache.org/jira/browse/SPARK-15998 Project: Spark Issue Type: Test Components: SQL Affects Versions: 2.0.0 Reporter: Xiao Li HIVE_METASTORE_PARTITION_PRUNING is a public SQLConf. When true, some predicates will be pushed down into the Hive metastore so that non-matching partitions can be eliminated earlier. The current default value is false. So far, the code base does not have a test case to verify whether this SQLConf works properly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
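A verification test along the proposed lines might look like the sketch below. It assumes that the conf key behind HIVE_METASTORE_PARTITION_PRUNING is "spark.sql.hive.metastorePartitionPruning", and the table name is hypothetical:
{code}
// Toggle the conf, then check that a partition-column predicate is pushed
// to the Hive metastore instead of listing every partition.
spark.conf.set("spark.sql.hive.metastorePartitionPruning", "true")
spark.sql("SELECT * FROM partitioned_tbl WHERE part = '2016-06-16'").explain(true)
// With pruning enabled, non-matching partitions should never be listed or scanned.
{code}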
[jira] [Created] (SPARK-15997) Audit ml.feature: Update documentation for ml feature transformers
Gayathri Murali created SPARK-15997: --- Summary: Audit ml.feature: Update documentation for ml feature transformers Key: SPARK-15997 URL: https://issues.apache.org/jira/browse/SPARK-15997 Project: Spark Issue Type: Documentation Components: ML, MLlib Affects Versions: 2.0.0 Reporter: Gayathri Murali This JIRA is a subtask of SPARK-15100 and improves documentation for new features added to: 1. HashingTF, 2. CountVectorizer, and 3. QuantileDiscretizer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15996) Fix R examples by removing deprecated functions
[ https://issues.apache.org/jira/browse/SPARK-15996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-15996: -- Description: Currently, R examples (dataframe.R and data-manipulation.R) fail like the following. We had better update those before releasing 2.0 RC. This issue updates them to use up-to-date APIs. {code} $ bin/spark-submit examples/src/main/r/dataframe.R ... Warning message: 'createDataFrame(sqlContext...)' is deprecated. Use 'createDataFrame(data, schema = NULL, samplingRatio = 1.0)' instead. See help("Deprecated") ... Warning message: 'read.json(sqlContext...)' is deprecated. Use 'read.json(path)' instead. See help("Deprecated") ... Error: could not find function "registerTempTable" Execution halted {code} was: Currently, the R dataframe example fails like the following. We had better update it before releasing 2.0 RC. This issue updates it to use up-to-date APIs. {code} $ bin/spark-submit examples/src/main/r/dataframe.R ... Warning message: 'createDataFrame(sqlContext...)' is deprecated. Use 'createDataFrame(data, schema = NULL, samplingRatio = 1.0)' instead. See help("Deprecated") ... Warning message: 'read.json(sqlContext...)' is deprecated. Use 'read.json(path)' instead. See help("Deprecated") ... Error: could not find function "registerTempTable" Execution halted {code} Summary: Fix R examples by removing deprecated functions (was: Fix R dataframe example by removing deprecated functions) > Fix R examples by removing deprecated functions > --- > > Key: SPARK-15996 > URL: https://issues.apache.org/jira/browse/SPARK-15996 > Project: Spark > Issue Type: Bug > Components: Examples >Reporter: Dongjoon Hyun >Priority: Minor > > Currently, R examples (dataframe.R and data-manipulation.R) fail like the > following. We had better update those before releasing 2.0 RC. This issue > updates them to use up-to-date APIs. > {code} > $ bin/spark-submit examples/src/main/r/dataframe.R > ... > Warning message: > 'createDataFrame(sqlContext...)' is deprecated. > Use 'createDataFrame(data, schema = NULL, samplingRatio = 1.0)' instead. > See help("Deprecated") > ... > Warning message: > 'read.json(sqlContext...)' is deprecated. > Use 'read.json(path)' instead. > See help("Deprecated") > ... > Error: could not find function "registerTempTable" > Execution halted > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15981) Fix bug in python DataStreamReader
[ https://issues.apache.org/jira/browse/SPARK-15981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-15981: -- Description: Bug in Python DataStreamReader API made it unusable. Because a single path was being converted to an array before calling the Java DataStreamReader method (which takes a string only), it gave the following error. {code} File "/Users/tdas/Projects/Spark/spark/python/pyspark/sql/readwriter.py", line 947, in pyspark.sql.readwriter.DataStreamReader.json Failed example: json_sdf = spark.readStream.json(os.path.join(tempfile.mkdtemp(), 'data'), schema = sdf_schema) Exception raised: Traceback (most recent call last): File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/doctest.py", line 1253, in __run compileflags, 1) in test.globs File "", line 1, in json_sdf = spark.readStream.json(os.path.join(tempfile.mkdtemp(), 'data'), schema = sdf_schema) File "/Users/tdas/Projects/Spark/spark/python/pyspark/sql/readwriter.py", line 963, in json return self._df(self._jreader.json(path)) File "/Users/tdas/Projects/Spark/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__ answer, self.gateway_client, self.target_id, self.name) File "/Users/tdas/Projects/Spark/spark/python/pyspark/sql/utils.py", line 63, in deco return f(*a, **kw) File "/Users/tdas/Projects/Spark/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 316, in get_return_value format(target_id, ".", name, value)) Py4JError: An error occurred while calling o121.json. Trace: py4j.Py4JException: Method json([class java.util.ArrayList]) does not exist at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318) at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326) at py4j.Gateway.invoke(Gateway.java:272) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:211) at java.lang.Thread.run(Thread.java:744) {code} was: Bug in Python DataStreamReader API made it unusable. > Fix bug in python DataStreamReader > -- > > Key: SPARK-15981 > URL: https://issues.apache.org/jira/browse/SPARK-15981 > Project: Spark > Issue Type: Sub-task > Components: SQL, Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Blocker > > Bug in Python DataStreamReader API made it unusable. Because a single path > was being converted to an array before calling the Java DataStreamReader method > (which takes a string only), it gave the following error.
> {code} > File "/Users/tdas/Projects/Spark/spark/python/pyspark/sql/readwriter.py", > line 947, in pyspark.sql.readwriter.DataStreamReader.json > Failed example: > json_sdf = spark.readStream.json(os.path.join(tempfile.mkdtemp(), > 'data'), schema = sdf_schema) > Exception raised: > Traceback (most recent call last): > File > "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/doctest.py", > line 1253, in __run > compileflags, 1) in test.globs > File "", line > 1, in > json_sdf = spark.readStream.json(os.path.join(tempfile.mkdtemp(), > 'data'), schema = sdf_schema) > File > "/Users/tdas/Projects/Spark/spark/python/pyspark/sql/readwriter.py", line > 963, in json > return self._df(self._jreader.json(path)) > File > "/Users/tdas/Projects/Spark/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 933, in __call__ > answer, self.gateway_client, self.target_id, self.name) > File "/Users/tdas/Projects/Spark/spark/python/pyspark/sql/utils.py", > line 63, in deco > return f(*a, **kw) > File > "/Users/tdas/Projects/Spark/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", > line 316, in get_return_value > format(target_id, ".", name, value)) > Py4JError: An error occurred while calling o121.json. Trace: > py4j.Py4JException: Method json([class java.util.ArrayList]) does not > exist > at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318) > at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326) > at py4j.Gateway.invoke(Gateway.java:272) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:211) > at java.lang.Thread.run(Thread.java:744) > {code} -- This message was sent by Atlassian JIRA
[jira] [Closed] (SPARK-12248) Make Spark Coarse Mesos Scheduler obey limits on memory/cpu ratios
[ https://issues.apache.org/jira/browse/SPARK-12248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Charles Allen closed SPARK-12248. - Resolution: Fixed Fix Version/s: 2.0.0 > Make Spark Coarse Mesos Scheduler obey limits on memory/cpu ratios > -- > > Key: SPARK-12248 > URL: https://issues.apache.org/jira/browse/SPARK-12248 > Project: Spark > Issue Type: Improvement > Components: Mesos >Reporter: Charles Allen > Fix For: 2.0.0 > > > It is possible to have spark apps that work best with either more memory or > more CPU. > In a multi-tenant environment (such as Mesos) it can be very beneficial to be > able to limit the Coarse scheduler to guarantee an executor doesn't subscribe > to too many cpus or too much memory. > This ask is to add functionality to the Coarse Mesos Scheduler to have basic > limits to the ratio of memory to cpu, which default to the current behavior > of soaking up whatever resources it can. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12248) Make Spark Coarse Mesos Scheduler obey limits on memory/cpu ratios
[ https://issues.apache.org/jira/browse/SPARK-12248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15334376#comment-15334376 ] Charles Allen commented on SPARK-12248: --- The limit of one task per slave seems to have been removed. That solves at least my use case in this matter. > Make Spark Coarse Mesos Scheduler obey limits on memory/cpu ratios > -- > > Key: SPARK-12248 > URL: https://issues.apache.org/jira/browse/SPARK-12248 > Project: Spark > Issue Type: Improvement > Components: Mesos >Reporter: Charles Allen > Fix For: 2.0.0 > > > It is possible to have spark apps that work best with either more memory or > more CPU. > In a multi-tenant environment (such as Mesos) it can be very beneficial to be > able to limit the Coarse scheduler to guarantee an executor doesn't subscribe > to too many cpus or too much memory. > This ask is to add functionality to the Coarse Mesos Scheduler to have basic > limits to the ratio of memory to cpu, which default to the current behavior > of soaking up whatever resources it can. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15992) Code cleanup mesos coarse backend offer evaluation workflow
[ https://issues.apache.org/jira/browse/SPARK-15992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Charles Allen updated SPARK-15992: -- Attachment: (was: 0001-Refactor-MesosCoarseGrainedSchedulerBackend-offer-co.patch) > Code cleanup mesos coarse backend offer evaluation workflow > --- > > Key: SPARK-15992 > URL: https://issues.apache.org/jira/browse/SPARK-15992 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 2.0.0 >Reporter: Charles Allen > Labels: code-cleanup > > The offer acceptance workflow is a little hard to follow and not very > extensible for future considerations for offers. This is a patch that makes > the workflow a little more explicit in its handling of offer resources. > Patch incoming -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15994) Allow enabling Mesos fetch cache in coarse executor backend
[ https://issues.apache.org/jira/browse/SPARK-15994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Charles Allen updated SPARK-15994: -- Attachment: (was: 0001-Add-ability-to-enable-mesos-fetch-cache.patch) > Allow enabling Mesos fetch cache in coarse executor backend > > > Key: SPARK-15994 > URL: https://issues.apache.org/jira/browse/SPARK-15994 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 2.0.0 >Reporter: Charles Allen > > Mesos 0.23.0 introduces a Fetch Cache feature > http://mesos.apache.org/documentation/latest/fetcher/ which allows caching of > resources specified in command URIs. > This patch: > * Updates the Mesos shaded protobuf dependency to 0.23.0 > * Allows setting `spark.mesos.fetchCache.enable` to enable the fetch cache > for all specified URIs. (URIs must be specified for the setting to have any > effect) > * Updates documentation for Mesos configuration with the new setting. > This patch does NOT: > * Allow for per-URI caching configuration. The cache setting is global to ALL > URIs for the command. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
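If the patch lands as described, enabling the cache from application code might look like the following sketch. The flag name is taken from the description above; the executor URI is a made-up example of a resource the fetcher would cache:
{code}
// Hypothetical usage of the proposed setting on a Spark-on-Mesos application.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("fetch-cache-demo")
  .set("spark.mesos.fetchCache.enable", "true")               // proposed flag
  .set("spark.executor.uri", "hdfs:///dist/spark-2.0.0.tgz")  // a URI the fetcher would cache
{code}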
[jira] [Assigned] (SPARK-15992) Code cleanup mesos coarse backend offer evaluation workflow
[ https://issues.apache.org/jira/browse/SPARK-15992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15992: Assignee: Apache Spark > Code cleanup mesos coarse backend offer evaluation workflow > --- > > Key: SPARK-15992 > URL: https://issues.apache.org/jira/browse/SPARK-15992 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 2.0.0 >Reporter: Charles Allen >Assignee: Apache Spark > Labels: code-cleanup > Attachments: > 0001-Refactor-MesosCoarseGrainedSchedulerBackend-offer-co.patch > > > The offer acceptance workflow is a little hard to follow and not very > extensible for future considerations for offers. This is a patch that makes > the workflow a little more explicit in its handling of offer resources. > Patch incoming -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15992) Code cleanup mesos coarse backend offer evaluation workflow
[ https://issues.apache.org/jira/browse/SPARK-15992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15992: Assignee: (was: Apache Spark) > Code cleanup mesos coarse backend offer evaluation workflow > --- > > Key: SPARK-15992 > URL: https://issues.apache.org/jira/browse/SPARK-15992 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 2.0.0 >Reporter: Charles Allen > Labels: code-cleanup > Attachments: > 0001-Refactor-MesosCoarseGrainedSchedulerBackend-offer-co.patch > > > The offer acceptance workflow is a little hard to follow and not very > extensible for future considerations for offers. This is a patch that makes > the workflow a little more explicit in its handling of offer resources. > Patch incoming -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15992) Code cleanup mesos coarse backend offer evaluation workflow
[ https://issues.apache.org/jira/browse/SPARK-15992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15334364#comment-15334364 ] Apache Spark commented on SPARK-15992: -- User 'drcrallen' has created a pull request for this issue: https://github.com/apache/spark/pull/13715 > Code cleanup mesos coarse backend offer evaluation workflow > --- > > Key: SPARK-15992 > URL: https://issues.apache.org/jira/browse/SPARK-15992 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 2.0.0 >Reporter: Charles Allen > Labels: code-cleanup > Attachments: > 0001-Refactor-MesosCoarseGrainedSchedulerBackend-offer-co.patch > > > The offer acceptance workflow is a little hard to follow and not very > extensible for future considerations for offers. This is a patch that makes > the workflow a little more explicit in its handling of offer resources. > Patch incoming -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15996) Fix R dataframe example by removing deprecated functions
[ https://issues.apache.org/jira/browse/SPARK-15996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15996: Assignee: (was: Apache Spark) > Fix R dataframe example by removing deprecated functions > > > Key: SPARK-15996 > URL: https://issues.apache.org/jira/browse/SPARK-15996 > Project: Spark > Issue Type: Bug > Components: Examples >Reporter: Dongjoon Hyun >Priority: Minor > > Currently, the R dataframe example fails like the following. We had better update > it before releasing 2.0 RC. This issue updates it to use up-to-date APIs. > {code} > $ bin/spark-submit examples/src/main/r/dataframe.R > ... > Warning message: > 'createDataFrame(sqlContext...)' is deprecated. > Use 'createDataFrame(data, schema = NULL, samplingRatio = 1.0)' instead. > See help("Deprecated") > ... > Warning message: > 'read.json(sqlContext...)' is deprecated. > Use 'read.json(path)' instead. > See help("Deprecated") > ... > Error: could not find function "registerTempTable" > Execution halted > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15996) Fix R dataframe example by removing deprecated functions
[ https://issues.apache.org/jira/browse/SPARK-15996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15996: Assignee: Apache Spark > Fix R dataframe example by removing deprecated functions > > > Key: SPARK-15996 > URL: https://issues.apache.org/jira/browse/SPARK-15996 > Project: Spark > Issue Type: Bug > Components: Examples >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Minor > > Currently, the R dataframe example fails like the following. We had better update > it before releasing 2.0 RC. This issue updates it to use up-to-date APIs. > {code} > $ bin/spark-submit examples/src/main/r/dataframe.R > ... > Warning message: > 'createDataFrame(sqlContext...)' is deprecated. > Use 'createDataFrame(data, schema = NULL, samplingRatio = 1.0)' instead. > See help("Deprecated") > ... > Warning message: > 'read.json(sqlContext...)' is deprecated. > Use 'read.json(path)' instead. > See help("Deprecated") > ... > Error: could not find function "registerTempTable" > Execution halted > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15996) Fix R dataframe example by removing deprecated functions
[ https://issues.apache.org/jira/browse/SPARK-15996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15334362#comment-15334362 ] Apache Spark commented on SPARK-15996: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/13714 > Fix R dataframe example by removing deprecated functions > > > Key: SPARK-15996 > URL: https://issues.apache.org/jira/browse/SPARK-15996 > Project: Spark > Issue Type: Bug > Components: Examples >Reporter: Dongjoon Hyun >Priority: Minor > > Currently, the R dataframe example fails like the following. We had better update > it before releasing 2.0 RC. This issue updates it to use up-to-date APIs. > {code} > $ bin/spark-submit examples/src/main/r/dataframe.R > ... > Warning message: > 'createDataFrame(sqlContext...)' is deprecated. > Use 'createDataFrame(data, schema = NULL, samplingRatio = 1.0)' instead. > See help("Deprecated") > ... > Warning message: > 'read.json(sqlContext...)' is deprecated. > Use 'read.json(path)' instead. > See help("Deprecated") > ... > Error: could not find function "registerTempTable" > Execution halted > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15994) Allow enabling Mesos fetch cache in coarse executor backend
[ https://issues.apache.org/jira/browse/SPARK-15994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15994: Assignee: Apache Spark > Allow enabling Mesos fetch cache in coarse executor backend > > > Key: SPARK-15994 > URL: https://issues.apache.org/jira/browse/SPARK-15994 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 2.0.0 >Reporter: Charles Allen >Assignee: Apache Spark > Attachments: 0001-Add-ability-to-enable-mesos-fetch-cache.patch > > > Mesos 0.23.0 introduces a Fetch Cache feature > http://mesos.apache.org/documentation/latest/fetcher/ which allows caching of > resources specified in command URIs. > This patch: > * Updates the Mesos shaded protobuf dependency to 0.23.0 > * Allows setting `spark.mesos.fetchCache.enable` to enable the fetch cache > for all specified URIs. (URIs must be specified for the setting to have any > effect) > * Updates documentation for Mesos configuration with the new setting. > This patch does NOT: > * Allow for per-URI caching configuration. The cache setting is global to ALL > URIs for the command. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15994) Allow enabling Mesos fetch cache in coarse executor backend
[ https://issues.apache.org/jira/browse/SPARK-15994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15334356#comment-15334356 ] Apache Spark commented on SPARK-15994: -- User 'drcrallen' has created a pull request for this issue: https://github.com/apache/spark/pull/13713 > Allow enabling Mesos fetch cache in coarse executor backend > > > Key: SPARK-15994 > URL: https://issues.apache.org/jira/browse/SPARK-15994 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 2.0.0 >Reporter: Charles Allen > Attachments: 0001-Add-ability-to-enable-mesos-fetch-cache.patch > > > Mesos 0.23.0 introduces a Fetch Cache feature > http://mesos.apache.org/documentation/latest/fetcher/ which allows caching of > resources specified in command URIs. > This patch: > * Updates the Mesos shaded protobuf dependency to 0.23.0 > * Allows setting `spark.mesos.fetchCache.enable` to enable the fetch cache > for all specified URIs. (URIs must be specified for the setting to have any > effect) > * Updates documentation for Mesos configuration with the new setting. > This patch does NOT: > * Allow for per-URI caching configuration. The cache setting is global to ALL > URIs for the command. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15994) Allow enabling Mesos fetch cache in coarse executor backend
[ https://issues.apache.org/jira/browse/SPARK-15994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15994: Assignee: (was: Apache Spark) > Allow enabling Mesos fetch cache in coarse executor backend > > > Key: SPARK-15994 > URL: https://issues.apache.org/jira/browse/SPARK-15994 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 2.0.0 >Reporter: Charles Allen > Attachments: 0001-Add-ability-to-enable-mesos-fetch-cache.patch > > > Mesos 0.23.0 introduces a Fetch Cache feature > http://mesos.apache.org/documentation/latest/fetcher/ which allows caching of > resources specified in command URIs. > This patch: > * Updates the Mesos shaded protobuf dependency to 0.23.0 > * Allows setting `spark.mesos.fetchCache.enable` to enable the fetch cache > for all specified URIs. (URIs must be specified for the setting to have any > effect) > * Updates documentation for Mesos configuration with the new setting. > This patch does NOT: > * Allow for per-URI caching configuration. The cache setting is global to ALL > URIs for the command. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15608) Add document for ML IsotonicRegression
[ https://issues.apache.org/jira/browse/SPARK-15608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-15608: -- Issue Type: Documentation (was: Sub-task) Parent: (was: SPARK-15099) > Add document for ML IsotonicRegression > -- > > Key: SPARK-15608 > URL: https://issues.apache.org/jira/browse/SPARK-15608 > Project: Spark > Issue Type: Documentation > Components: Documentation >Reporter: Yanbo Liang >Priority: Minor > > Feel free to copy the document from mllib to ml for IsotonicRegression, and > update it if necessary. > Meanwhile, add examples and use "include_example" to include them in docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15099) Audit: ml.regression
[ https://issues.apache.org/jira/browse/SPARK-15099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-15099: -- Assignee: Yanbo Liang > Audit: ml.regression > > > Key: SPARK-15099 > URL: https://issues.apache.org/jira/browse/SPARK-15099 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Reporter: Joseph K. Bradley >Assignee: Yanbo Liang >Priority: Blocker > > Audit this sub-package for new algorithms which do not have corresponding > sections & examples in the user guide. > See parent issue for more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15099) Audit: ml.regression
[ https://issues.apache.org/jira/browse/SPARK-15099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-15099. --- Resolution: Done Marking as done since this JIRA is just for the audit. Thanks [~yanboliang]! > Audit: ml.regression > > > Key: SPARK-15099 > URL: https://issues.apache.org/jira/browse/SPARK-15099 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Reporter: Joseph K. Bradley >Assignee: Yanbo Liang >Priority: Blocker > > Audit this sub-package for new algorithms which do not have corresponding > sections & examples in the user guide. > See parent issue for more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15996) Fix R dataframe example by removing deprecated functions
Dongjoon Hyun created SPARK-15996: - Summary: Fix R dataframe example by removing deprecated functions Key: SPARK-15996 URL: https://issues.apache.org/jira/browse/SPARK-15996 Project: Spark Issue Type: Bug Components: Examples Reporter: Dongjoon Hyun Priority: Minor Currently, the R dataframe example fails like the following. We had better update it before releasing 2.0 RC. This issue updates it to use up-to-date APIs. {code} $ bin/spark-submit examples/src/main/r/dataframe.R ... Warning message: 'createDataFrame(sqlContext...)' is deprecated. Use 'createDataFrame(data, schema = NULL, samplingRatio = 1.0)' instead. See help("Deprecated") ... Warning message: 'read.json(sqlContext...)' is deprecated. Use 'read.json(path)' instead. See help("Deprecated") ... Error: could not find function "registerTempTable" Execution halted {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15100) Audit: ml.feature
[ https://issues.apache.org/jira/browse/SPARK-15100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15334346#comment-15334346 ] Joseph K. Bradley commented on SPARK-15100: --- [~yuhaoyan] Is it correct that you finished the audit of ml.feature? Also, can you please make sure that there are subtasks for each of the issues identified during the audit & that they are linked here? Then we can close this issue. Thanks! > Audit: ml.feature > - > > Key: SPARK-15100 > URL: https://issues.apache.org/jira/browse/SPARK-15100 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Reporter: Joseph K. Bradley >Priority: Blocker > > Audit this sub-package for new algorithms which do not have corresponding > sections & examples in the user guide. > See parent issue for more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15786) joinWith bytecode generation calling ByteBuffer.wrap with InternalRow
[ https://issues.apache.org/jira/browse/SPARK-15786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15334331#comment-15334331 ] Yin Huai commented on SPARK-15786: -- Is there any chance we can let users know exactly what is wrong? This error message is much better than the previous one; however, it still does not point out which part of the user code is not allowed. > joinWith bytecode generation calling ByteBuffer.wrap with InternalRow > - > > Key: SPARK-15786 > URL: https://issues.apache.org/jira/browse/SPARK-15786 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 >Reporter: Richard Marscher >Assignee: Sean Zhong > Fix For: 2.0.0 > > > {code}java.lang.RuntimeException: Error while decoding: > java.util.concurrent.ExecutionException: java.lang.Exception: failed to > compile: org.codehaus.commons.compiler.CompileException: File > 'generated.java', Line 36, Column 107: No applicable constructor/method found > for actual parameters "org.apache.spark.sql.catalyst.InternalRow"; candidates > are: "public static java.nio.ByteBuffer java.nio.ByteBuffer.wrap(byte[])", > "public static java.nio.ByteBuffer java.nio.ByteBuffer.wrap(byte[], int, > int)"{code} > I have been trying to use joinWith along with Option data types to > approximate the RDD semantics for outer joins with Dataset, for a nicer > Scala API. However, using the Dataset.as[] syntax leads to bytecode > generation that tries to pass an InternalRow object into the ByteBuffer.wrap > function, which expects a byte[], optionally with two int arguments. > I have a notebook reproducing this against the 2.0 preview in Databricks > Community Edition: > https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/160347920874755/1039589581260901/673639177603143/latest.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
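A minimal sketch of the pattern the reporter describes (the case classes and field names are hypothetical; only the shape of the code matters). joinWith produces a Dataset of pairs, and re-encoding the right side as an Option via .as[] is the step that triggers the reported codegen failure:
{code}
import org.apache.spark.sql.SparkSession

// Hypothetical types standing in for the reporter's data.
case class User(id: Long, name: String)
case class Order(userId: Long, amount: Double)

object JoinWithOptionRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("JoinWithOptionRepro").getOrCreate()
    import spark.implicits._

    val users = Seq(User(1L, "alice"), User(2L, "bob")).toDS()
    val orders = Seq(Order(1L, 10.0)).toDS()

    // joinWith yields Dataset[(User, Order)]; the .as[] call asks for the
    // unmatched right side to be decoded as None, mimicking RDD outer joins.
    val joined = users
      .joinWith(orders, users("id") === orders("userId"), "left_outer")
      .as[(User, Option[Order])]

    joined.show() // this is where the reported ByteBuffer.wrap error surfaces
    spark.stop()
  }
}
{code}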
[jira] [Commented] (SPARK-15990) Support rolling log aggregation for Spark running on YARN
[ https://issues.apache.org/jira/browse/SPARK-15990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15334320#comment-15334320 ] Apache Spark commented on SPARK-15990: -- User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/13712 > Support rolling log aggregation for Spark running on YARN > - > > Key: SPARK-15990 > URL: https://issues.apache.org/jira/browse/SPARK-15990 > Project: Spark > Issue Type: Improvement > Components: YARN >Reporter: Saisai Shao >Priority: Minor > > YARN has supported rolling log aggregation since version 2.6: it aggregates > logs periodically and uploads them to HDFS. Compared to the previous method, > which only aggregates logs after the application has finished, this speeds > up log aggregation and avoids the problem of overly large log files (running > out of disk). > This issue proposes to introduce the feature for Spark on YARN. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
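For context, a sketch of how the feature might be configured. The Spark property names below come from the linked pull request and could change before it is merged; the YARN setting is the existing Hadoop 2.6+ knob:
{code}
# Sketch only -- Spark property names are taken from the linked pull request
# (https://github.com/apache/spark/pull/13712) and may change before merge.

# YARN side (yarn-site.xml, available since Hadoop 2.6): how often the
# NodeManager rolls and uploads logs for still-running applications.
#   yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds = 3600

# Spark side (spark-defaults.conf): Java regexes choosing which rolled
# log files get aggregated (patterns are illustrative).
spark.yarn.rolledLog.includePattern  stdout.*|stderr.*
spark.yarn.rolledLog.excludePattern  .*\.tmp
{code}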
[jira] [Assigned] (SPARK-15990) Support rolling log aggregation for Spark running on YARN
[ https://issues.apache.org/jira/browse/SPARK-15990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15990: Assignee: (was: Apache Spark) > Support rolling log aggregation for Spark running on YARN > - > > Key: SPARK-15990 > URL: https://issues.apache.org/jira/browse/SPARK-15990 > Project: Spark > Issue Type: Improvement > Components: YARN >Reporter: Saisai Shao >Priority: Minor > > YARN has supported rolling log aggregation since version 2.6: it aggregates > logs periodically and uploads them to HDFS. Compared to the previous method, > which only aggregates logs after the application has finished, this speeds > up log aggregation and avoids the problem of overly large log files (running > out of disk). > This issue proposes to introduce the feature for Spark on YARN. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org