[jira] [Commented] (SPARK-5991) Python API for ML model import/export
[ https://issues.apache.org/jira/browse/SPARK-5991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339853#comment-14339853 ] Apache Spark commented on SPARK-5991: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/4811 > Python API for ML model import/export > - > > Key: SPARK-5991 > URL: https://issues.apache.org/jira/browse/SPARK-5991 > Project: Spark > Issue Type: Umbrella > Components: MLlib, PySpark >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Priority: Critical > > Many ML models support save/load in Scala and Java. The Python API needs > this. It should mostly be a simple matter of calling the JVM methods for > save/load, except for models which are stored in Python (e.g., linear models). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
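The "simple matter of calling the JVM methods for save/load" could look roughly like the sketch below. This is a hypothetical illustration, not the actual MLlib internals: the `JavaSaveable` class name, the `_java_model` attribute, and the Scala companion-object lookup path are all assumptions made for the example.

```python
# Hypothetical sketch of wrapping a JVM model's save/load from Python.
# JavaSaveable and the attribute names here are illustrative assumptions,
# not the real pyspark.mllib classes.

class JavaSaveable:
    """Mixin for Python models backed by a JVM model object."""

    def __init__(self, java_model):
        self._java_model = java_model  # py4j proxy to the Scala model

    def save(self, sc, path):
        # Delegate directly to the Scala model's save(SparkContext, path).
        self._java_model.save(sc._jsc.sc(), path)

    @classmethod
    def load(cls, sc, path, scala_class_name):
        # Look up the Scala companion object on the JVM gateway, call its
        # load method, and wrap the returned JVM object in the Python class.
        java_obj = getattr(sc._jvm, scala_class_name)
        return cls(java_obj.load(sc._jsc.sc(), path))
```

Models whose state lives in Python (e.g., linear models) would instead need their own serialization, which is why the umbrella issue calls them out separately.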
[jira] [Created] (SPARK-6056) Unlimited offHeap memory use causes the RM to kill the container
SaintBacchus created SPARK-6056: --- Summary: Unlimited offHeap memory use causes the RM to kill the container Key: SPARK-6056 URL: https://issues.apache.org/jira/browse/SPARK-6056 Project: Spark Issue Type: Bug Components: Shuffle, Spark Core Affects Versions: 1.2.1 Reporter: SaintBacchus
[jira] [Commented] (SPARK-6055) memory leak in pyspark sql
[ https://issues.apache.org/jira/browse/SPARK-6055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339850#comment-14339850 ] Apache Spark commented on SPARK-6055: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/4810 > memory leak in pyspark sql > -- > > Key: SPARK-6055 > URL: https://issues.apache.org/jira/browse/SPARK-6055 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.1.1, 1.3.0, 1.2.1 >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Blocker > > The __eq__ of DataType is not correct, and the class cache is not used > correctly (a created class cannot be found by its DataType), so lots of > classes are created (saved in _cached_cls) and never released. > Also, all instances of the same DataType share the same hash code, so many > objects end up in a dict with the same hash code; this ends in hash-attack > behavior, and accessing this dict is very slow (depending on the CPython > implementation).
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339849#comment-14339849 ] Pedro Rodriguez commented on SPARK-5556: Based on initial testing, I recall FastLDA in practice being O(1); I should be able to confirm that in a larger-scale test soon. LightLDA is definitely worth looking into, I think; at this point, though, my focus is on getting the FastLDA Gibbs implementation to a mergeable state (tests pass, the refactoring/API for LDA is good, and it performs at scale as well as or better than EM). > Latent Dirichlet Allocation (LDA) using Gibbs sampler > -- > > Key: SPARK-5556 > URL: https://issues.apache.org/jira/browse/SPARK-5556 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Guoqiang Li >
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339847#comment-14339847 ] Guoqiang Li commented on SPARK-5556: [This branch|https://github.com/witgo/spark/tree/lda_Gibbs]'s computational complexity is O(N_dk), where N_dk is the number of unique topics in document d.
[jira] [Commented] (SPARK-6055) memory leak in pyspark sql
[ https://issues.apache.org/jira/browse/SPARK-6055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339846#comment-14339846 ] Apache Spark commented on SPARK-6055: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/4809
[jira] [Commented] (SPARK-6055) memory leak in pyspark sql
[ https://issues.apache.org/jira/browse/SPARK-6055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339840#comment-14339840 ] Apache Spark commented on SPARK-6055: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/4808
[jira] [Created] (SPARK-6054) SQL UDF returning object of case class; regression from 1.2.0
Spiro Michaylov created SPARK-6054: -- Summary: SQL UDF returning object of case class; regression from 1.2.0 Key: SPARK-6054 URL: https://issues.apache.org/jira/browse/SPARK-6054 Project: Spark Issue Type: Bug Affects Versions: 1.3.0 Environment: Windows 8, Scala 2.11.2, Spark 1.3.0 RC1 Reporter: Spiro Michaylov

The following code fails with a stack trace beginning with:

15/02/26 23:21:32 ERROR Executor: Exception in task 2.0 in stage 7.0 (TID 422)
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, tree: scalaUDF(sales#2,discounts#3)
    at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
    at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:309)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:237)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:192)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:207)

Here is the 1.3.0 version of the code:

case class SalesDisc(sales: Double, discounts: Double)
def makeStruct(sales: Double, disc: Double) = SalesDisc(sales, disc)
sqlContext.udf.register("makeStruct", makeStruct _)
val withStruct = sqlContext.sql("SELECT id, sd.sales FROM (SELECT id, makeStruct(sales, discounts) AS sd FROM customerTable) AS d")
withStruct.foreach(println)

This used to work in 1.2.0.
Interestingly, the following simplified version fails similarly, even though it seems to me to be VERY similar to the last test in the UDFSuite:

SELECT makeStruct(sales, discounts) AS sd FROM customerTable

The data table is defined thus:

val custs = Seq(
  Cust(1, "Widget Co", 12.00, 0.00, "AZ"),
  Cust(2, "Acme Widgets", 410500.00, 500.00, "CA"),
  Cust(3, "Widgetry", 410500.00, 200.00, "CA"),
  Cust(4, "Widgets R Us", 410500.00, 0.0, "CA"),
  Cust(5, "Ye Olde Widgete", 500.00, 0.0, "MA")
)
val customerTable = sc.parallelize(custs, 4).toDF()
customerTable.registerTempTable("customerTable")
[jira] [Created] (SPARK-6055) memory leak in pyspark sql
Davies Liu created SPARK-6055: - Summary: memory leak in pyspark sql Key: SPARK-6055 URL: https://issues.apache.org/jira/browse/SPARK-6055 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.2.1, 1.1.1, 1.3.0 Reporter: Davies Liu Assignee: Davies Liu Priority: Blocker The __eq__ of DataType is not correct, and the class cache is not used correctly (a created class cannot be found by its DataType), so lots of classes are created (saved in _cached_cls) and never released. Also, all instances of the same DataType share the same hash code, so many objects end up in a dict with the same hash code; this ends in hash-attack behavior, and accessing this dict is very slow (depending on the CPython implementation).
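The second half of the report (identical hash codes for all instances) can be illustrated without Spark at all. Below, `BadType`/`GoodType` are made-up stand-ins for the real pyspark.sql classes: both produce correct dicts, but every `BadType` key lands in the same hash bucket, so each lookup degenerates into a linear scan over all colliding keys.

```python
# Demo of why a constant __hash__ makes dict lookups slow. BadType and
# GoodType are illustrative, not the actual pyspark.sql.DataType classes.

class BadType:
    def __init__(self, name):
        self.name = name
    def __eq__(self, other):
        return isinstance(other, BadType) and self.name == other.name
    def __hash__(self):
        return 42  # every instance collides in the same bucket

class GoodType(BadType):
    def __hash__(self):
        return hash(self.name)  # equal objects share a hash; unequal ones rarely do

bad_cache = {BadType(str(i)): i for i in range(1000)}
good_cache = {GoodType(str(i)): i for i in range(1000)}

# Both dicts are *correct* -- __eq__ disambiguates the colliding keys --
# but each BadType lookup may compare against up to 1000 candidate keys,
# while a GoodType lookup usually compares against one.
assert bad_cache[BadType("500")] == 500
assert good_cache[GoodType("500")] == 500
```

This is also why identical hash codes are described as "hash-attack" behavior: an adversarial (or accidental) all-collision workload turns average O(1) dict access into O(n).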
[jira] [Updated] (SPARK-6036) EventLog process logic has race condition with Akka actor system
[ https://issues.apache.org/jira/browse/SPARK-6036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-6036: - Labels: backport-needed (was: ) > EventLog process logic has race condition with Akka actor system > > > Key: SPARK-6036 > URL: https://issues.apache.org/jira/browse/SPARK-6036 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 1.3.0 >Reporter: Zhang, Liye >Assignee: Zhang, Liye > Labels: backport-needed > Fix For: 1.4.0 > > > When an application finishes, the Akka actor system triggers a disassociated > event, and the Master rebuilds the SparkUI on the web, a process which checks > whether the event log files are still in progress. The current logic in > SparkContext is to first stop the actor system and then stop the > eventLogListener. As a result, the eventLogListener may not have finished > renaming the event log dir (from app-.inprogress to app-) when the Spark > Master tries to access the dir.
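The ordering fix the report implies can be sketched in miniature: finish the rename (drop the .inprogress suffix) before doing anything that can trigger the Master to read the directory. The class and method names below are illustrative stand-ins, not Spark's actual shutdown code.

```python
# Minimal sketch of the shutdown-ordering fix: stop the event log
# listener (which completes the rename) BEFORE stopping the actor
# system, whose disassociated event makes the Master read the log dir.
import os

class EventLogListener:
    def __init__(self, log_dir, app_id):
        self.path = os.path.join(log_dir, app_id + ".inprogress")
        open(self.path, "w").close()  # the in-progress event log file

    def stop(self):
        final = self.path[:-len(".inprogress")]
        os.rename(self.path, final)  # app-XXX.inprogress -> app-XXX
        return final

class ActorSystem:
    def __init__(self):
        self.stopped = False

    def stop(self):
        # Stopping this fires the "disassociated" event that makes the
        # Master rebuild the SparkUI and inspect the event log dir.
        self.stopped = True

def stop_context(listener, actor_system):
    # Fixed order: listener first, so the rename is complete before the
    # Master can be prompted to look at the directory.
    final_path = listener.stop()
    actor_system.stop()
    return final_path
```

With the original (reversed) ordering, the Master could observe the app-XXX.inprogress name and conclude the application never finished.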
[jira] [Updated] (SPARK-6036) EventLog process logic has race condition with Akka actor system
[ https://issues.apache.org/jira/browse/SPARK-6036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-6036: - Target Version/s: 1.4.0, 1.3.1 Fix Version/s: 1.4.0 Assignee: Zhang, Liye
[jira] [Commented] (SPARK-6050) Spark on YARN does not work when --executor-cores is specified
[ https://issues.apache.org/jira/browse/SPARK-6050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339798#comment-14339798 ] Mridul Muralidharan commented on SPARK-6050: With more verbose debugging added, the problem surfaces. At least with Hadoop 2.5, the returned response always has vCores == 1 (and at the RM it is treated as vCores == 1 too ... sigh, unimplemented?). So in effect, we must not set executorCores while creating "resource" in YarnAllocator. See below for a log snippet:

15/02/27 06:37:33 INFO YarnAllocator: Will request 1 executor containers, each with 2 cores and 32870 MB memory including 2150 MB overhead
15/02/27 06:37:33 DEBUG AMRMClientImpl: Added priority=1
15/02/27 06:37:33 DEBUG AMRMClientImpl: addResourceRequest: applicationId= priority=1 resourceName=* numContainers=1 #asks=1
15/02/27 06:37:33 INFO YarnAllocator: Container request (host: Any, capability: )
15/02/27 06:37:33 INFO YarnAllocator: missing = 0, targetNumExecutors = 1, numPendingAllocate = 1, numExecutorsRunning = 0
15/02/27 06:37:33 INFO AMRMClientImpl: Received new token for : :8041
15/02/27 06:37:33 DEBUG YarnAllocator: Allocated containers: 1. Current executor count: 0. Cluster resources: .
15/02/27 06:37:33 INFO YarnAllocator: allocatedContainer = Container: [ContainerId: , NodeId: :8041, NodeHttpAddress: :8042, Resource: , Priority: 1, Token: Token { kind: ContainerToken, service: :8041 }, ], location =
15/02/27 06:37:33 INFO YarnAllocator: allocatedContainer = Container: [ContainerId: , NodeId: :8041, NodeHttpAddress: :8042, Resource: , Priority: 1, Token: Token { kind: ContainerToken, service: :8041 }, ], location = /
15/02/27 06:37:33 INFO YarnAllocator: allocatedContainer = Container: [ContainerId: , NodeId: :8041, NodeHttpAddress: :8042, Resource: , Priority: 1, Token: Token { kind: ContainerToken, service: :8041 }, ], location = *
15/02/27 06:37:33 DEBUG YarnAllocator: Releasing 1 unneeded containers that were allocated to us
15/02/27 06:37:33 INFO YarnAllocator: Received 1 containers from YARN, launching executors on 0 of them.

> Spark on YARN does not work when --executor-cores is specified > - > > Key: SPARK-6050 > URL: https://issues.apache.org/jira/browse/SPARK-6050 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.3.0 > Environment: 2.5 based YARN cluster. >Reporter: Mridul Muralidharan >Priority: Blocker > > There are multiple issues here (which I will detail as comments), but to > reproduce: running the following ALWAYS hangs in our cluster with the 1.3 RC > ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master > yarn-cluster --executor-cores 8 --num-executors 15 --driver-memory 4g > --executor-memory 2g --queue webmap lib/spark-examples*.jar > 10
[jira] [Created] (SPARK-6053) Support model save/load in Python's ALS.
Xiangrui Meng created SPARK-6053: Summary: Support model save/load in Python's ALS. Key: SPARK-6053 URL: https://issues.apache.org/jira/browse/SPARK-6053 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Minor It should be a simple wrapper around the Scala implementation.
[jira] [Updated] (SPARK-5991) Python API for ML model import/export
[ https://issues.apache.org/jira/browse/SPARK-5991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5991: - Issue Type: Umbrella (was: Sub-task) Parent: (was: SPARK-4587)
[jira] [Updated] (SPARK-5991) Python API for ML model import/export
[ https://issues.apache.org/jira/browse/SPARK-5991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5991: - Target Version/s: (was: 1.4.0)
[jira] [Commented] (SPARK-5845) Time to cleanup spilled shuffle files not included in shuffle write time
[ https://issues.apache.org/jira/browse/SPARK-5845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339794#comment-14339794 ] Ilya Ganelin commented on SPARK-5845: - I'm code complete on this and will submit a PR shortly. > Time to cleanup spilled shuffle files not included in shuffle write time > > > Key: SPARK-5845 > URL: https://issues.apache.org/jira/browse/SPARK-5845 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 1.3.0, 1.2.1 >Reporter: Kay Ousterhout >Assignee: Ilya Ganelin >Priority: Minor > > When the disk is contended, I've observed cases when it takes as long as 7 > seconds to clean up all of the intermediate spill files for a shuffle (when > using the sort based shuffle, but bypassing merging because there are <=200 > shuffle partitions). This is even when the shuffle data is non-huge (152MB > written from one of the tasks where I observed this). This is effectively > part of the shuffle write time (because it's a necessary side effect of > writing data to disk) so should be added to the shuffle write time to > facilitate debugging.
[jira] [Commented] (SPARK-5281) Registering table on RDD is giving MissingRequirementError
[ https://issues.apache.org/jira/browse/SPARK-5281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339779#comment-14339779 ] Sangkyoon Nam commented on SPARK-5281: -- I have the same problem. In my case, I used CDH 5.3.x. > Registering table on RDD is giving MissingRequirementError > -- > > Key: SPARK-5281 > URL: https://issues.apache.org/jira/browse/SPARK-5281 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0 >Reporter: sarsol >Priority: Critical > > Application crashes on this line rdd.registerTempTable("temp") in 1.2 > version when using sbt or Eclipse SCALA IDE > Stacktrace > Exception in thread "main" scala.reflect.internal.MissingRequirementError: > class org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with > primordial classloader with boot classpath > [C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-library.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-reflect.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-actor.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-swing.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-compiler.jar;C:\Program > Files\Java\jre7\lib\resources.jar;C:\Program > Files\Java\jre7\lib\rt.jar;C:\Program > Files\Java\jre7\lib\sunrsasign.jar;C:\Program > Files\Java\jre7\lib\jsse.jar;C:\Program > Files\Java\jre7\lib\jce.jar;C:\Program > Files\Java\jre7\lib\charsets.jar;C:\Program > Files\Java\jre7\lib\jfr.jar;C:\Program Files\Java\jre7\classes] not found.
> at > scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16) > at > scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17) > at > scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48) > at > scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61) > at > scala.reflect.internal.Mirrors$RootsBase.staticModuleOrClass(Mirrors.scala:72) > at > scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:119) > at > scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:21) > at > org.apache.spark.sql.catalyst.ScalaReflection$$typecreator1$1.apply(ScalaReflection.scala:115) > at > scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231) > at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231) > at scala.reflect.api.TypeTags$class.typeOf(TypeTags.scala:335) > at scala.reflect.api.Universe.typeOf(Universe.scala:59) > at > org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:115) > at > org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33) > at > org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:100) > at > org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33) > at > org.apache.spark.sql.catalyst.ScalaReflection$class.attributesFor(ScalaReflection.scala:94) > at > org.apache.spark.sql.catalyst.ScalaReflection$.attributesFor(ScalaReflection.scala:33) > at org.apache.spark.sql.SQLContext.createSchemaRDD(SQLContext.scala:111) > at > com.sar.spark.dq.poc.SparkPOC$delayedInit$body.apply(SparkPOC.scala:43) > at scala.Function0$class.apply$mcV$sp(Function0.scala:40) > at > scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12) > at scala.App$$anonfun$main$1.apply(App.scala:71) > at scala.App$$anonfun$main$1.apply(App.scala:71) > at 
scala.collection.immutable.List.foreach(List.scala:318) > at > scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32) > at scala.App$class.main(App.scala:71)
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339772#comment-14339772 ] Pedro Rodriguez commented on SPARK-5556: See the PR for info. TL;DR: it contains refactoring for multiple LDA algorithms, including how EM would be refactored, and will in the near future contain the Gibbs implementation I have been working on.
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339767#comment-14339767 ] Apache Spark commented on SPARK-5556: - User 'EntilZha' has created a pull request for this issue: https://github.com/apache/spark/pull/4807
[jira] [Commented] (SPARK-6052) In JSON schema inference, we should always set containsNull of an ArrayType to true
[ https://issues.apache.org/jira/browse/SPARK-6052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339761#comment-14339761 ] Apache Spark commented on SPARK-6052: - User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/4806 > In JSON schema inference, we should always set containsNull of an ArrayType > to true > --- > > Key: SPARK-6052 > URL: https://issues.apache.org/jira/browse/SPARK-6052 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Priority: Blocker > > We should not try to figure out whether an array contains null, because > sampling may miss arrays with null, or future data may have nulls in the > array.
[jira] [Commented] (SPARK-6051) Add an option for DirectKafkaInputDStream to commit the offsets into ZK
[ https://issues.apache.org/jira/browse/SPARK-6051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339758#comment-14339758 ] Apache Spark commented on SPARK-6051: - User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/4805 > Add an option for DirectKafkaInputDStream to commit the offsets into ZK > --- > > Key: SPARK-6051 > URL: https://issues.apache.org/jira/browse/SPARK-6051 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.3.0 >Reporter: Saisai Shao > > Currently in DirectKafkaInputDStream, offsets are managed by Spark Streaming > itself, without ZK or Kafka involved, which makes several third-party offset > monitoring tools fail to monitor the status of the Kafka consumer. So this > adds an option to commit the offsets to ZK when each job finishes; the > commit is implemented asynchronously, so the main processing flow is not > blocked. Already tested with the KafkaOffsetMonitor tool.
[jira] [Comment Edited] (SPARK-6050) Spark on YARN does not work when --executor-cores is specified
[ https://issues.apache.org/jira/browse/SPARK-6050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339686#comment-14339686 ] Mridul Muralidharan edited comment on SPARK-6050 at 2/27/15 5:24 AM: - Thanks to [~tgraves] for helping investigate this. There are multiple issues in the codebase - and not all of them have been fully understood. a) For some reason, either YARN returns an incorrect response to an allocate request or we are not setting the right param. Note the snippet [1] to detail this. (I can't share the logs, unfortunately - but Tom has access to them, and it should be trivial for others to reproduce the issue.) b) When (a) happens, for whatever reason, we do not recover from it. All subsequent heartbeat requests DO NOT contain pending allocation requests (and we have rejected/de-allocated whatever YARN just sent us due to (a)). To elaborate: updateResourceRequests has missing == 0 since it is relying on getNumPendingAllocate() - which DOES NOT do the right thing in our context. Note: the 'ask' list in the super class was cleared as part of the previous allocate() call. Fixing (a) will mask (b) - but IMO we should address it at the earliest too. [1] Note the vCore in the response, and the subsequent ignoring of all containers.
15/02/27 01:40:30 INFO YarnAllocator: Will request 1000 executor containers, each with 8 cores and 38912 MB memory including 10240 MB overhead
15/02/27 01:40:30 INFO YarnAllocator: Container request (host: Any, capability: )
15/02/27 01:40:30 INFO ApplicationMaster: Started progress reporter thread - sleep time : 5000
15/02/27 01:40:30 DEBUG ApplicationMaster: Sending progress
15/02/27 01:40:30 INFO YarnAllocator: missing = 0, targetNumExecutors = 1000, numPendingAllocate = 1000, numExecutorsRunning = 0
15/02/27 01:40:35 DEBUG ApplicationMaster: Sending progress
15/02/27 01:40:35 INFO YarnAllocator: missing = 0, targetNumExecutors = 1000, numPendingAllocate = 1000, numExecutorsRunning = 0
15/02/27 01:40:36 DEBUG YarnAllocator: Allocated containers: 1000. Current executor count: 0. Cluster resources: .
15/02/27 01:40:36 DEBUG YarnAllocator: Releasing 1000 unneeded containers that were allocated to us
15/02/27 01:40:36 INFO YarnAllocator: Received 1000 containers from YARN, launching executors on 0 of them.

was (Author: mridulm80): Thanks to [~tgraves] for helping investigate this. There are multiple issues in the codebase - and not all of them have been fully understood. a) For some reason, either YARN returns an incorrect response to an allocate request or we are not setting the right param. Note the snippet [1] to detail this. (I can't share the logs, unfortunately - but Tom has access to them, and it should be trivial for others to reproduce the issue.) b) When (a) happens, for whatever reason, we do not recover from it. All subsequent heartbeat requests DO NOT contain pending allocation requests (and we have rejected/de-allocated whatever YARN just sent us due to (a)). To elaborate: updateResourceRequests has missing == 0 since it is relying on getNumPendingAllocate() - which DOES NOT do the right thing in our context. Note: the 'ask' list in the super class was cleared as part of the previous allocate() call.
Essentially, we were defending against this sort of corner case in our code earlier - but the move to depend on AMRMClientImpl, and the subsequent changes to it from under us, have caused these problems for Spark, IMO. We should be more careful in the future and depend only on interfaces, not implementations, when it is relatively straightforward for us to own that aspect. Fixing (a) will mask (b) - but IMO we should address it at the earliest too. [1] Note the vCore in the response, and the subsequent ignoring of all containers.

15/02/27 01:40:30 INFO YarnAllocator: Will request 1000 executor containers, each with 8 cores and 38912 MB memory including 10240 MB overhead
15/02/27 01:40:30 INFO YarnAllocator: Container request (host: Any, capability: )
15/02/27 01:40:30 INFO ApplicationMaster: Started progress reporter thread - sleep time : 5000
15/02/27 01:40:30 DEBUG ApplicationMaster: Sending progress
15/02/27 01:40:30 INFO YarnAllocator: missing = 0, targetNumExecutors = 1000, numPendingAllocate = 1000, numExecutorsRunning = 0
15/02/27 01:40:35 DEBUG ApplicationMaster: Sending progress
15/02/27 01:40:35 INFO YarnAllocator: missing = 0, targetNumExecutors = 1000, numPendingAllocate = 1000, numExecutorsRunning = 0
15/02/27 01:40:36 DEBUG YarnAllocator: Allocated containers: 1000. Current executor count: 0. Cluster resources: .
15/02/27 01:40:36 DEBUG YarnAllocator: Releasing 1000 unneeded containers that were allocated to us
15/02/27 01:40:36 INFO YarnAllocator: Received 1000 containers from YARN, launching executors on 0 of them.

> Spark on YARN does not work when --executor-cores is specified > --
[jira] [Created] (SPARK-6052) In JSON schema inference, we should always set containsNull of an ArrayType to true
Yin Huai created SPARK-6052: --- Summary: In JSON schema inference, we should always set containsNull of an ArrayType to true Key: SPARK-6052 URL: https://issues.apache.org/jira/browse/SPARK-6052 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Priority: Blocker We should not try to figure out whether an array contains null, because sampling may miss arrays that do contain null, and future data may have nulls even when the current data does not. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
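The risk described above can be shown with a tiny sketch (hypothetical helper names, not Spark's actual inference code): nullability inferred from a sample can mis-declare the schema, while unconditionally setting containsNull to true is always safe.

```python
# Hypothetical sketch of JSON array-type inference (not Spark's real code).
def infer_contains_null_from_sample(sample_arrays):
    # Unsafe: decides nullability from the sampled rows only.
    return any(None in arr for arr in sample_arrays)

def infer_contains_null_safe(sample_arrays):
    # Safe: unsampled or future data may contain nulls, so always True.
    return True

sample = [[1, 2], [3]]   # sampled rows happen to have no nulls
future_row = [4, None]   # but later data does

assert infer_contains_null_from_sample(sample) is False  # would mis-declare the schema
assert infer_contains_null_safe(sample) is True
```

With the unsafe variant, `future_row` would violate the inferred schema at read time; always returning true costs nothing and avoids that class of failure.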
[jira] [Created] (SPARK-6051) Add an option for DirectKafkaInputDStream to commit the offsets into ZK
Saisai Shao created SPARK-6051: -- Summary: Add an option for DirectKafkaInputDStream to commit the offsets into ZK Key: SPARK-6051 URL: https://issues.apache.org/jira/browse/SPARK-6051 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.3.0 Reporter: Saisai Shao Currently in DirectKafkaInputDStream, offsets are managed by Spark Streaming itself without ZK or Kafka involved, which makes several third-party offset monitoring tools fail to monitor the status of the Kafka consumer. So here is an option to commit the offsets to ZK when each job is finished. The commit is implemented asynchronously, so the main processing flow will not be blocked; this has already been tested with the KafkaOffsetMonitor tool. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
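The asynchronous commit described above can be sketched as follows. `commit_to_zk` and the in-memory `committed` dict are hypothetical stand-ins for a real ZooKeeper client call; the point is only that the submit call returns immediately, so the main processing flow is not blocked by the commit.

```python
# Hedged sketch of asynchronous offset commit after each finished job.
from concurrent.futures import ThreadPoolExecutor

committed = {}  # pretend ZK store: (topic, partition) -> offset

def commit_to_zk(topic, partition, offset):
    # Stand-in for a real ZooKeeper write; illustrative only.
    committed[(topic, partition)] = offset

executor = ThreadPoolExecutor(max_workers=1)

def on_batch_completed(offset_ranges):
    # Fire-and-forget: stream processing continues while offsets are written.
    for topic, partition, until_offset in offset_ranges:
        executor.submit(commit_to_zk, topic, partition, until_offset)

on_batch_completed([("events", 0, 42), ("events", 1, 17)])
executor.shutdown(wait=True)  # only for the sketch, so the result is visible
assert committed == {("events", 0): 42, ("events", 1): 17}
```

A single-threaded executor also keeps commits for consecutive batches in order, which matters if a monitoring tool reads the offsets while the stream is running.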
[jira] [Resolved] (SPARK-6024) When a data source table has too many columns, its schema cannot be stored in metastore.
[ https://issues.apache.org/jira/browse/SPARK-6024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-6024. Resolution: Fixed Fix Version/s: 1.3.0 Assignee: Yin Huai > When a data source table has too many columns, it's schema cannot be stored > in metastore. > - > > Key: SPARK-6024 > URL: https://issues.apache.org/jira/browse/SPARK-6024 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Blocker > Fix For: 1.3.0 > > > Because we are using table properties of a Hive metastore table to store the > schema, when a schema is too wide, we cannot persist it in metastore. > {code} > 15/02/25 18:13:50 ERROR metastore.RetryingHMSHandler: Retrying HMSHandler > after 1000 ms (attempt 1 of 1) with error: javax.jdo.JDODataStoreException: > Put request failed : INSERT INTO TABLE_PARAMS (PARAM_VALUE,TBL_ID,PARAM_KEY) > VALUES (?,?,?) > at > org.datanucleus.api.jdo.NucleusJDOHelper.getJDOExceptionForNucleusException(NucleusJDOHelper.java:451) > at > org.datanucleus.api.jdo.JDOPersistenceManager.jdoMakePersistent(JDOPersistenceManager.java:732) > at > org.datanucleus.api.jdo.JDOPersistenceManager.makePersistent(JDOPersistenceManager.java:752) > at > org.apache.hadoop.hive.metastore.ObjectStore.createTable(ObjectStore.java:719) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.hadoop.hive.metastore.RawStoreProxy.invoke(RawStoreProxy.java:108) > at com.sun.proxy.$Proxy15.createTable(Unknown Source) > at > org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_table_core(HiveMetaStore.java:1261) > at > org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_table_with_environment_context(HiveMetaStore.java:1294) 
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:105) > at com.sun.proxy.$Proxy16.create_table_with_environment_context(Unknown > Source) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createTable(HiveMetaStoreClient.java:558) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createTable(HiveMetaStoreClient.java:547) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89) > at com.sun.proxy.$Proxy17.createTable(Unknown Source) > at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:613) > at > org.apache.spark.sql.hive.HiveMetastoreCatalog.createDataSourceTable(HiveMetastoreCatalog.scala:136) > at > org.apache.spark.sql.hive.execution.CreateMetastoreDataSourceAsSelect.run(commands.scala:243) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:55) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:55) > at > org.apache.spark.sql.execution.ExecutedCommand.execute(commands.scala:65) > at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:1092) > at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:1092) > at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:1013) > at 
org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:963) > at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:929) > at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:907) > at > $line39.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:25) > at > $line39.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:30) > at > $line39.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:32) > at $line39.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:34) > at $line39.$read$$iwC$$i
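The failure above comes from a size limit on a single metastore table property value. One hedged sketch of a workaround (roughly the direction later Spark versions took, though the property names below are illustrative, not necessarily the exact keys) is to split the schema JSON into fixed-size chunks stored under numbered keys and reassemble them on read:

```python
# Sketch: store a long schema string across several table properties,
# each under a per-property size limit. Key names are illustrative.
def split_schema(schema_json, limit):
    parts = [schema_json[i:i + limit] for i in range(0, len(schema_json), limit)]
    props = {"schema.numParts": str(len(parts))}
    for i, part in enumerate(parts):
        props[f"schema.part.{i}"] = part
    return props

def read_schema(props):
    # Reassemble the chunks in order.
    n = int(props["schema.numParts"])
    return "".join(props[f"schema.part.{i}"] for i in range(n))

schema = '{"type":"struct","fields":[...]}' * 100  # pretend very wide schema
props = split_schema(schema, limit=4000)
assert read_schema(props) == schema
```

The chunk size would be chosen below whatever limit the metastore backend enforces on PARAM_VALUE.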
[jira] [Commented] (SPARK-5984) TimSort broken
[ https://issues.apache.org/jira/browse/SPARK-5984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339706#comment-14339706 ] Apache Spark commented on SPARK-5984: - User 'hotou' has created a pull request for this issue: https://github.com/apache/spark/pull/4804 > TimSort broken > -- > > Key: SPARK-5984 > URL: https://issues.apache.org/jira/browse/SPARK-5984 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0, 1.1.1, 1.2.0, 1.3.0, 1.2.1 >Reporter: Reynold Xin >Assignee: Aaron Davidson >Priority: Minor > > See > http://envisage-project.eu/proving-android-java-and-python-sorting-algorithm-is-broken-and-how-to-fix-it/ > Our TimSort is based on Android's TimSort, which is broken in some corner > case. Marking it minor as this problem exists for almost all TimSort > implementations out there, including Android, OpenJDK, Python, and it hasn't > manifested itself in practice yet. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
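For readers following the linked write-up: the bug is in TimSort's merge-collapse step, which must re-check the run-length invariant on enough of the top of the run stack after each merge. A hedged Python sketch of the corrected check (illustrative, not the actual Spark or Android source):

```python
# Sketch of a corrected merge_collapse invariant check, per the linked
# analysis: verify runLen[i-2] > runLen[i-1] + runLen[i] for the top TWO
# windows of the run stack, not just the topmost one.
def invariants_hold(run_len):
    n = len(run_len)
    for i in (n - 1, n - 2):  # top window and the one below it
        if i >= 2 and run_len[i - 2] <= run_len[i - 1] + run_len[i]:
            return False
    return True

assert invariants_hold([120, 80, 25]) is True
assert invariants_hold([100, 80, 25]) is False    # 100 <= 80 + 25
assert invariants_hold([5, 120, 80, 25]) is False  # violation one level down
```

The third case is the interesting one: checking only the topmost window reports the stack as healthy, which on adversarial inputs lets the run stack exceed its preallocated bound.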
[jira] [Resolved] (SPARK-3664) Graduate GraphX from alpha to stable
[ https://issues.apache.org/jira/browse/SPARK-3664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Dave resolved SPARK-3664. --- Resolution: Fixed Fix Version/s: 1.2.0 > Graduate GraphX from alpha to stable > > > Key: SPARK-3664 > URL: https://issues.apache.org/jira/browse/SPARK-3664 > Project: Spark > Issue Type: Improvement > Components: GraphX >Reporter: Ankur Dave >Assignee: Ankur Dave > Fix For: 1.2.0 > > > The GraphX API is officially marked as alpha but has been moving toward > stability. This ticket tracks what will be necessary to mark it a stable part > of Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-1015) Visualize the DAG of RDD
[ https://issues.apache.org/jira/browse/SPARK-1015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339690#comment-14339690 ] Jeff Zhang edited comment on SPARK-1015 at 2/27/15 4:03 AM: [~sowen] I may not have time for this recently. bq. How would the visualization work with spark-shell? Is this just a utility you can host outside Spark? I would prefer to use graphviz for visualize the RDD. And spark just build the dot file for graphviz and let the graphviz to visualize it. Besides, I think integrating the DAG view to spark ui may be helpful for users to debug the RDD (especially on performance perspective ) was (Author: zjffdu): [~sowen] I may not have time for this recently. bq. How would the visualization work with spark-shell? Is this just a utility you can host outside Spark? I would prefer to use graphviz for visualize the RDD. And spark just build the dot file for graphviz and let the graphviz to visualize it. > Visualize the DAG of RDD > - > > Key: SPARK-1015 > URL: https://issues.apache.org/jira/browse/SPARK-1015 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 0.9.0 >Reporter: Jeff Zhang > > The DAG of RDD can help user understand the data flow and how spark get the > final RDD executed. It could help user to find chances to optimize the > execution of some complex RDD. I will leverage graphviz to visualize the > DAG. > For this task, I plan to split it into 2 steps. > Step 1. Just visualize the simple DAG graph. Each RDD is one node, and > there will be one edge between the parent RDD and child RDD. ( I attach one > simple graph in the attachments ) > Step 2. Put RDD in the same stage into one sub graph. This may need to > extract the splitting staging related code in DAGSchduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1015) Visualize the DAG of RDD
[ https://issues.apache.org/jira/browse/SPARK-1015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339690#comment-14339690 ] Jeff Zhang commented on SPARK-1015: --- [~sowen] I may not have time for this recently. bq. How would the visualization work with spark-shell? Is this just a utility you can host outside Spark? I would prefer to use graphviz to visualize the RDD: Spark would just build the dot file and let graphviz render it. > Visualize the DAG of RDD > - > > Key: SPARK-1015 > URL: https://issues.apache.org/jira/browse/SPARK-1015 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 0.9.0 >Reporter: Jeff Zhang > > The DAG of RDD can help user understand the data flow and how spark get the > final RDD executed. It could help user to find chances to optimize the > execution of some complex RDD. I will leverage graphviz to visualize the > DAG. > For this task, I plan to split it into 2 steps. > Step 1. Just visualize the simple DAG graph. Each RDD is one node, and > there will be one edge between the parent RDD and child RDD. ( I attach one > simple graph in the attachments ) > Step 2. Put RDD in the same stage into one sub graph. This may need to > extract the splitting staging related code in DAGSchduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
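The proposal above (Spark emits a .dot file and graphviz does the rendering) can be sketched in a few lines; the RDD names and edge list here are illustrative:

```python
# Sketch: turn an RDD lineage (parent -> child edges) into Graphviz DOT text.
# Spark would only write this file; `dot -Tpng` does the actual rendering.
def lineage_to_dot(edges):
    lines = ["digraph rdd_dag {"]
    for parent, child in edges:
        lines.append(f'  "{parent}" -> "{child}";')
    lines.append("}")
    return "\n".join(lines)

edges = [("textFile", "map"), ("map", "reduceByKey")]
dot = lineage_to_dot(edges)
assert '"map" -> "reduceByKey";' in dot
```

Step 2 of the plan (grouping RDDs of one stage) would map naturally onto DOT `subgraph cluster_N { ... }` blocks around each stage's nodes.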
[jira] [Comment Edited] (SPARK-6050) Spark on YARN does not work when --executor-cores is specified
[ https://issues.apache.org/jira/browse/SPARK-6050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339686#comment-14339686 ] Mridul Muralidharan edited comment on SPARK-6050 at 2/27/15 3:50 AM: - Thanks to [~tgraves] for helping investigate this. There are multiple issues in the codebase - and not all of them have been fully understood. a) For some reason, either YARN returns incorrect response to an allocate request or we are not setting the right param. Note the snippet [1] to detail this. (I cant share the logs unfortunately - but Tom has access to it and should be trivial for others to reproduce the issue). b) For whatever reason (a) happens, we do not recover from it. All subsequent requests heartbeat requests DO NOT contain pending allocation requests (and we have rejected/de-allocated whatever yarn just sent us due to (a)). To elaborate; updateResourceRequests has missing == 0 since it is relying on getNumPendingAllocate() - which DOES NOT do the right thing in our context. Note: the 'ask' list in the super class was cleared as part of the previous allocate() call. Essentially we were defending against these sort of corner cases in our code earlier - but the move to depend on AMRMClientImpl and the subsequent changes to it from under us has caused these problems for spark IMO. We should be more careful in future and only depend on interfaces and not implementation when it is relatively straight forward for us to own that aspect. Fixing (a) will mask (b) - but IMO we should address it at the earliest too. [1] Note the vCore in the response, and the subsequent ignoring of all containers. 
15/02/27 01:40:30 INFO YarnAllocator: Will request 1000 executor containers, each with 8 cores and 38912 MB memory including 10240 MB overhead 15/02/27 01:40:30 INFO YarnAllocator: Container request (host: Any, capability: ) 15/02/27 01:40:30 INFO ApplicationMaster: Started progress reporter thread - sleep time : 5000 15/02/27 01:40:30 DEBUG ApplicationMaster: Sending progress 15/02/27 01:40:30 INFO YarnAllocator: missing = 0, targetNumExecutors = 1000, numPendingAllocate = 1000, numExecutorsRunning = 0 15/02/27 01:40:35 DEBUG ApplicationMaster: Sending progress 15/02/27 01:40:35 INFO YarnAllocator: missing = 0, targetNumExecutors = 1000, numPendingAllocate = 1000, numExecutorsRunning = 0 15/02/27 01:40:36 DEBUG YarnAllocator: Allocated containers: 1000. Current executor count: 0. Cluster resources: . 15/02/27 01:40:36 DEBUG YarnAllocator: Releasing 1000 unneeded containers that were allocated to us 15/02/27 01:40:36 INFO YarnAllocator: Received 1000 containers from YARN, launching executors on 0 of them. was (Author: mridulm80): Thanks to [~tgraves] for helping investigate this. There are multiple issues in the codebase - and not all of them have been fully understood. a) For some reason, either YARN returns incorrect response to an allocate request or we are not setting the right param. Note the snippet [1] to detail this. (I cant share the logs unfortunately - but Tom has access to it and should be trivial for others to reproduce the issue). b) For whatever reason (a) happens, we do not recover from it. All subsequent requests heartbeat requests DO NOT contain pending allocation requests (and we have rejected/de-allocated whatever yarn just sent us due to (a)). To elaborate; updateResourceRequests has missing == 0 since it is relying on getNumPendingAllocate() - which DOES NOT do the right thing in our context. Note: the 'ask' list in the super class was cleared as part of the previous allocate() call. 
Essentially we were defending against these sort of corner cases in our code earlier - but the move to depend on AMRMClientImpl and the subsequent changes to it from under us has caused these problems for spark. Fixing (a) will mask (b) - but IMO we should address it at the earliest too. [1] Not the vCore in the response, and the subsequent ignoring of all containers. 15/02/27 01:40:30 INFO YarnAllocator: Will request 1000 executor containers, each with 8 cores and 38912 MB memory including 10240 MB overhead 15/02/27 01:40:30 INFO YarnAllocator: Container request (host: Any, capability: ) 15/02/27 01:40:30 INFO ApplicationMaster: Started progress reporter thread - sleep time : 5000 15/02/27 01:40:30 DEBUG ApplicationMaster: Sending progress 15/02/27 01:40:30 INFO YarnAllocator: missing = 0, targetNumExecutors = 1000, numPendingAllocate = 1000, numExecutorsRunning = 0 15/02/27 01:40:35 DEBUG ApplicationMaster: Sending progress 15/02/27 01:40:35 INFO YarnAllocator: missing = 0, targetNumExecutors = 1000, numPendingAllocate = 1000, numExecutorsRunning = 0 15/02/27 01:40:36 DEBUG YarnAllocator: Allocated containers: 1000. Current executor count: 0. Cluster resources: . 15/02/27 01:40:36 DEBUG YarnAllocator: Releasing 1000 unneeded container
[jira] [Commented] (SPARK-6050) Spark on YARN does not work when --executor-cores is specified
[ https://issues.apache.org/jira/browse/SPARK-6050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339686#comment-14339686 ] Mridul Muralidharan commented on SPARK-6050: Thanks to [~tgraves] for helping investigate this. There are multiple issues in the codebase - and not all of them have been fully understood. a) For some reason, either YARN returns incorrect response to an allocate request or we are not setting the right param. Note the snippet [1] to detail this. (I cant share the logs unfortunately - but Tom has access to it and should be trivial for others to reproduce the issue). b) For whatever reason (a) happens, we do not recover from it. All subsequent requests heartbeat requests DO NOT contain pending allocation requests (and we have rejected/de-allocated whatever yarn just sent us due to (a)). To elaborate; updateResourceRequests has missing == 0 since it is relying on getNumPendingAllocate() - which DOES NOT do the right thing in our context. Note: the 'ask' list in the super class was cleared as part of the previous allocate() call. Essentially we were defending against these sort of corner cases in our code earlier - but the move to depend on AMRMClientImpl and the subsequent changes to it from under us has caused these problems for spark. Fixing (a) will mask (b) - but IMO we should address it at the earliest too. [1] Not the vCore in the response, and the subsequent ignoring of all containers. 
15/02/27 01:40:30 INFO YarnAllocator: Will request 1000 executor containers, each with 8 cores and 38912 MB memory including 10240 MB overhead 15/02/27 01:40:30 INFO YarnAllocator: Container request (host: Any, capability: ) 15/02/27 01:40:30 INFO ApplicationMaster: Started progress reporter thread - sleep time : 5000 15/02/27 01:40:30 DEBUG ApplicationMaster: Sending progress 15/02/27 01:40:30 INFO YarnAllocator: missing = 0, targetNumExecutors = 1000, numPendingAllocate = 1000, numExecutorsRunning = 0 15/02/27 01:40:35 DEBUG ApplicationMaster: Sending progress 15/02/27 01:40:35 INFO YarnAllocator: missing = 0, targetNumExecutors = 1000, numPendingAllocate = 1000, numExecutorsRunning = 0 15/02/27 01:40:36 DEBUG YarnAllocator: Allocated containers: 1000. Current executor count: 0. Cluster resources: . 15/02/27 01:40:36 DEBUG YarnAllocator: Releasing 1000 unneeded containers that were allocated to us 15/02/27 01:40:36 INFO YarnAllocator: Received 1000 containers from YARN, launching executors on 0 of them. > Spark on YARN does not work --executor-cores is specified > - > > Key: SPARK-6050 > URL: https://issues.apache.org/jira/browse/SPARK-6050 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.3.0 > Environment: 2.5 based YARN cluster. >Reporter: Mridul Muralidharan >Priority: Blocker > > There are multiple issues here (which I will detail as comments), but to > reproduce running the following ALWAYS hangs in our cluster with the 1.3 RC > ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master > yarn-cluster --executor-cores 8--num-executors 15 --driver-memory 4g >--executor-memory 2g --queue webmap lib/spark-examples*.jar > 10 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
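Issue (b) above reduces to a bookkeeping problem that a toy model makes visible. The variable names below mirror the log fields but are otherwise illustrative, not the real YarnAllocator internals:

```python
# Toy model of issue (b): `missing` is computed from a client-side pending
# count that still includes requests YARN already answered (and whose
# containers were then released), so no new containers are ever re-requested.
target_num_executors = 1000
num_executors_running = 0
num_pending_allocate = 1000   # stale: YARN already satisfied these requests

missing = target_num_executors - num_executors_running - num_pending_allocate
assert missing == 0  # matches the "missing = 0" lines in the log above

# Correct bookkeeping would decrement the pending count when containers
# arrive, even if they are then rejected, which would make missing == 1000
# here and cause the allocator to re-request executors on the next heartbeat.
```

This is why the application hangs: every heartbeat computes missing = 0, sends no new asks, and zero executors ever launch.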
[jira] [Commented] (SPARK-6033) the description about the "spark.worker.cleanup.enabled" is not matched with the code
[ https://issues.apache.org/jira/browse/SPARK-6033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339684#comment-14339684 ] pengxu commented on SPARK-6033: --- I've already made a PR. Can you help me to review it, thx > the description abou the "spark.worker.cleanup.enabled" is not matched with > the code > > > Key: SPARK-6033 > URL: https://issues.apache.org/jira/browse/SPARK-6033 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 1.2.0, 1.2.1 >Reporter: pengxu >Priority: Minor > > Some error about the section _Cluster Launch Scripts_ in the > http://spark.apache.org/docs/latest/spark-standalone.html > In the description about the property spark.worker.cleanup.enabled, it states > that *all the directory* under the work dir will be removed whether the > application is running or not. > After checking the implementation in the code level, I found that +only the > stopped application+ dirs would be removed. So the description in the > document is incorrect. 
> the code implementation in worker.scala > {code: title=WorkDirCleanup} > case WorkDirCleanup => > // Spin up a separate thread (in a future) to do the dir cleanup; don't > tie up worker actor > val cleanupFuture = concurrent.future { > val appDirs = workDir.listFiles() > if (appDirs == null) { > throw new IOException("ERROR: Failed to list files in " + appDirs) > } > appDirs.filter { dir => > // the directory is used by an application - check that the > application is not running > // when cleaning up > val appIdFromDir = dir.getName > val isAppStillRunning = > executors.values.map(_.appId).contains(appIdFromDir) > dir.isDirectory && !isAppStillRunning && > !Utils.doesDirectoryContainAnyNewFiles(dir, APP_DATA_RETENTION_SECS) > }.foreach { dir => > logInfo(s"Removing directory: ${dir.getPath}") > Utils.deleteRecursively(dir) > } > } > cleanupFuture onFailure { > case e: Throwable => > logError("App dir cleanup failed: " + e.getMessage, e) > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6033) the description about the "spark.worker.cleanup.enabled" is not matched with the code
[ https://issues.apache.org/jira/browse/SPARK-6033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339683#comment-14339683 ] Apache Spark commented on SPARK-6033: - User 'hseagle' has created a pull request for this issue: https://github.com/apache/spark/pull/4803 > the description abou the "spark.worker.cleanup.enabled" is not matched with > the code > > > Key: SPARK-6033 > URL: https://issues.apache.org/jira/browse/SPARK-6033 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 1.2.0, 1.2.1 >Reporter: pengxu >Priority: Minor > > Some error about the section _Cluster Launch Scripts_ in the > http://spark.apache.org/docs/latest/spark-standalone.html > In the description about the property spark.worker.cleanup.enabled, it states > that *all the directory* under the work dir will be removed whether the > application is running or not. > After checking the implementation in the code level, I found that +only the > stopped application+ dirs would be removed. So the description in the > document is incorrect. 
> the code implementation in worker.scala > {code: title=WorkDirCleanup} > case WorkDirCleanup => > // Spin up a separate thread (in a future) to do the dir cleanup; don't > tie up worker actor > val cleanupFuture = concurrent.future { > val appDirs = workDir.listFiles() > if (appDirs == null) { > throw new IOException("ERROR: Failed to list files in " + appDirs) > } > appDirs.filter { dir => > // the directory is used by an application - check that the > application is not running > // when cleaning up > val appIdFromDir = dir.getName > val isAppStillRunning = > executors.values.map(_.appId).contains(appIdFromDir) > dir.isDirectory && !isAppStillRunning && > !Utils.doesDirectoryContainAnyNewFiles(dir, APP_DATA_RETENTION_SECS) > }.foreach { dir => > logInfo(s"Removing directory: ${dir.getPath}") > Utils.deleteRecursively(dir) > } > } > cleanupFuture onFailure { > case e: Throwable => > logError("App dir cleanup failed: " + e.getMessage, e) > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6050) Spark on YARN does not work when --executor-cores is specified
Mridul Muralidharan created SPARK-6050: -- Summary: Spark on YARN does not work when --executor-cores is specified Key: SPARK-6050 URL: https://issues.apache.org/jira/browse/SPARK-6050 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.3.0 Environment: 2.5 based YARN cluster. Reporter: Mridul Muralidharan Priority: Blocker There are multiple issues here (which I will detail as comments), but to reproduce: running the following ALWAYS hangs in our cluster with the 1.3 RC ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --executor-cores 8 --num-executors 15 --driver-memory 4g --executor-memory 2g --queue webmap lib/spark-examples*.jar 10 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6033) the description about the "spark.worker.cleanup.enabled" is not matched with the code
[ https://issues.apache.org/jira/browse/SPARK-6033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339667#comment-14339667 ] pengxu commented on SPARK-6033: --- Ok, I'll do it. > the description abou the "spark.worker.cleanup.enabled" is not matched with > the code > > > Key: SPARK-6033 > URL: https://issues.apache.org/jira/browse/SPARK-6033 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 1.2.0, 1.2.1 >Reporter: pengxu >Priority: Minor > > Some error about the section _Cluster Launch Scripts_ in the > http://spark.apache.org/docs/latest/spark-standalone.html > In the description about the property spark.worker.cleanup.enabled, it states > that *all the directory* under the work dir will be removed whether the > application is running or not. > After checking the implementation in the code level, I found that +only the > stopped application+ dirs would be removed. So the description in the > document is incorrect. 
> the code implementation in worker.scala > {code: title=WorkDirCleanup} > case WorkDirCleanup => > // Spin up a separate thread (in a future) to do the dir cleanup; don't > tie up worker actor > val cleanupFuture = concurrent.future { > val appDirs = workDir.listFiles() > if (appDirs == null) { > throw new IOException("ERROR: Failed to list files in " + appDirs) > } > appDirs.filter { dir => > // the directory is used by an application - check that the > application is not running > // when cleaning up > val appIdFromDir = dir.getName > val isAppStillRunning = > executors.values.map(_.appId).contains(appIdFromDir) > dir.isDirectory && !isAppStillRunning && > !Utils.doesDirectoryContainAnyNewFiles(dir, APP_DATA_RETENTION_SECS) > }.foreach { dir => > logInfo(s"Removing directory: ${dir.getPath}") > Utils.deleteRecursively(dir) > } > } > cleanupFuture onFailure { > case e: Throwable => > logError("App dir cleanup failed: " + e.getMessage, e) > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6037) Avoiding duplicate Parquet schema merging
[ https://issues.apache.org/jira/browse/SPARK-6037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-6037. --- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4786 [https://github.com/apache/spark/pull/4786] > Avoiding duplicate Parquet schema merging > - > > Key: SPARK-6037 > URL: https://issues.apache.org/jira/browse/SPARK-6037 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh >Priority: Minor > Fix For: 1.3.0 > > > FilteringParquetRowInputFormat manually merges Parquet schemas before > computing splits. However, it is duplicate because the schemas are already > merged in ParquetRelation2. We don't need to re-merge them at InputFormat. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6037) Avoiding duplicate Parquet schema merging
[ https://issues.apache.org/jira/browse/SPARK-6037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6037: -- Assignee: Liang-Chi Hsieh > Avoiding duplicate Parquet schema merging > - > > Key: SPARK-6037 > URL: https://issues.apache.org/jira/browse/SPARK-6037 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh >Priority: Minor > > FilteringParquetRowInputFormat manually merges Parquet schemas before > computing splits. However, it is duplicate because the schemas are already > merged in ParquetRelation2. We don't need to re-merge them at InputFormat. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5710) Combines two adjacent `Cast` expressions into one
[ https://issues.apache.org/jira/browse/SPARK-5710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339649#comment-14339649 ] guowei commented on SPARK-5710: --- How about limiting the merge to adjacent casts that were only added by `typeCoercionRules`? We could add a label to `Cast` to mark the ones added by `typeCoercionRules`. > Combines two adjacent `Cast` expressions into one > - > > Key: SPARK-5710 > URL: https://issues.apache.org/jira/browse/SPARK-5710 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.2.1 >Reporter: guowei >Priority: Minor > > A plan after `analyzer` with `typeCoercionRules` may produce many `cast` > expressions. We can combine the adjacent ones. > For example. > create table test(a decimal(3,1)); > explain select * from test where a*2-1>1; > == Physical Plan == > Filter (CAST(CAST((CAST(CAST((CAST(a#5, DecimalType()) * 2), > DecimalType(21,1)), DecimalType()) - 1), DecimalType(22,1)), DecimalType()) > > 1) > HiveTableScan [a#5], (MetastoreRelation default, test, None), None -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
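The suggested rewrite can be sketched on a toy expression tree, with tuples standing in for Catalyst expressions. Note that collapsing casts is only semantics-preserving in restricted cases, which is exactly why the comment proposes limiting it to casts added by `typeCoercionRules`:

```python
# Sketch of the optimization: collapse Cast(Cast(e, t1), t2) into Cast(e, t2).
# An expression is either a leaf value or a ("cast", inner, target_type) tuple.
def collapse_casts(expr):
    if isinstance(expr, tuple) and expr[0] == "cast":
        inner = collapse_casts(expr[1])
        if isinstance(inner, tuple) and inner[0] == "cast":
            return ("cast", inner[1], expr[2])  # drop the intermediate cast
        return ("cast", inner, expr[2])
    return expr

nested = ("cast", ("cast", ("cast", "a", "decimal(21,1)"), "decimal"), "decimal(22,1)")
assert collapse_casts(nested) == ("cast", "a", "decimal(22,1)")
```

In real decimal arithmetic the intermediate cast can change precision or overflow behavior, so a production rule would have to check that the dropped cast cannot affect the result, for example via the proposed coercion-added label.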
[jira] [Closed] (SPARK-5529) BlockManager heartbeat expiration does not kill executor
[ https://issues.apache.org/jira/browse/SPARK-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-5529. Resolution: Fixed Fix Version/s: 1.4.0 Target Version/s: 1.4.0 > BlockManager heartbeat expiration does not kill executor > > > Key: SPARK-5529 > URL: https://issues.apache.org/jira/browse/SPARK-5529 > Project: Spark > Issue Type: Bug > Components: Spark Core, YARN >Affects Versions: 1.2.0 >Reporter: Hong Shen >Assignee: Hong Shen > Fix For: 1.4.0 > > Attachments: SPARK-5529.patch > > > When I run a Spark job, one executor hangs; after 120s its BlockManager is > removed by the driver, but it takes another half an hour before the executor > itself is removed by the driver. Here is the log: > {code} > 15/02/02 14:58:43 WARN BlockManagerMasterActor: Removing BlockManager > BlockManagerId(1, 10.215.143.14, 47234) with no recent heart beats: 147198ms > exceeds 12ms > > 15/02/02 15:26:55 ERROR YarnClientClusterScheduler: Lost executor 1 on > 10.215.143.14: remote Akka client disassociated > 15/02/02 15:26:55 WARN ReliableDeliverySupervisor: Association with remote > system [akka.tcp://sparkExecutor@10.215.143.14:46182] has failed, address is > now gated for [5000] ms. Reason is: [Disassociated]. > 15/02/02 15:26:55 INFO TaskSetManager: Re-queueing tasks for 1 from TaskSet > 0.0 > 15/02/02 15:26:55 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3, > 10.215.143.14): ExecutorLostFailure (executor 1 lost) > 15/02/02 15:26:55 ERROR YarnClientSchedulerBackend: Asked to remove > non-existent executor 1 > 15/02/02 15:26:55 INFO DAGScheduler: Executor lost: 1 (epoch 0) > 15/02/02 15:26:55 INFO BlockManagerMasterActor: Trying to remove executor 1 > from BlockManagerMaster. > 15/02/02 15:26:55 INFO BlockManagerMaster: Removed 1 successfully in > removeExecutor > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6048) SparkConf.translateConfKey should translate on get, not set
[ https://issues.apache.org/jira/browse/SPARK-6048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339629#comment-14339629 ] Patrick Wendell edited comment on SPARK-6048 at 2/27/15 2:33 AM: - Hey All, No opinions on which design we chose to implement internally. However, I do feel strongly that the user-facing precedence should not change between versions. It's not reasonable to assume that no user has both the old and new names for a config value. Configuration files can be very long, or there can be multiple sources of configuration, such as a user using both flags and a file. So changing the semantics randomly in a release constitutes a breaking change of behavior. In terms of the nicest possible way to achieve these semantics, that's up to you guys since you're much more familiar with this code. The current patch seems to just rewind the behavior that was introduced earlier. Marcelo, unless you see some correctness problem with that patch, I'd like to merge it to unblock the release. If you guys think it's way better to do translation on writes than reads, it's fine to propose that in a new patch. was (Author: pwendell): Hey All, No options on which design we chose to implement internally. However, I do feel strongly that the user-facing precedence should not change between versions. It's not reasonable to assume that no user has both the old and new names for a config value. Configuration files can be very long, or there can be multiple sources of configuration, such as a user using both flags and a file. So changing the semantics randomly in a release constitutes a breaking change of behavior. In terms of the nicest possible way to achieve these semantics, that's up to you guys since you're much more familiar with this code. The current patch seems to just rewind the behavior that was introduced earlier. Marcelo, unless you see some correctness problem with that patch, I'd like to merge it to unblock the release. 
If you guys think it's way better to do translation on writes than reads, it's fine to propose that in a new patch. > SparkConf.translateConfKey should translate on get, not set > --- > > Key: SPARK-6048 > URL: https://issues.apache.org/jira/browse/SPARK-6048 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Blocker > > There are several issues with translating on set. > (1) The most serious one is that if the user has both the deprecated and the > latest version of the same config set, then the value picked up by SparkConf > will be arbitrary. Why? Because during initialization of the conf we call > `conf.set` on each property in `sys.props` in an order arbitrarily defined by > Java. As a result, the value of the more recent config may be overridden by > that of the deprecated one. Instead, we should always use the value of the > most recent config. > (2) If we translate on set, then we must keep translating everywhere else. In > fact, the current code does not translate on remove, which means the > following won't work if X is deprecated: > {code} > conf.set(X, Y) > conf.remove(X) // X is not in the conf > {code} > This requires us to also translate in remove and other places, as we already > do for contains, leading to more duplicate code. > (3) Since we call `conf.set` on all configs when initializing the conf, we > print all deprecation warnings in the beginning. Elsewhere in Spark, however, > we warn the user when the deprecated config / option / env var is actually > being used. > We should keep this consistent so the user won't expect to find all > deprecation messages in the beginning of his logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6048) SparkConf.translateConfKey should translate on get, not set
[ https://issues.apache.org/jira/browse/SPARK-6048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339629#comment-14339629 ] Patrick Wendell commented on SPARK-6048: Hey All, No options on which design we chose to implement internally. However, I do feel strongly that the user-facing precedence should not change between versions. It's not reasonable to assume that no user has both the old and new names for a config value. Configuration files can be very long, or there can be multiple sources of configuration, such as a user using both flags and a file. So changing the semantics randomly in a release constitutes a breaking change of behavior. In terms of the nicest possible way to achieve these semantics, that's up to you guys since you're much more familiar with this code. The current patch seems to just rewind the behavior that was introduced earlier. Marcello, unless you see some correctness problem with that patch, I'd like to merge it to unblock the release. If you guys think it's way better to do translation on writes than reads, it's fine to propose that in a new patch. > SparkConf.translateConfKey should translate on get, not set > --- > > Key: SPARK-6048 > URL: https://issues.apache.org/jira/browse/SPARK-6048 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Blocker > > There are several issues with translating on set. > (1) The most serious one is that if the user has both the deprecated and the > latest version of the same config set, then the value picked up by SparkConf > will be arbitrary. Why? Because during initialization of the conf we call > `conf.set` on each property in `sys.props` in an order arbitrarily defined by > Java. As a result, the value of the more recent config may be overridden by > that of the deprecated one. Instead, we should always use the value of the > most recent config. 
> (2) If we translate on set, then we must keep translating everywhere else. In > fact, the current code does not translate on remove, which means the > following won't work if X is deprecated: > {code} > conf.set(X, Y) > conf.remove(X) // X is not in the conf > {code} > This requires us to also translate in remove and other places, as we already > do for contains, leading to more duplicate code. > (3) Since we call `conf.set` on all configs when initializing the conf, we > print all deprecation warnings in the beginning. Elsewhere in Spark, however, > we warn the user when the deprecated config / option / env var is actually > being used. > We should keep this consistent so the user won't expect to find all > deprecation messages in the beginning of his logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
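The translate-on-get semantics the ticket argues for can be captured in a few lines. The sketch below is only an illustration of the idea with made-up key names, not Spark's SparkConf: set and remove store and drop keys verbatim, and deprecated aliases are resolved only when a value is read, so the most recent name always wins regardless of set order, and remove needs no special handling:

```python
# Made-up deprecation table: old key name -> new key name.
DEPRECATED = {"spark.old.key": "spark.new.key"}

class Conf:
    """Minimal sketch of translate-on-get configuration resolution."""

    def __init__(self):
        self._settings = {}

    def set(self, key, value):
        self._settings[key] = value        # stored verbatim, no translation

    def remove(self, key):
        self._settings.pop(key, None)      # works without any translation

    def get(self, key, default=None):
        if key in self._settings:          # the most recent name always wins
            return self._settings[key]
        for old, new in DEPRECATED.items():
            if new == key and old in self._settings:
                return self._settings[old] # fall back to a deprecated alias
        return default
```

Because nothing is rewritten on set, `conf.remove(old_key)` behaves as expected, and setting both the old and new names yields the new name's value no matter which `set` ran last.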
[jira] [Commented] (SPARK-6048) SparkConf.translateConfKey should translate on get, not set
[ https://issues.apache.org/jira/browse/SPARK-6048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339624#comment-14339624 ] Andrew Or commented on SPARK-6048: -- bq. What do you mean by "duplicates the translation" (regarding 2)? It's just a call to "translateKey()". It's not about the number of lines that are being duplicated. It's about the translation logic. Right now it's not correct to translate in set but not in all interfaces exposed by SparkConf. As we have seen with the case of `remove` it's easy to miss one or two of these interfaces. If we only translate in `get` then we don't have to worry about this. bq. Regarding 1, that problem exists regardless of my change. That's actually not true. Before your change, if we specify both the deprecated config and the most recent one, the behavior will be determined by the place where these values are used. Even if we called `set` on the deprecated config over the more recent one, the value of the latter is still preserved because we didn't translate on `set`. To answer your question, the expected behavior is for the value of the more recent config to *always* take precedence. bq. Note the goal of the deprecated configs was to make the Spark code only have to care about the most recent key name. Your proposal goes against that, and would require the deprecated names to live both in SparkConf and in the code that needs to read them. Yes, unfortunately, and I agree it's something we need to fix in the future. My eventual goal is to hide all the deprecation logic throughout the Spark code, and this is why I filed SPARK-5933 before. Currently, however, this is a correctness issue that is blocking the 1.3 release, so my personal opinion is that we should first fix this broken behavior and worry about the code style later. 
> SparkConf.translateConfKey should translate on get, not set > --- > > Key: SPARK-6048 > URL: https://issues.apache.org/jira/browse/SPARK-6048 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Blocker > > There are several issues with translating on set. > (1) The most serious one is that if the user has both the deprecated and the > latest version of the same config set, then the value picked up by SparkConf > will be arbitrary. Why? Because during initialization of the conf we call > `conf.set` on each property in `sys.props` in an order arbitrarily defined by > Java. As a result, the value of the more recent config may be overridden by > that of the deprecated one. Instead, we should always use the value of the > most recent config. > (2) If we translate on set, then we must keep translating everywhere else. In > fact, the current code does not translate on remove, which means the > following won't work if X is deprecated: > {code} > conf.set(X, Y) > conf.remove(X) // X is not in the conf > {code} > This requires us to also translate in remove and other places, as we already > do for contains, leading to more duplicate code. > (3) Since we call `conf.set` on all configs when initializing the conf, we > print all deprecation warnings in the beginning. Elsewhere in Spark, however, > we warn the user when the deprecated config / option / env var is actually > being used. > We should keep this consistent so the user won't expect to find all > deprecation messages in the beginning of his logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5979) `--packages` should not exclude spark streaming assembly jars for kafka and flume
[ https://issues.apache.org/jira/browse/SPARK-5979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339620#comment-14339620 ] Apache Spark commented on SPARK-5979: - User 'brkyvz' has created a pull request for this issue: https://github.com/apache/spark/pull/4802 > `--packages` should not exclude spark streaming assembly jars for kafka and > flume > -- > > Key: SPARK-5979 > URL: https://issues.apache.org/jira/browse/SPARK-5979 > Project: Spark > Issue Type: Bug > Components: Deploy, Spark Submit >Affects Versions: 1.3.0 >Reporter: Burak Yavuz >Priority: Blocker > > Currently `--packages` has an exclude rule for all dependencies with the > groupId `org.apache.spark` assuming that these are packaged inside the > spark-assembly jar. This is not the case and more fine grained filtering is > required. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6032) Move ivy logging to System.err in --packages
[ https://issues.apache.org/jira/browse/SPARK-6032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339621#comment-14339621 ] Apache Spark commented on SPARK-6032: - User 'brkyvz' has created a pull request for this issue: https://github.com/apache/spark/pull/4802 > Move ivy logging to System.err in --packages > > > Key: SPARK-6032 > URL: https://issues.apache.org/jira/browse/SPARK-6032 > Project: Spark > Issue Type: Improvement > Components: Spark Submit >Affects Versions: 1.3.0 >Reporter: Burak Yavuz >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6049) HiveThriftServer2 may expose Inheritable methods
Littlestar created SPARK-6049: - Summary: HiveThriftServer2 may expose Inheritable methods Key: SPARK-6049 URL: https://issues.apache.org/jira/browse/SPARK-6049 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.1 Reporter: Littlestar Priority: Minor Could HiveThriftServer2 expose inheritable methods? HiveThriftServer2 works very well as a JDBC server, but HiveThriftServer2.scala cannot be inherited from or invoked by an application. My app uses JavaSQLContext and registerTempTable, and I want to expose these temp tables through HiveThriftServer2 (the JDBC server). Thanks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5537) Expand user guide for multinomial logistic regression
[ https://issues.apache.org/jira/browse/SPARK-5537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339602#comment-14339602 ] Apache Spark commented on SPARK-5537: - User 'dbtsai' has created a pull request for this issue: https://github.com/apache/spark/pull/4801 > Expand user guide for multinomial logistic regression > - > > Key: SPARK-5537 > URL: https://issues.apache.org/jira/browse/SPARK-5537 > Project: Spark > Issue Type: Documentation > Components: Documentation, MLlib >Reporter: Xiangrui Meng >Assignee: DB Tsai > > We probably don't need to work out the math in the user guide. We can point > users to wikipedia for details and focus on the public APIs and how to use it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6048) SparkConf.translateConfKey should translate on get, not set
[ https://issues.apache.org/jira/browse/SPARK-6048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339591#comment-14339591 ] Marcelo Vanzin commented on SPARK-6048: --- What do you mean by "duplicates the translation" (regarding 2)? It's just a call to "translateKey()". Regarding 1, that problem exists regardless of my change. You need to specify some precedence order. See the case in FsHistoryProvider, where there are 2 (!) deprecated keys for the same config. What if the user sets those two deprecated keys in the conf? What's the expectation? Perhaps you need to enforce some sort of ordering for the deprecated keys in SparkConf, but I don't see why that means translating on get and not on set. Note the goal of the deprecated configs was to make the Spark code only have to care about the most recent key name. Your proposal goes against that, and would require the deprecated names to live both in SparkConf and in the code that needs to read them. > SparkConf.translateConfKey should translate on get, not set > --- > > Key: SPARK-6048 > URL: https://issues.apache.org/jira/browse/SPARK-6048 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Blocker > > There are several issues with translating on set. > (1) The most serious one is that if the user has both the deprecated and the > latest version of the same config set, then the value picked up by SparkConf > will be arbitrary. Why? Because during initialization of the conf we call > `conf.set` on each property in `sys.props` in an order arbitrarily defined by > Java. As a result, the value of the more recent config may be overridden by > that of the deprecated one. Instead, we should always use the value of the > most recent config. > (2) If we translate on set, then we must keep translating everywhere else. 
In > fact, the current code does not translate on remove, which means the > following won't work if X is deprecated: > {code} > conf.set(X, Y) > conf.remove(X) // X is not in the conf > {code} > This requires us to also translate in remove and other places, as we already > do for contains, leading to more duplicate code. > (3) Since we call `conf.set` on all configs when initializing the conf, we > print all deprecation warnings in the beginning. Elsewhere in Spark, however, > we warn the user when the deprecated config / option / env var is actually > being used. > We should keep this consistent so the user won't expect to find all > deprecation messages in the beginning of his logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5771) Number of Cores in Completed Applications of Standalone Master Web Page always be 0 if sc.stop() is called
[ https://issues.apache.org/jira/browse/SPARK-5771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339590#comment-14339590 ] Apache Spark commented on SPARK-5771: - User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/4800 > Number of Cores in Completed Applications of Standalone Master Web Page > always be 0 if sc.stop() is called > -- > > Key: SPARK-5771 > URL: https://issues.apache.org/jira/browse/SPARK-5771 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.2.1 >Reporter: Liangliang Gu >Assignee: Liangliang Gu >Priority: Minor > Fix For: 1.4.0 > > > In Standalone mode, the number of cores in Completed Applications of the > Master Web Page will always be zero if sc.stop() is called, > but will always be correct if sc.stop() is not called. > The likely reason: > after sc.stop() is called, the function removeExecutor of class > ApplicationInfo is called, thus reducing the variable coresGranted to > zero. The variable coresGranted is used to display the number of cores on > the web page. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
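The report above boils down to a single shared counter. The toy model below (illustrative names only, not Spark's ApplicationInfo) reproduces the symptom: decrementing coresGranted for each executor torn down during sc.stop() leaves the completed-applications record showing 0 cores.

```python
class ApplicationInfo:
    """Toy model of the reported bug: one counter backs both the
    running view and the completed-applications page."""

    def __init__(self):
        self.cores_granted = 0    # the value the web page displays

    def add_executor(self, cores):
        self.cores_granted += cores

    def remove_executor(self, cores):
        self.cores_granted -= cores   # invoked per executor on sc.stop()

app = ApplicationInfo()
app.add_executor(4)
app.add_executor(4)           # application runs with 8 granted cores
app.remove_executor(4)
app.remove_executor(4)        # sc.stop(): the page now shows 0, not 8
```

One natural remedy would be to snapshot the granted cores when the application completes, before executors are removed; whether the linked pull request takes that approach or another is not stated here.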
[jira] [Assigned] (SPARK-4897) Python 3 support
[ https://issues.apache.org/jira/browse/SPARK-4897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-4897: - Assignee: Davies Liu > Python 3 support > > > Key: SPARK-4897 > URL: https://issues.apache.org/jira/browse/SPARK-4897 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Josh Rosen >Assignee: Davies Liu >Priority: Minor > > It would be nice to have Python 3 support in PySpark, provided that we can do > it in a way that maintains backwards-compatibility with Python 2.6. > I started looking into porting this; my WIP work can be found at > https://github.com/JoshRosen/spark/compare/python3 > I was able to use the > [futurize|http://python-future.org/futurize.html#forwards-conversion-stage1] > tool to handle the basic conversion of things like {{print}} statements, etc. > and had to manually fix up a few imports for packages that moved / were > renamed, but the major blocker that I hit was {{cloudpickle}}: > {code} > [joshrosen python (python3)]$ PYSPARK_PYTHON=python3 ../bin/pyspark > Python 3.4.2 (default, Oct 19 2014, 17:52:17) > [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.51)] on darwin > Type "help", "copyright", "credits" or "license" for more information. 
> Traceback (most recent call last): > File "/Users/joshrosen/Documents/Spark/python/pyspark/shell.py", line 28, > in > import pyspark > File "/Users/joshrosen/Documents/spark/python/pyspark/__init__.py", line > 41, in > from pyspark.context import SparkContext > File "/Users/joshrosen/Documents/spark/python/pyspark/context.py", line 26, > in > from pyspark import accumulators > File "/Users/joshrosen/Documents/spark/python/pyspark/accumulators.py", > line 97, in > from pyspark.cloudpickle import CloudPickler > File "/Users/joshrosen/Documents/spark/python/pyspark/cloudpickle.py", line > 120, in > class CloudPickler(pickle.Pickler): > File "/Users/joshrosen/Documents/spark/python/pyspark/cloudpickle.py", line > 122, in CloudPickler > dispatch = pickle.Pickler.dispatch.copy() > AttributeError: type object '_pickle.Pickler' has no attribute 'dispatch' > {code} > This code looks like it will be difficult to port to Python 3, so this > might be a good reason to switch to > [Dill|https://github.com/uqfoundation/dill] for Python serialization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
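The AttributeError in the traceback arises because CPython 3's accelerated pickle.Pickler (implemented in the _pickle C module) exposes no class-level dispatch dict for cloudpickle to copy. Independent of the Dill question, Python 3's supported customization hook is a per-pickler dispatch_table (see the copyreg module); the Widget class below is purely hypothetical and only illustrates that hook:

```python
import copyreg
import io
import pickle

class Widget:
    """Hypothetical class standing in for one that needs a custom reducer."""
    def __init__(self, name):
        self.name = name

def reduce_widget(w):
    # A reduction function: (callable, args) used to rebuild the object.
    return (Widget, (w.name,))

buf = io.BytesIO()
p = pickle.Pickler(buf)
# Per-instance dispatch_table: the Python 3 way to register custom
# reducers without copying a class-level dispatch dict.
p.dispatch_table = copyreg.dispatch_table.copy()
p.dispatch_table[Widget] = reduce_widget

p.dump(Widget("w1"))
restored = pickle.loads(buf.getvalue())   # rebuilt via reduce_widget
```

This avoids the `pickle.Pickler.dispatch.copy()` pattern entirely, which is one direction a Python 3 port of cloudpickle could take.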
[jira] [Commented] (SPARK-5950) Insert array into a metastore table saved as parquet should work when using datasource api
[ https://issues.apache.org/jira/browse/SPARK-5950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339557#comment-14339557 ] Liang-Chi Hsieh commented on SPARK-5950: Let me explain on github pull. > Insert array into a metastore table saved as parquet should work when using > datasource api > -- > > Key: SPARK-5950 > URL: https://issues.apache.org/jira/browse/SPARK-5950 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Liang-Chi Hsieh >Priority: Blocker > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-4579) Scheduling Delay appears negative
[ https://issues.apache.org/jira/browse/SPARK-4579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-4579. Resolution: Fixed Fix Version/s: 1.2.2 1.3.0 Assignee: Sean Owen (was: Andrew Or) Target Version/s: 1.3.0, 1.2.2 > Scheduling Delay appears negative > - > > Key: SPARK-4579 > URL: https://issues.apache.org/jira/browse/SPARK-4579 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.2.0 >Reporter: Arun Ahuja >Assignee: Sean Owen > Fix For: 1.3.0, 1.2.2 > > > !https://cloud.githubusercontent.com/assets/455755/5174438/23d08604-73ff-11e4-9a76-97233b610544.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6048) SparkConf.translateConfKey should translate on get, not set
[ https://issues.apache.org/jira/browse/SPARK-6048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339544#comment-14339544 ] Andrew Or commented on SPARK-6048: -- [~vanzin] Note that (3) is orthogonal to this change. We can still do all the warnings at the beginning rather than later. However I still don't see why warnings should necessarily be tied to when keys are set, though that is a separate discussion. For (2), yes we can just fix remove(), but doing so means duplicating the translation and keeping track of one more place where the translation takes place. In the future if we add more methods to SparkConf, we'll have to remember to do the translation otherwise it won't work correctly. I personally find limiting the scope of translation much cleaner. (1) Maybe it's unlikely, but it breaks existing user behavior in a confounding way nevertheless. When it fails it will be extremely difficult to debug which value is used without doing some querying of the conf itself. > SparkConf.translateConfKey should translate on get, not set > --- > > Key: SPARK-6048 > URL: https://issues.apache.org/jira/browse/SPARK-6048 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Blocker > > There are several issues with translating on set. > (1) The most serious one is that if the user has both the deprecated and the > latest version of the same config set, then the value picked up by SparkConf > will be arbitrary. Why? Because during initialization of the conf we call > `conf.set` on each property in `sys.props` in an order arbitrarily defined by > Java. As a result, the value of the more recent config may be overridden by > that of the deprecated one. Instead, we should always use the value of the > most recent config. > (2) If we translate on set, then we must keep translating everywhere else. 
In > fact, the current code does not translate on remove, which means the > following won't work if X is deprecated: > {code} > conf.set(X, Y) > conf.remove(X) // X is not in the conf > {code} > This requires us to also translate in remove and other places, as we already > do for contains, leading to more duplicate code. > (3) Since we call `conf.set` on all configs when initializing the conf, we > print all deprecation warnings in the beginning. Elsewhere in Spark, however, > we warn the user when the deprecated config / option / env var is actually > being used. > We should keep this consistent so the user won't expect to find all > deprecation messages in the beginning of his logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5950) Insert array into a metastore table saved as parquet should work when using datasource api
[ https://issues.apache.org/jira/browse/SPARK-5950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339542#comment-14339542 ] Yin Huai commented on SPARK-5950: - If that is just part of the problem, would you mind explaining what the whole problem is? > Insert array into a metastore table saved as parquet should work when using > datasource api > -- > > Key: SPARK-5950 > URL: https://issues.apache.org/jira/browse/SPARK-5950 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Liang-Chi Hsieh >Priority: Blocker > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5950) Insert array into a metastore table saved as parquet should work when using datasource api
[ https://issues.apache.org/jira/browse/SPARK-5950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339534#comment-14339534 ] Liang-Chi Hsieh commented on SPARK-5950: Yes. That is just part of the problem. containsNull/valueContainsNull is most of the problem; nullable should not be an issue. > Insert array into a metastore table saved as parquet should work when using > datasource api > -- > > Key: SPARK-5950 > URL: https://issues.apache.org/jira/browse/SPARK-5950 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Liang-Chi Hsieh >Priority: Blocker > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
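To make the containsNull point concrete, here is a toy model in plain Python (not Spark's type classes): two array types that differ only in containsNull compare unequal, yet writing non-nullable elements into a nullable column is safe, so an insert check needs a one-directional compatibility test rather than strict schema equality.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ArrayType:
    """Toy stand-in for an array column type; not Spark's ArrayType."""
    element_type: str
    contains_null: bool

def write_compatible(src, dest):
    # Writing is safe iff element types match and the destination is
    # at least as permissive about nulls as the source.
    return (src.element_type == dest.element_type
            and (dest.contains_null or not src.contains_null))

# Mirrors the ticket: data inferred with containsNull=false, table
# declared with containsNull=true.
data_schema = ArrayType("struct<field1:int,field2:int>", contains_null=False)
table_schema = ArrayType("struct<field1:int,field2:int>", contains_null=True)
```

Strict equality rejects this pair even though the insert is safe; the reverse direction (nullable data into a non-nullable column) is correctly rejected by the compatibility check.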
[jira] [Updated] (SPARK-5508) Arrays and Maps stored with Hive Parquet Serde may not be able to read by the Parquet support in the Data Souce API
[ https://issues.apache.org/jira/browse/SPARK-5508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-5508: Summary: Arrays and Maps stored with Hive Parquet Serde may not be able to read by the Parquet support in the Data Souce API (was: [hive context] java.lang.IndexOutOfBoundsException: Index: 0, Size: 0) > Arrays and Maps stored with Hive Parquet Serde may not be able to read by the > Parquet support in the Data Souce API > --- > > Key: SPARK-5508 > URL: https://issues.apache.org/jira/browse/SPARK-5508 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.1 > Environment: mesos, cdh >Reporter: Ayoub Benali > Labels: hivecontext, parquet > > When the table is saved as parquet, we cannot query a field which is an array > of struct after an INSERT statement, like show bellow: > {noformat} > scala> val data1="""{ > | "timestamp": 1422435598, > | "data_array": [ > | { > | "field1": 1, > | "field2": 2 > | } > | ] > | }""" > scala> val data2="""{ > | "timestamp": 1422435598, > | "data_array": [ > | { > | "field1": 3, > | "field2": 4 > | } > | ] > scala> val jsonRDD = sc.makeRDD(data1 :: data2 :: Nil) > scala> val rdd = hiveContext.jsonRDD(jsonRDD) > scala> rdd.printSchema > root > |-- data_array: array (nullable = true) > ||-- element: struct (containsNull = false) > |||-- field1: integer (nullable = true) > |||-- field2: integer (nullable = true) > |-- timestamp: integer (nullable = true) > scala> rdd.registerTempTable("tmp_table") > scala> hiveContext.sql("select data.field1 from tmp_table LATERAL VIEW > explode(data_array) nestedStuff AS data").collect > res3: Array[org.apache.spark.sql.Row] = Array([1], [3]) > scala> hiveContext.sql("SET hive.exec.dynamic.partition = true") > scala> hiveContext.sql("SET hive.exec.dynamic.partition.mode = nonstrict") > scala> hiveContext.sql("set parquet.compression=GZIP") > scala> hiveContext.setConf("spark.sql.parquet.binaryAsString", "true") > scala> 
hiveContext.sql("create external table if not exists > persisted_table(data_array ARRAY<STRUCT<field1: INT, field2: INT>>, > timestamp INT) STORED AS PARQUET Location 'hdfs:///test_table'") > scala> hiveContext.sql("insert into table persisted_table select * from > tmp_table").collect > scala> hiveContext.sql("select data.field1 from persisted_table LATERAL VIEW > explode(data_array) nestedStuff AS data").collect > parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in > file hdfs://*/test_table/part-1 > at > parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213) > at > parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204) > at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) > at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) > at scala.collection.AbstractIterator.to(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) > at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) > at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:797) > at 
org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:797) > at > org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1353) > at > org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1353) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) > at org.apache.spark.scheduler.Task.run(Task.scala:56) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200) > at > java.util.concurrent.ThreadPoolExecutor.ru
[jira] [Updated] (SPARK-5950) Insert array into a metastore table saved as parquet should work when using datasource api
[ https://issues.apache.org/jira/browse/SPARK-5950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-5950: Priority: Blocker (was: Major) > Insert array into a metastore table saved as parquet should work when using > datasource api > -- > > Key: SPARK-5950 > URL: https://issues.apache.org/jira/browse/SPARK-5950 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Liang-Chi Hsieh >Priority: Blocker > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5508) [hive context] java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
[ https://issues.apache.org/jira/browse/SPARK-5508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-5508: Target Version/s: 1.3.0
> [hive context] java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
> -
>
> Key: SPARK-5508
> URL: https://issues.apache.org/jira/browse/SPARK-5508
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.2.1
> Environment: mesos, cdh
> Reporter: Ayoub Benali
> Labels: hivecontext, parquet
>
> When the table is saved as parquet, we cannot query a field which is an array of struct after an INSERT statement, as shown below:
> {noformat}
> scala> val data1="""{
>      | "timestamp": 1422435598,
>      | "data_array": [
>      | {
>      | "field1": 1,
>      | "field2": 2
>      | }
>      | ]
>      | }"""
> scala> val data2="""{
>      | "timestamp": 1422435598,
>      | "data_array": [
>      | {
>      | "field1": 3,
>      | "field2": 4
>      | }
>      | ]
>      | }"""
> scala> val jsonRDD = sc.makeRDD(data1 :: data2 :: Nil)
> scala> val rdd = hiveContext.jsonRDD(jsonRDD)
> scala> rdd.printSchema
> root
>  |-- data_array: array (nullable = true)
>  |    |-- element: struct (containsNull = false)
>  |    |    |-- field1: integer (nullable = true)
>  |    |    |-- field2: integer (nullable = true)
>  |-- timestamp: integer (nullable = true)
> scala> rdd.registerTempTable("tmp_table")
> scala> hiveContext.sql("select data.field1 from tmp_table LATERAL VIEW explode(data_array) nestedStuff AS data").collect
> res3: Array[org.apache.spark.sql.Row] = Array([1], [3])
> scala> hiveContext.sql("SET hive.exec.dynamic.partition = true")
> scala> hiveContext.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
> scala> hiveContext.sql("set parquet.compression=GZIP")
> scala> hiveContext.setConf("spark.sql.parquet.binaryAsString", "true")
> scala> hiveContext.sql("create external table if not exists persisted_table(data_array ARRAY<STRUCT<field1: INT, field2: INT>>, timestamp INT) STORED AS PARQUET Location 'hdfs:///test_table'")
> scala> hiveContext.sql("insert into table persisted_table select * from tmp_table").collect
> scala> hiveContext.sql("select data.field1 from persisted_table LATERAL VIEW explode(data_array) nestedStuff AS data").collect
> parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file hdfs://*/test_table/part-1
> at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
> at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
> at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145)
> at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
> at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
> at scala.collection.AbstractIterator.to(Iterator.scala:1157)
> at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
> at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
> at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:797)
> at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:797)
> at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1353)
> at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1353)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
> at org.apache.spark.scheduler.Task.run(Task.scala:56)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:744)
> Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
> at java.util.ArrayList.rangeChe
[jira] [Commented] (SPARK-5950) Insert array into a metastore table saved as parquet should work when using datasource api
[ https://issues.apache.org/jira/browse/SPARK-5950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339530#comment-14339530 ] Yin Huai commented on SPARK-5950: - OK. Now, I understand what's going on. For this JIRA, the table is a MetastoreRelation and we are trying to use data source API's write path to insert into it. For a MetastoreRelation, containsNull, valueContainsNull and nullable will always be true. When we try to insert into this table through the data source write path, if any of containsNull/valueContainsNull/nullable is false, InsertIntoTable will not be resolved because of the nullability issue. > Insert array into a metastore table saved as parquet should work when using > datasource api > -- > > Key: SPARK-5950 > URL: https://issues.apache.org/jira/browse/SPARK-5950 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Liang-Chi Hsieh >
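The diagnosis above can be sketched in a few lines of self-contained Scala. The types below are simplified, hypothetical stand-ins for illustration, not Spark's actual `DataType` classes: the point is that an insert should resolve whenever the target schema is at least as nullable as the source schema, so a MetastoreRelation that reports `containsNull = true` ought to accept a source whose element type has `containsNull = false`.

```scala
// Simplified stand-ins for Spark SQL's DataType hierarchy (illustration only,
// not the real API).
sealed trait DType
case object IntType extends DType
case class StructField(name: String, dataType: DType, nullable: Boolean)
case class StructType(fields: Seq[StructField]) extends DType
case class ArrayType(elementType: DType, containsNull: Boolean) extends DType

// A write resolves when the target accepts everything the source can produce:
// a non-nullable source may go into a nullable target, but not the reverse.
def writeCompatible(source: DType, target: DType): Boolean = (source, target) match {
  case (IntType, IntType) => true
  case (ArrayType(s, sNull), ArrayType(t, tNull)) =>
    (tNull || !sNull) && writeCompatible(s, t)
  case (StructType(sf), StructType(tf)) =>
    sf.length == tf.length && sf.zip(tf).forall { case (s, t) =>
      s.name == t.name && (t.nullable || !s.nullable) &&
        writeCompatible(s.dataType, t.dataType)
    }
  case _ => false
}

// A MetastoreRelation reports containsNull/nullable = true for everything...
val metastoreSide =
  ArrayType(StructType(Seq(StructField("field1", IntType, nullable = true))), containsNull = true)
// ...while the jsonRDD schema in this report has containsNull = false.
val sourceSide =
  ArrayType(StructType(Seq(StructField("field1", IntType, nullable = false))), containsNull = false)

// The stricter source fits into the looser metastore schema, so the insert
// ought to resolve, even though strict schema equality does not hold.
assert(writeCompatible(sourceSide, metastoreSide))
assert(sourceSide != metastoreSide)
```

A resolution check based on `writeCompatible` rather than schema equality is one way to make the insert above resolvable; the real fix in Spark may differ in detail.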
[jira] [Reopened] (SPARK-5508) [hive context] java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
[ https://issues.apache.org/jira/browse/SPARK-5508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai reopened SPARK-5508: - I am reopening it since it is different from SPARK-5950.
> [hive context] java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
> -
>
> Key: SPARK-5508
> URL: https://issues.apache.org/jira/browse/SPARK-5508
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.2.1
> Environment: mesos, cdh
> Reporter: Ayoub Benali
> Labels: hivecontext, parquet
[jira] [Updated] (SPARK-5950) Insert array into a metastore table saved as parquet should work when using datasource api
[ https://issues.apache.org/jira/browse/SPARK-5950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-5950: Summary: Insert array into a metastore table saved as parquet should work when using datasource api (was: Insert array into a metastore table should work when using datasource api) > Insert array into a metastore table saved as parquet should work when using > datasource api > -- > > Key: SPARK-5950 > URL: https://issues.apache.org/jira/browse/SPARK-5950 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Liang-Chi Hsieh >
[jira] [Updated] (SPARK-5950) Insert array into a metastore table should work when using datasource api
[ https://issues.apache.org/jira/browse/SPARK-5950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-5950: Summary: Insert array into a metastore table should work when using datasource api (was: Insert array into a metastore table saved as parquet should work when using datasource api) > Insert array into a metastore table should work when using datasource api > - > > Key: SPARK-5950 > URL: https://issues.apache.org/jira/browse/SPARK-5950 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Liang-Chi Hsieh >
[jira] [Updated] (SPARK-6048) SparkConf.translateConfKey should translate on get, not set
[ https://issues.apache.org/jira/browse/SPARK-6048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-6048: - Priority: Blocker (was: Critical) > SparkConf.translateConfKey should translate on get, not set > --- > > Key: SPARK-6048 > URL: https://issues.apache.org/jira/browse/SPARK-6048 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Blocker > > There are several issues with translating on set. > (1) The most serious one is that if the user has both the deprecated and the > latest version of the same config set, then the value picked up by SparkConf > will be arbitrary. Why? Because during initialization of the conf we call > `conf.set` on each property in `sys.props` in an order arbitrarily defined by > Java. As a result, the value of the more recent config may be overridden by > that of the deprecated one. Instead, we should always use the value of the > most recent config. > (2) If we translate on set, then we must keep translating everywhere else. In > fact, the current code does not translate on remove, which means the > following won't work if X is deprecated: > {code} > conf.set(X, Y) > conf.remove(X) // X is not in the conf > {code} > This requires us to also translate in remove and other places, as we already > do for contains, leading to more duplicate code. > (3) Since we call `conf.set` on all configs when initializing the conf, we > print all deprecation warnings in the beginning. Elsewhere in Spark, however, > we warn the user when the deprecated config / option / env var is actually > being used. > We should keep this consistent so the user won't expect to find all > deprecation messages in the beginning of his logs.
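The translate-on-get approach proposed above can be sketched in a few lines of plain Scala. The key names are hypothetical and this is illustrative code, not SparkConf's actual implementation: `set` and `remove` store keys verbatim, and only `get` consults the deprecation table, so the current key's value always wins regardless of the order in which `sys.props` entries were applied.

```scala
import scala.collection.mutable

// Sketch of "translate on get" with a hypothetical deprecated key
// (illustrative only, not SparkConf's real code).
object ConfSketch {
  // deprecated key -> current key
  private val deprecated = Map("spark.old.timeout" -> "spark.network.timeout")

  private val settings = mutable.HashMap[String, String]()

  // set() and remove() store/drop keys verbatim: no translation logic needed,
  // and no deprecation warnings fire while sys.props is being copied in.
  def set(key: String, value: String): Unit = settings(key) = value
  def remove(key: String): Unit = settings.remove(key)

  // get() prefers the current key and only then falls back to a deprecated
  // alias; this is also the natural place to log a deprecation warning.
  def get(key: String): Option[String] = {
    val aliases = deprecated.collect { case (oldKey, curKey) if curKey == key => oldKey }
    settings.get(key).orElse(aliases.flatMap(k => settings.get(k)).headOption)
  }
}

// Only the deprecated key set: get() falls back to it.
ConfSketch.set("spark.old.timeout", "60s")
assert(ConfSketch.get("spark.network.timeout") == Some("60s"))

// Both keys set (in either order): the current key wins, never an arbitrary one.
ConfSketch.set("spark.network.timeout", "120s")
assert(ConfSketch.get("spark.network.timeout") == Some("120s"))

// remove() works without any special casing for deprecated keys.
ConfSketch.remove("spark.old.timeout")
assert(ConfSketch.get("spark.network.timeout") == Some("120s"))
```

This also addresses point (2): because nothing is rewritten on the way in, `conf.set(X, Y)` followed by `conf.remove(X)` behaves as expected even when X is deprecated.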
[jira] [Updated] (SPARK-5950) Insert array into a metastore table saved as parquet should work when using datasource api
[ https://issues.apache.org/jira/browse/SPARK-5950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-5950: Summary: Insert array into a metastore table saved as parquet should work when using datasource api (was: Insert array into table saved as parquet should work when using datasource api) > Insert array into a metastore table saved as parquet should work when using > datasource api > -- > > Key: SPARK-5950 > URL: https://issues.apache.org/jira/browse/SPARK-5950 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Liang-Chi Hsieh >
[jira] [Updated] (SPARK-5950) Insert array into table saved as parquet should work when using datasource api
[ https://issues.apache.org/jira/browse/SPARK-5950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-5950: Summary: Insert array into table saved as parquet should work when using datasource api (was: Arrays and Maps stored with Hive Parquet Serde may not be able to read by the Parquet support in the Data Souce API ) > Insert array into table saved as parquet should work when using datasource api > -- > > Key: SPARK-5950 > URL: https://issues.apache.org/jira/browse/SPARK-5950 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Liang-Chi Hsieh >
[jira] [Commented] (SPARK-6048) SparkConf.translateConfKey should translate on get, not set
[ https://issues.apache.org/jira/browse/SPARK-6048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339519#comment-14339519 ] Marcelo Vanzin commented on SPARK-6048: --- I sort of agree with (1). But I think it's both unlikely (users will probably use the old option or the new one, but not both), and probably sort of fixable (but not optimally). Basically, don't override a value that's already set when using the deprecated key. I disagree with (2). Just fix remove(). I also disagree with (3), and it's not even the correct interpretation of what happens. Warnings *only* happen when the configuration keys are set, never when reading. And I think it's actually a good thing that all (or most) of the warnings show up when creating the conf object, which generally happens early in the app's life. It means it's easier to see them. > SparkConf.translateConfKey should translate on get, not set > --- > > Key: SPARK-6048 > URL: https://issues.apache.org/jira/browse/SPARK-6048 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Critical > > There are several issues with translating on set. > (1) The most serious one is that if the user has both the deprecated and the > latest version of the same config set, then the value picked up by SparkConf > will be arbitrary. Why? Because during initialization of the conf we call > `conf.set` on each property in `sys.props` in an order arbitrarily defined by > Java. As a result, the value of the more recent config may be overridden by > that of the deprecated one. Instead, we should always use the value of the > most recent config. > (2) If we translate on set, then we must keep translating everywhere else. In > fact, the current code does not translate on remove, which means the > following won't work if X is deprecated: > {code} > conf.set(X, Y) > conf.remove(X) // X is not in the conf > {code} > This requires us to also translate in remove and other places, as we already > do for contains, leading to more duplicate code. > (3) Since we call `conf.set` on all configs when initializing the conf, we > print all deprecation warnings in the beginning. Elsewhere in Spark, however, > we warn the user when the deprecated config / option / env var is actually > being used. > We should keep this consistent so the user won't expect to find all > deprecation messages in the beginning of his logs.
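The partial set-side fix described in this comment for (1) can be sketched as follows, with hypothetical key names (illustrative code, not SparkConf's real implementation): keep translating in `set`, but never let a value arriving under a deprecated key clobber one already stored under the current key.

```scala
import scala.collection.mutable

// Hypothetical deprecated-key table; illustrative only.
val deprecated = Map("spark.old.timeout" -> "spark.network.timeout")
val settings = mutable.HashMap[String, String]()

def set(key: String, value: String): Unit = deprecated.get(key) match {
  // Deprecated key: translate, but do not override a value already
  // stored under the current key.
  case Some(currentKey) if !settings.contains(currentKey) => settings(currentKey) = value
  case Some(_) => () // current key already set: the deprecated value loses
  case None => settings(key) = value
}

// Whichever order the two sys.props entries are applied in,
// the current key's value survives:
set("spark.old.timeout", "60s")
set("spark.network.timeout", "120s")
assert(settings("spark.network.timeout") == "120s")
```

As the comment concedes, this is only "sort of" a fix: a deprecated key set deliberately *after* the current key still loses, which is the non-optimal part.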
[jira] [Updated] (SPARK-6048) SparkConf.translateConfKey should translate on get, not set
[ https://issues.apache.org/jira/browse/SPARK-6048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-6048: - Description: There are several issues with translating on set. (1) The most serious one is that if the user has both the deprecated and the latest version of the same config set, then the value picked up by SparkConf will be arbitrary. Why? Because during initialization of the conf we call `conf.set` on each property in `sys.props` in an order arbitrarily defined by Java. As a result, the value of the more recent config may be overridden by that of the deprecated one. Instead, we should always use the value of the most recent config. (2) If we translate on set, then we must keep translating everywhere else. In fact, the current code does not translate on remove, which means the following won't work if X is deprecated: {code} conf.set(X, Y) conf.remove(X) // X is not in the conf {code} This requires us to also translate in remove and other places, as we already do for contains, leading to more duplicate code. (3) Since we call `conf.set` on all configs when initializing the conf, we print all deprecation warnings in the beginning. Elsewhere in Spark, however, we warn the user when the deprecated config / option / env var is actually being used. We should keep this consistent so the user won't expect to find all deprecation messages in the beginning of his logs. was: There are several issues with translating on set. (1) The most serious one is that if the user has both the deprecated and the latest version of the same config set, then the value picked up by SparkConf will be arbitrary. Why? Because during initialization of the conf we call `conf.set` on each property in `sys.props` in an order arbitrarily defined by Java. Instead, we should always use the value of the latest version of the config if that is provided. (2) If we translate on set, then we must keep translating everywhere else. 
In fact, the current code does not translate on remove, which means the following won't work if X is deprecated: {code} conf.set(X, Y) conf.remove(X) // X is not in the conf {code} This requires us to also translate in remove and other places, as we already do for contains, leading to more duplicate code. (3) Since we call `conf.set` on all configs when initializing the conf, we print all deprecation warnings in the beginning. Elsewhere in Spark, however, we warn the user when the deprecated config / option / env var is actually being used. We should keep this consistent so the user won't expect to find all deprecation messages in the beginning of his logs. > SparkConf.translateConfKey should translate on get, not set > --- > > Key: SPARK-6048 > URL: https://issues.apache.org/jira/browse/SPARK-6048 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Critical > > There are several issues with translating on set. > (1) The most serious one is that if the user has both the deprecated and the > latest version of the same config set, then the value picked up by SparkConf > will be arbitrary. Why? Because during initialization of the conf we call > `conf.set` on each property in `sys.props` in an order arbitrarily defined by > Java. As a result, the value of the more recent config may be overridden by > that of the deprecated one. Instead, we should always use the value of the > most recent config. > (2) If we translate on set, then we must keep translating everywhere else. In > fact, the current code does not translate on remove, which means the > following won't work if X is deprecated: > {code} > conf.set(X, Y) > conf.remove(X) // X is not in the conf > {code} > This requires us to also translate in remove and other places, as we already > do for contains, leading to more duplicate code. 
> (3) Since we call `conf.set` on all configs when initializing the conf, we > print all deprecation warnings in the beginning. Elsewhere in Spark, however, > we warn the user when the deprecated config / option / env var is actually > being used. > We should keep this consistent so the user won't expect to find all > deprecation messages in the beginning of his logs.
[jira] [Commented] (SPARK-6048) SparkConf.translateConfKey should translate on get, not set
[ https://issues.apache.org/jira/browse/SPARK-6048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339498#comment-14339498 ] Apache Spark commented on SPARK-6048: - User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/4799 > SparkConf.translateConfKey should translate on get, not set > --- > > Key: SPARK-6048 > URL: https://issues.apache.org/jira/browse/SPARK-6048 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Critical > > There are several issues with translating on set. > (1) The most serious one is that if the user has both the deprecated and the > latest version of the same config set, then the value picked up by SparkConf > will be arbitrary. Why? Because during initialization of the conf we call > `conf.set` on each property in `sys.props` in an order arbitrarily defined by > Java. Instead, we should always use the value of the latest version of the > config if that is provided. > (2) If we translate on set, then we must keep translating everywhere else. In > fact, the current code does not translate on remove, which means the > following won't work if X is deprecated: > {code} > conf.set(X, Y) > conf.remove(X) // X is not in the conf > {code} > (3) Since we call `conf.set` on all configs when initializing the conf, we > print all deprecation warnings in the beginning. Elsewhere in Spark, however, > we warn the user when the deprecated config / option / env var is actually > being used. > We should keep this consistent so the user won't expect to find all > deprecation messages in the beginning of his logs.
[jira] [Updated] (SPARK-6048) SparkConf.translateConfKey should translate on get, not set
[ https://issues.apache.org/jira/browse/SPARK-6048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-6048: - Description: There are several issues with translating on set. (1) The most serious one is that if the user has both the deprecated and the latest version of the same config set, then the value picked up by SparkConf will be arbitrary. Why? Because during initialization of the conf we call `conf.set` on each property in `sys.props` in an order arbitrarily defined by Java. Instead, we should always use the value of the latest version of the config if that is provided. (2) If we translate on set, then we must keep translating everywhere else. In fact, the current code does not translate on remove, which means the following won't work if X is deprecated: {code} conf.set(X, Y) conf.remove(X) // X is not in the conf {code} This requires us to also translate in remove and other places, as we already do for contains, leading to more duplicate code. (3) Since we call `conf.set` on all configs when initializing the conf, we print all deprecation warnings in the beginning. Elsewhere in Spark, however, we warn the user when the deprecated config / option / env var is actually being used. We should keep this consistent so the user won't expect to find all deprecation messages in the beginning of his logs. was: There are several issues with translating on set. (1) The most serious one is that if the user has both the deprecated and the latest version of the same config set, then the value picked up by SparkConf will be arbitrary. Why? Because during initialization of the conf we call `conf.set` on each property in `sys.props` in an order arbitrarily defined by Java. Instead, we should always use the value of the latest version of the config if that is provided. (2) If we translate on set, then we must keep translating everywhere else. 
In fact, the current code does not translate on remove, which means the following won't work if X is deprecated: {code} conf.set(X, Y) conf.remove(X) // X is not in the conf {code} (3) Since we call `conf.set` on all configs when initializing the conf, we print all deprecation warnings in the beginning. Elsewhere in Spark, however, we warn the user when the deprecated config / option / env var is actually being used. We should keep this consistent so the user won't expect to find all deprecation messages in the beginning of his logs. > SparkConf.translateConfKey should translate on get, not set > --- > > Key: SPARK-6048 > URL: https://issues.apache.org/jira/browse/SPARK-6048 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Critical > > There are several issues with translating on set. > (1) The most serious one is that if the user has both the deprecated and the > latest version of the same config set, then the value picked up by SparkConf > will be arbitrary. Why? Because during initialization of the conf we call > `conf.set` on each property in `sys.props` in an order arbitrarily defined by > Java. Instead, we should always use the value of the latest version of the > config if that is provided. > (2) If we translate on set, then we must keep translating everywhere else. In > fact, the current code does not translate on remove, which means the > following won't work if X is deprecated: > {code} > conf.set(X, Y) > conf.remove(X) // X is not in the conf > {code} > This requires us to also translate in remove and other places, as we already > do for contains, leading to more duplicate code. > (3) Since we call `conf.set` on all configs when initializing the conf, we > print all deprecation warnings in the beginning. Elsewhere in Spark, however, > we warn the user when the deprecated config / option / env var is actually > being used. 
> We should keep this consistent so the user won't expect to find all > deprecation messages in the beginning of his logs.
[jira] [Created] (SPARK-6048) SparkConf.translateConfKey should translate on get, not set
Andrew Or created SPARK-6048: Summary: SparkConf.translateConfKey should translate on get, not set Key: SPARK-6048 URL: https://issues.apache.org/jira/browse/SPARK-6048 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Critical There are several issues with translating on set. (1) The most serious one is that if the user has both the deprecated and the latest version of the same config set, then the value picked up by SparkConf will be arbitrary. Why? Because during initialization of the conf we call `conf.set` on each property in `sys.props` in an order arbitrarily defined by Java. Instead, we should always use the value of the latest version of the config if that is provided. (2) If we translate on set, then we must keep translating everywhere else. In fact, the current code does not translate on remove, which means the following won't work if X is deprecated: {code} conf.set(X, Y) conf.remove(X) // X is not in the conf {code} (3) Since we call `conf.set` on all configs when initializing the conf, we print all deprecation warnings in the beginning. Elsewhere in Spark, however, we warn the user when the deprecated config / option / env var is actually being used. We should keep this consistent so the user won't expect to find all deprecation messages in the beginning of his logs.
[jira] [Commented] (SPARK-5775) GenericRow cannot be cast to SpecificMutableRow when nested data and partitioned table
[ https://issues.apache.org/jira/browse/SPARK-5775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339457#comment-14339457 ] Apache Spark commented on SPARK-5775: - User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/4798 > GenericRow cannot be cast to SpecificMutableRow when nested data and > partitioned table > -- > > Key: SPARK-5775 > URL: https://issues.apache.org/jira/browse/SPARK-5775 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.1 >Reporter: Ayoub Benali >Assignee: Cheng Lian >Priority: Blocker > Labels: hivecontext, nested, parquet, partition > > Using the "LOAD" sql command in Hive context to load parquet files into a > partitioned table causes exceptions during query time. > The bug requires the table to have a column of *type Array of struct* and to > be *partitioned*. > The example below shows how to reproduce the bug, and you can see that if the > table is not partitioned the query works fine. 
> {noformat} > scala> val data1 = """{"data_array":[{"field1":1,"field2":2}]}""" > scala> val data2 = """{"data_array":[{"field1":3,"field2":4}]}""" > scala> val jsonRDD = sc.makeRDD(data1 :: data2 :: Nil) > scala> val schemaRDD = hiveContext.jsonRDD(jsonRDD) > scala> schemaRDD.printSchema > root > |-- data_array: array (nullable = true) > ||-- element: struct (containsNull = false) > |||-- field1: integer (nullable = true) > |||-- field2: integer (nullable = true) > scala> hiveContext.sql("create external table if not exists > partitioned_table(data_array ARRAY >) > Partitioned by (date STRING) STORED AS PARQUET Location > 'hdfs:///partitioned_table'") > scala> hiveContext.sql("create external table if not exists > none_partitioned_table(data_array ARRAY >) > STORED AS PARQUET Location 'hdfs:///none_partitioned_table'") > scala> schemaRDD.saveAsParquetFile("hdfs:///tmp_data_1") > scala> schemaRDD.saveAsParquetFile("hdfs:///tmp_data_2") > scala> hiveContext.sql("LOAD DATA INPATH > 'hdfs://qa-hdc001.ffm.nugg.ad:8020/erlogd/tmp_data_1' INTO TABLE > partitioned_table PARTITION(date='2015-02-12')") > scala> hiveContext.sql("LOAD DATA INPATH > 'hdfs://qa-hdc001.ffm.nugg.ad:8020/erlogd/tmp_data_2' INTO TABLE > none_partitioned_table") > scala> hiveContext.sql("select data.field1 from none_partitioned_table > LATERAL VIEW explode(data_array) nestedStuff AS data").collect > res23: Array[org.apache.spark.sql.Row] = Array([1], [3]) > scala> hiveContext.sql("select data.field1 from partitioned_table LATERAL > VIEW explode(data_array) nestedStuff AS data").collect > 15/02/12 16:21:03 INFO ParseDriver: Parsing command: select data.field1 from > partitioned_table LATERAL VIEW explode(data_array) nestedStuff AS data > 15/02/12 16:21:03 INFO ParseDriver: Parse Completed > 15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(260661) called with > curMem=0, maxMem=280248975 > 15/02/12 16:21:03 INFO MemoryStore: Block broadcast_18 stored as values in > memory (estimated size 254.6 KB, 
free 267.0 MB) > 15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(28615) called with > curMem=260661, maxMem=280248975 > 15/02/12 16:21:03 INFO MemoryStore: Block broadcast_18_piece0 stored as bytes > in memory (estimated size 27.9 KB, free 267.0 MB) > 15/02/12 16:21:03 INFO BlockManagerInfo: Added broadcast_18_piece0 in memory > on *:51990 (size: 27.9 KB, free: 267.2 MB) > 15/02/12 16:21:03 INFO BlockManagerMaster: Updated info of block > broadcast_18_piece0 > 15/02/12 16:21:03 INFO SparkContext: Created broadcast 18 from NewHadoopRDD > at ParquetTableOperations.scala:119 > 15/02/12 16:21:03 INFO FileInputFormat: Total input paths to process : 3 > 15/02/12 16:21:03 INFO ParquetInputFormat: Total input paths to process : 3 > 15/02/12 16:21:03 INFO FilteringParquetRowInputFormat: Using Task Side > Metadata Split Strategy > 15/02/12 16:21:03 INFO SparkContext: Starting job: collect at > SparkPlan.scala:84 > 15/02/12 16:21:03 INFO DAGScheduler: Got job 12 (collect at > SparkPlan.scala:84) with 3 output partitions (allowLocal=false) > 15/02/12 16:21:03 INFO DAGScheduler: Final stage: Stage 13(collect at > SparkPlan.scala:84) > 15/02/12 16:21:03 INFO DAGScheduler: Parents of final stage: List() > 15/02/12 16:21:03 INFO DAGScheduler: Missing parents: List() > 15/02/12 16:21:03 INFO DAGScheduler: Submitting Stage 13 (MappedRDD[111] at > map at SparkPlan.scala:84), which has no missing parents > 15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(7632) called with > curMem=289276, maxMem=280248975 > 15/02/12 16:21:03 INFO MemoryStore: Block broadcast_19 stored as values in > memory (estimated size 7.5 KB, free 267.0 MB) >
[jira] [Commented] (SPARK-6047) pyspark - class loading on driver failing with --jars and --packages
[ https://issues.apache.org/jira/browse/SPARK-6047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339453#comment-14339453 ] Apache Spark commented on SPARK-6047: - User 'brkyvz' has created a pull request for this issue: https://github.com/apache/spark/pull/4754 > pyspark - class loading on driver failing with --jars and --packages > > > Key: SPARK-6047 > URL: https://issues.apache.org/jira/browse/SPARK-6047 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Submit >Affects Versions: 1.3.0 >Reporter: Burak Yavuz > > Because py4j uses the system ClassLoader instead of the contextClassLoader of > the thread, the dynamically added jars in Spark Submit can't be loaded in the > driver. > This causes `Py4JError: Trying to call a package` errors. > Usually `--packages` are downloaded from some remote repo before runtime, > adding them explicitly to `--driver-class-path` is not an option, like we can > do with `--jars`. One solution is to move the fetching of `--packages` to the > SparkSubmitDriverBootstrapper, and add it to the driver class-path there. > A more complete solution can be achieved through [SPARK-4924]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6047) pyspark - class loading on driver failing with --jars and --packages
[ https://issues.apache.org/jira/browse/SPARK-6047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz updated SPARK-6047: --- Description: Because py4j uses the system ClassLoader instead of the contextClassLoader of the thread, the dynamically added jars in Spark Submit can't be loaded in the driver. This causes `Py4JError: Trying to call a package` errors. Usually `--packages` are downloaded from some remote repo before runtime, adding them explicitly to `--driver-class-path` is not an option, like we can do with `--jars`. One solution is to move the fetching of `--packages` to the SparkSubmitDriverBootstrapper, and add it to the driver class-path there. A more complete solution can be achieved through [SPARK-4924]. was: Because py4j uses the system ClassLoader instead of the contextClassLoader of the thread, the dynamically added jars in Spark Submit can't be loaded in the driver. This causes "package not found" errors in py4j. > pyspark - class loading on driver failing with --jars and --packages > > > Key: SPARK-6047 > URL: https://issues.apache.org/jira/browse/SPARK-6047 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Submit >Affects Versions: 1.3.0 >Reporter: Burak Yavuz > > Because py4j uses the system ClassLoader instead of the contextClassLoader of > the thread, the dynamically added jars in Spark Submit can't be loaded in the > driver. > This causes `Py4JError: Trying to call a package` errors. > Usually `--packages` are downloaded from some remote repo before runtime, > adding them explicitly to `--driver-class-path` is not an option, like we can > do with `--jars`. One solution is to move the fetching of `--packages` to the > SparkSubmitDriverBootstrapper, and add it to the driver class-path there. > A more complete solution can be achieved through [SPARK-4924]. 
[jira] [Created] (SPARK-6047) pyspark - class loading on driver failing with --jars and --packages
Burak Yavuz created SPARK-6047: -- Summary: pyspark - class loading on driver failing with --jars and --packages Key: SPARK-6047 URL: https://issues.apache.org/jira/browse/SPARK-6047 Project: Spark Issue Type: Bug Components: PySpark, Spark Submit Affects Versions: 1.3.0 Reporter: Burak Yavuz Because py4j uses the system ClassLoader instead of the contextClassLoader of the thread, the dynamically added jars in Spark Submit can't be loaded in the driver. This causes "package not found" errors in py4j. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
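The class-loader distinction this issue hinges on can be shown in a self-contained way. A hedged sketch (no real jars involved; the loader name is illustrative): a `URLClassLoader` created at runtime, as Spark Submit does for dynamically added jars, is only reachable through the thread's context class loader if installed there — the system class loader that py4j consults never learns about it.

```scala
import java.net.{URL, URLClassLoader}

// A class loader created at runtime, as might wrap a dynamically added jar.
// (Empty URL list here -- this demonstrates loader identity, not class loading.)
val dynamicLoader = new URLClassLoader(Array.empty[URL], ClassLoader.getSystemClassLoader)

// Installing it on the current thread makes it visible via the context class
// loader, but the system class loader is unchanged -- so code that consults
// only the system loader (as py4j does) cannot see anything it provides.
Thread.currentThread().setContextClassLoader(dynamicLoader)

assert(Thread.currentThread().getContextClassLoader eq dynamicLoader)
assert(ClassLoader.getSystemClassLoader ne dynamicLoader)
```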
[jira] [Closed] (SPARK-5942) DataFrame should not do query optimization when dataFrameEagerAnalysis is off
[ https://issues.apache.org/jira/browse/SPARK-5942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh closed SPARK-5942. -- Resolution: Won't Fix > DataFrame should not do query optimization when dataFrameEagerAnalysis is off > - > > Key: SPARK-5942 > URL: https://issues.apache.org/jira/browse/SPARK-5942 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh >Priority: Minor > > DataFrame will force query optimization to happen right away for the commands > and queries with side effects. > However, I think we should not do that when dataFrameEagerAnalysis is off. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2989) Error sending message to BlockManagerMaster
[ https://issues.apache.org/jira/browse/SPARK-2989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-2989: - Component/s: (was: Deploy) Priority: Major (was: Critical) I'm not sure if there's enough info here. This basically says the executor couldn't talk to the block manager. Do you have any more detail? this itself isn't the error, but some underlying cause. > Error sending message to BlockManagerMaster > --- > > Key: SPARK-2989 > URL: https://issues.apache.org/jira/browse/SPARK-2989 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 1.0.2 >Reporter: pengyanhong > > run a simple hive sql Spark App via yarn-cluster, got 3 segments log content > via yarn logs --applicationID command line, the detail as below: > * 1st segment is about the Driver & Application Master, everything is fine > without error, start time is 16:43:49 and end time is 16:44:08. > * 2nd & 3rd segment is about the Executor, the start time is 16:43:52, then > from 16:44:38 encounter many times error as below: > {quote} > WARN org.apache.spark.Logging$class.logWarning(Logging.scala:91): Error > sending message to BlockManagerMaster in 1 attempts > java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] > at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) > at > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) > at > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > at scala.concurrent.Await$.result(package.scala:107) > at > org.apache.spark.storage.BlockManagerMaster.askDriverWithReply(BlockManagerMaster.scala:237) > at > org.apache.spark.storage.BlockManagerMaster.sendHeartBeat(BlockManagerMaster.scala:51) > at > org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$heartBeat(BlockManager.scala:113) > at > 
org.apache.spark.storage.BlockManager$$anonfun$initialize$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(BlockManager.scala:158) > at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:790) > at > org.apache.spark.storage.BlockManager$$anonfun$initialize$1.apply$mcV$sp(BlockManager.scala:158) > at akka.actor.Scheduler$$anon$9.run(Scheduler.scala:80) > at > akka.actor.LightArrayRevolverScheduler$$anon$3$$anon$2.run(Scheduler.scala:241) > at > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) > at java.lang.Thread.run(Thread.java:662) > 14/08/12 16:45:31 WARN > org.apache.spark.Logging$class.logWarning(Logging.scala:91): Error sending > message to BlockManagerMaster in 2 attempts > .. > {quote} > confirmed that the date time of 3 nodes is sync. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5066) Can not get all key that has same hashcode when reading key ordered from different Streaming.
[ https://issues.apache.org/jira/browse/SPARK-5066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5066. -- Resolution: Not a Problem > Can not get all key that has same hashcode when reading key ordered from > different Streaming. > --- > > Key: SPARK-5066 > URL: https://issues.apache.org/jira/browse/SPARK-5066 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.2.0 >Reporter: DoingDone9 >Priority: Critical > > when spill is open, data ordered by hashCode will be spilled to disk. We need > get all key that has the same hashCode from different tmp files when merge > value, but it just read the key that has the minHashCode that in a tmp file, > we can not read all key. > Example : > If file1 has [k1, k2, k3], file2 has [k4,k5,k1]. > And hashcode of k4 < hashcode of k5 < hashcode of k1 < hashcode of k2 < > hashcode of k3 > we just read k1 from file1 and k4 from file2. Can not read all k1. > Code : > private val inputStreams = (Seq(sortedMap) ++ spilledMaps).map(it => > it.buffered) > inputStreams.foreach { it => > val kcPairs = new ArrayBuffer[(K, C)] > readNextHashCode(it, kcPairs) > if (kcPairs.length > 0) { > mergeHeap.enqueue(new StreamBuffer(it, kcPairs)) > } > } > private def readNextHashCode(it: BufferedIterator[(K, C)], buf: > ArrayBuffer[(K, C)]): Unit = { > if (it.hasNext) { > var kc = it.next() > buf += kc > val minHash = hashKey(kc) > while (it.hasNext && it.head._1.hashCode() == minHash) { > kc = it.next() > buf += kc > } > } > } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
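The defect described above is that only one stream's batch is read per minimum hash code. The intended merge step can be sketched standalone — `drainMinHash` is a made-up name for illustration, not the actual ExternalAppendOnlyMap code: every live stream must be drained of all pairs whose key shares the current minimum hash, so that same-hash keys from different spill files meet in the same merge round.

```scala
def drainMinHash(streams: Seq[BufferedIterator[(String, Int)]]): Seq[(String, Int)] = {
  val live = streams.filter(_.hasNext)
  if (live.isEmpty) Nil
  else {
    // each stream is sorted by key hash code, so its head carries its minimum
    val minHash = live.map(_.head._1.hashCode).min
    live.flatMap { it =>
      val buf = scala.collection.mutable.ArrayBuffer.empty[(String, Int)]
      // drain EVERY pair whose key shares the minimum hash, from EVERY stream
      while (it.hasNext && it.head._1.hashCode == minHash) buf += it.next()
      buf
    }
  }
}
```

With two streams holding keys "a","b" and "a", one round drains both copies of "a" and leaves "b" for the next round.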
[jira] [Updated] (SPARK-2356) Exception: Could not locate executable null\bin\winutils.exe in the Hadoop
[ https://issues.apache.org/jira/browse/SPARK-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-2356: - Component/s: (was: Spark Core) Windows > Exception: Could not locate executable null\bin\winutils.exe in the Hadoop > --- > > Key: SPARK-2356 > URL: https://issues.apache.org/jira/browse/SPARK-2356 > Project: Spark > Issue Type: Bug > Components: Windows >Affects Versions: 1.0.0 >Reporter: Kostiantyn Kudriavtsev >Priority: Critical > > I'm trying to run some transformation on Spark, it works fine on cluster > (YARN, linux machines). However, when I'm trying to run it on local machine > (Windows 7) under unit test, I got errors (I don't use Hadoop, I'm read file > from local filesystem): > {code} > 14/07/02 19:59:31 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > 14/07/02 19:59:31 ERROR Shell: Failed to locate the winutils binary in the > hadoop binary path > java.io.IOException: Could not locate executable null\bin\winutils.exe in the > Hadoop binaries. 
> at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318) > at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333) > at org.apache.hadoop.util.Shell.(Shell.java:326) > at org.apache.hadoop.util.StringUtils.(StringUtils.java:76) > at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93) > at org.apache.hadoop.security.Groups.(Groups.java:77) > at > org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240) > at > org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255) > at > org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283) > at > org.apache.spark.deploy.SparkHadoopUtil.(SparkHadoopUtil.scala:36) > at > org.apache.spark.deploy.SparkHadoopUtil$.(SparkHadoopUtil.scala:109) > at > org.apache.spark.deploy.SparkHadoopUtil$.(SparkHadoopUtil.scala) > at org.apache.spark.SparkContext.(SparkContext.scala:228) > at org.apache.spark.SparkContext.(SparkContext.scala:97) > {code} > It's happened because Hadoop config is initialized each time when spark > context is created regardless is hadoop required or not. > I propose to add some special flag to indicate if hadoop config is required > (or start this configuration manually) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2348) In Windows having a enviorinment variable named 'classpath' gives error
[ https://issues.apache.org/jira/browse/SPARK-2348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-2348: - Component/s: (was: Spark Core) Windows Priority: Major (was: Critical) I think that in general you shouldn't have a global CLASSPATH env variable set (in any platform). Hm, why would you want Scala to use it? I'm not getting why that's the fix. > In Windows having a enviorinment variable named 'classpath' gives error > --- > > Key: SPARK-2348 > URL: https://issues.apache.org/jira/browse/SPARK-2348 > Project: Spark > Issue Type: Bug > Components: Windows >Affects Versions: 1.0.0 > Environment: Windows 7 Enterprise >Reporter: Chirag Todarka >Assignee: Chirag Todarka > > Operating System:: Windows 7 Enterprise > If having enviorinment variable named 'classpath' gives then starting > 'spark-shell' gives below error:: > \spark\bin>spark-shell > Failed to initialize compiler: object scala.runtime in compiler mirror not > found > . > ** Note that as of 2.8 scala does not assume use of the java classpath. > ** For the old behavior pass -usejavacp to scala, or if using a Settings > ** object programatically, settings.usejavacp.value = true. > 14/07/02 14:22:06 WARN SparkILoop$SparkILoopInterpreter: Warning: compiler > acces > sed before init set up. Assuming no postInit code. > Failed to initialize compiler: object scala.runtime in compiler mirror not > found > . > ** Note that as of 2.8 scala does not assume use of the java classpath. > ** For the old behavior pass -usejavacp to scala, or if using a Settings > ** object programatically, settings.usejavacp.value = true. 
> Exception in thread "main" java.lang.AssertionError: assertion failed: null > at scala.Predef$.assert(Predef.scala:179) > at > org.apache.spark.repl.SparkIMain.initializeSynchronous(SparkIMain.sca > la:202) > at > org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(Spar > kILoop.scala:929) > at > org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop. > scala:884) > at > org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop. > scala:884) > at > scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClass > Loader.scala:135) > at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884) > at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982) > at org.apache.spark.repl.Main$.main(Main.scala:31) > at org.apache.spark.repl.Main.main(Main.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) > at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) > at java.lang.reflect.Method.invoke(Unknown Source) > at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6046) Provide an easier way for developers to handle deprecated configs
[ https://issues.apache.org/jira/browse/SPARK-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339420#comment-14339420 ] Apache Spark commented on SPARK-6046: - User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/4797 > Provide an easier way for developers to handle deprecated configs > - > > Key: SPARK-6046 > URL: https://issues.apache.org/jira/browse/SPARK-6046 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Minor > > Right now we have code that looks like this: > https://github.com/apache/spark/blob/8942b522d8a3269a2a357e3a274ed4b3e66ebdde/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L52 > where a random class calls `SparkConf.translateConfKey` to warn the user > against a deprecated configs. We should refactor this slightly so we can make > `translateConfKey` private instead of calling it from everywhere. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6043) Error when trying to rename table with alter table after using INSERT OVERWITE to populate the table
[ https://issues.apache.org/jira/browse/SPARK-6043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6043: - Component/s: SQL > Error when trying to rename table with alter table after using INSERT > OVERWITE to populate the table > > > Key: SPARK-6043 > URL: https://issues.apache.org/jira/browse/SPARK-6043 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.1 >Reporter: Trystan Leftwich >Priority: Minor > > If you populate a table using INSERT OVERWRITE and then try to rename the > table using alter table it fails with: > {noformat} > Error: org.apache.spark.sql.execution.QueryExecutionException: FAILED: > Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. > Unable to alter table. (state=,code=0) > {noformat} > Using the following SQL statement creates the error: > {code:sql} > CREATE TABLE `tmp_table` (salesamount_c1 DOUBLE); > INSERT OVERWRITE table tmp_table SELECT >MIN(sales_customer.salesamount) salesamount_c1 > FROM > ( > SELECT > SUM(sales.salesamount) salesamount > FROM > internalsales sales > ) sales_customer; > ALTER TABLE tmp_table RENAME to not_tmp; > {code} > But if you change the 'OVERWRITE' to be 'INTO' the SQL statement works. > This is happening on our CDH5.3 cluster with multiple workers, If we use the > CDH5.3 Quickstart VM the SQL does not produce an error. Both cases were spark > 1.2.1 built for hadoop2.4+ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6046) Provide an easier way for developers to handle deprecated configs
[ https://issues.apache.org/jira/browse/SPARK-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-6046: - Priority: Minor (was: Major) > Provide an easier way for developers to handle deprecated configs > - > > Key: SPARK-6046 > URL: https://issues.apache.org/jira/browse/SPARK-6046 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Minor > > Right now we have code that looks like this: > https://github.com/apache/spark/blob/8942b522d8a3269a2a357e3a274ed4b3e66ebdde/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L52 > where a random class calls `SparkConf.translateConfKey` to warn the user > against a deprecated configs. We should refactor this slightly so we can make > `translateConfKey` private instead of calling it from everywhere. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4579) Scheduling Delay appears negative
[ https://issues.apache.org/jira/browse/SPARK-4579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339412#comment-14339412 ] Apache Spark commented on SPARK-4579: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/4796 > Scheduling Delay appears negative > - > > Key: SPARK-4579 > URL: https://issues.apache.org/jira/browse/SPARK-4579 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.2.0 >Reporter: Arun Ahuja >Assignee: Andrew Or > > !https://cloud.githubusercontent.com/assets/455755/5174438/23d08604-73ff-11e4-9a76-97233b610544.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4579) Scheduling Delay appears negative
[ https://issues.apache.org/jira/browse/SPARK-4579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4579: - Priority: Major (was: Critical) > Scheduling Delay appears negative > - > > Key: SPARK-4579 > URL: https://issues.apache.org/jira/browse/SPARK-4579 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.2.0 >Reporter: Arun Ahuja >Assignee: Andrew Or > > !https://cloud.githubusercontent.com/assets/455755/5174438/23d08604-73ff-11e4-9a76-97233b610544.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4571) History server shows negative time
[ https://issues.apache.org/jira/browse/SPARK-4571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4571. -- Resolution: Fixed Fix Version/s: 1.3.0 Assignee: Masayoshi TSUZUKI I'm quite sure this was solved as part of SPARK-2458, and this change: https://github.com/apache/spark/commit/6e74edeca31acd7dc84a34402e430e017591d858#diff-a19a4359f1a7f63bc020acf145664af4R132 > History server shows negative time > -- > > Key: SPARK-4571 > URL: https://issues.apache.org/jira/browse/SPARK-4571 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Andrew Or >Assignee: Masayoshi TSUZUKI > Fix For: 1.3.0 > > Attachments: Screen Shot 2014-11-21 at 2.49.25 PM.png > > > See attachment -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6045) RecordWriter should be checked against null in PairRDDFunctions#saveAsNewAPIHadoopDataset
[ https://issues.apache.org/jira/browse/SPARK-6045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6045: - Component/s: Input/Output Assignee: Ted Yu > RecordWriter should be checked against null in > PairRDDFunctions#saveAsNewAPIHadoopDataset > - > > Key: SPARK-6045 > URL: https://issues.apache.org/jira/browse/SPARK-6045 > Project: Spark > Issue Type: Bug > Components: Input/Output >Reporter: Ted Yu >Assignee: Ted Yu >Priority: Trivial > Fix For: 1.4.0 > > > gtinside reported in the thread 'NullPointerException in TaskSetManager' with > the following stack trace: > {code} > WARN 2015-02-26 14:21:43,217 [task-result-getter-0] TaskSetManager - Lost > task 14.2 in stage 0.0 (TID 29, devntom003.dev.blackrock.com): > java.lang.NullPointerException > org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1007) > com.bfm.spark.test.CassandraHadoopMigrator$.main(CassandraHadoopMigrator.scala:77) > com.bfm.spark.test.CassandraHadoopMigrator.main(CassandraHadoopMigrator.scala) > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > java.lang.reflect.Method.invoke(Method.java:606) > org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358) > org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75) > org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > {code} > Looks like the following call in finally block was the cause: > {code} > writer.close(hadoopContext) > {code} > We should check writer against null before calling close(). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6045) RecordWriter should be checked against null in PairRDDFunctions#saveAsNewAPIHadoopDataset
[ https://issues.apache.org/jira/browse/SPARK-6045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6045. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 4794 [https://github.com/apache/spark/pull/4794] > RecordWriter should be checked against null in > PairRDDFunctions#saveAsNewAPIHadoopDataset > - > > Key: SPARK-6045 > URL: https://issues.apache.org/jira/browse/SPARK-6045 > Project: Spark > Issue Type: Bug >Reporter: Ted Yu >Priority: Trivial > Fix For: 1.4.0 > > > gtinside reported in the thread 'NullPointerException in TaskSetManager' with > the following stack trace: > {code} > WARN 2015-02-26 14:21:43,217 [task-result-getter-0] TaskSetManager - Lost > task 14.2 in stage 0.0 (TID 29, devntom003.dev.blackrock.com): > java.lang.NullPointerException > org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1007) > com.bfm.spark.test.CassandraHadoopMigrator$.main(CassandraHadoopMigrator.scala:77) > com.bfm.spark.test.CassandraHadoopMigrator.main(CassandraHadoopMigrator.scala) > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > java.lang.reflect.Method.invoke(Method.java:606) > org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358) > org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75) > org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > {code} > Looks like the following call in finally block was the cause: > {code} > writer.close(hadoopContext) > {code} > We should check writer against null before calling close(). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
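The fix pattern here is the standard null-guarded close in a finally block. A minimal standalone sketch — `writeSafely` and `openWriter` are hypothetical names for illustration, not the actual `PairRDDFunctions` code: if the writer was never assigned because the open itself threw, an unguarded `close()` in the finally block raises an NPE that masks the original failure.

```scala
def writeSafely(openWriter: () => java.io.Writer): Unit = {
  var writer: java.io.Writer = null
  try {
    writer = openWriter()
    writer.write("record")
  } finally {
    if (writer != null) {  // guard: openWriter() itself may have thrown
      writer.close()
    }
  }
}
```

Without the null check, a failure inside `openWriter()` would surface as a `NullPointerException` from the finally block rather than as the real error.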
[jira] [Commented] (SPARK-5977) PySpark SPARK_CLASSPATH doesn't distribute jars to executors
[ https://issues.apache.org/jira/browse/SPARK-5977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339344#comment-14339344 ] Michael Nazario commented on SPARK-5977: I tried setting spark.executor.extraClassPath in the SparkConf and --driver-class-path in PYSPARK_SUBMIT_ARGS, and neither of those helped. > PySpark SPARK_CLASSPATH doesn't distribute jars to executors > > > Key: SPARK-5977 > URL: https://issues.apache.org/jira/browse/SPARK-5977 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.2.1 >Reporter: Michael Nazario > Labels: jars > > In PySpark 1.2.1, I added a jar for avro support similar to the one in > spark-examples. This jar I need to convert avro files into rows. However, in > the worker logs, I kept getting a ClassNotFoundException for my > AvroToPythonConverter class. > I double checked the jar to make sure the class was in there which it was. I > made sure I used the SPARK_CLASSPATH environment variable to place this jar > on the executor and driver classpaths. I then checked the application web UI > which also had this jar on both the executor and driver classpaths. > The final thing I tried was explicitly dropping the jars in the same location > as on my driver machine. That made the ClassNotFoundException go away. > This makes me think that the jars which back in 1.1.1 used to be sent to the > workers are no longer being sent over. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1673) GLMNET implementation in Spark
[ https://issues.apache.org/jira/browse/SPARK-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339331#comment-14339331 ] Joseph K. Bradley commented on SPARK-1673: -- That sounds good---I'll look forward to hearing how it does! > GLMNET implementation in Spark > -- > > Key: SPARK-1673 > URL: https://issues.apache.org/jira/browse/SPARK-1673 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Sung Chung > > This is a Spark implementation of GLMNET by Jerome Friedman, Trevor Hastie, > Rob Tibshirani. > http://www.jstatsoft.org/v33/i01/paper > It's a straightforward implementation of the Coordinate-Descent based L1/L2 > regularized linear models, including Linear/Logistic/Multinomial regressions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1673) GLMNET implementation in Spark
[ https://issues.apache.org/jira/browse/SPARK-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339320#comment-14339320 ] mike bowles commented on SPARK-1673: Good discussion. I can see how it might be faster to propagate an approximate path as a way to provide good starting conditions for an accurate iteration. To some extent the accuracy of the glmnet path can be modulated by loosening the convergence criteria for the inner iteration (the iteration done to find the new minimum after the penalty parameter is decremented). The big time sink is making passes through the data. With glmnet regression the inner iterations don't require passes through the data, so they are much less expensive than the steps in the penalty parameter, which may provoke a pass through the data to deal with a new element being added to the active list. It would be interesting to see what happens if the active set of coefficients were constrained to change less frequently than the penalty parameter. I have a hunch that it might take more (inexpensive) inner iterations to converge than when the coefficients were allowed to change, but it would save passes through the data. It would be relatively easy for us to implement this in our code. We can try only letting the active set change every other or every third step in the penalty parameter and see how much change it makes in the coefficient curves. Thanks for the idea.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
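The schedule Bowles proposes, letting the expensive active-set update run only every k-th penalty step, can be sketched as loop structure. This is illustrative bookkeeping only, not a coordinate-descent solver; all names are assumptions:

```python
# Illustrative loop structure (not a real GLMNET implementation): a full
# data pass, which can grow the active set, runs only every k-th step along
# the penalty path; the remaining steps do cheap inner coordinate-descent
# updates restricted to the current active set.
def penalty_path_schedule(n_steps, k):
    data_passes = 0
    plan = []
    for step in range(n_steps):
        if step % k == 0:
            data_passes += 1                   # full scan: active set may change
            plan.append((step, "scan+inner"))
        else:
            plan.append((step, "inner-only"))  # no pass through the data
    return data_passes, plan
```

With 100 penalty steps and k = 3, the full-data scans drop from 100 to 34, which is the kind of saving the comment anticipates, at the possible cost of extra (cheap) inner iterations.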
[jira] [Closed] (SPARK-5951) Remove unreachable driver memory properties in yarn client mode (YarnClientSchedulerBackend)
[ https://issues.apache.org/jira/browse/SPARK-5951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-5951. Resolution: Fixed Target Version/s: 1.3.0 > Remove unreachable driver memory properties in yarn client mode > (YarnClientSchedulerBackend) > > > Key: SPARK-5951 > URL: https://issues.apache.org/jira/browse/SPARK-5951 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.3.0 > Environment: yarn >Reporter: Shekhar Bansal >Assignee: Shekhar Bansal >Priority: Trivial > Fix For: 1.3.0 > > > In SPARK-4730 a warning for deprecated configs was added, > and in SPARK-1953 the driver memory configs were removed in yarn client mode. > During integration, spark.master.memory and SPARK_MASTER_MEMORY were not > removed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5951) Remove unreachable driver memory properties in yarn client mode (YarnClientSchedulerBackend)
[ https://issues.apache.org/jira/browse/SPARK-5951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5951: - Assignee: Shekhar Bansal -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4300) Race condition during SparkWorker shutdown
[ https://issues.apache.org/jira/browse/SPARK-4300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4300: - Labels: backport-needed (was: ) > Race condition during SparkWorker shutdown > -- > > Key: SPARK-4300 > URL: https://issues.apache.org/jira/browse/SPARK-4300 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 1.1.0 >Reporter: Alex Liu >Assignee: Sean Owen >Priority: Minor > Labels: backport-needed > Fix For: 1.2.2, 1.4.0 > > > When a Shark job is done, there are error messages like the following in > the log: > {code} > INFO 22:10:41,635 SparkMaster: > akka.tcp://sparkdri...@ip-172-31-11-204.us-west-1.compute.internal:57641 got > disassociated, removing it. > INFO 22:10:41,640 SparkMaster: Removing app app-20141106221014- > INFO 22:10:41,687 SparkMaster: Removing application > Shark::ip-172-31-11-204.us-west-1.compute.internal > INFO 22:10:41,710 SparkWorker: Asked to kill executor > app-20141106221014-/0 > INFO 22:10:41,712 SparkWorker: Runner thread for executor > app-20141106221014-/0 interrupted > INFO 22:10:41,714 SparkWorker: Killing process! 
> ERROR 22:10:41,738 SparkWorker: Error writing stream to file > /var/lib/spark/work/app-20141106221014-/0/stdout > ERROR 22:10:41,739 SparkWorker: java.io.IOException: Stream closed > ERROR 22:10:41,739 SparkWorker: at > java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:162) > ERROR 22:10:41,740 SparkWorker: at > java.io.BufferedInputStream.read1(BufferedInputStream.java:272) > ERROR 22:10:41,740 SparkWorker: at > java.io.BufferedInputStream.read(BufferedInputStream.java:334) > ERROR 22:10:41,740 SparkWorker: at > java.io.FilterInputStream.read(FilterInputStream.java:107) > ERROR 22:10:41,741 SparkWorker: at > org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:70) > ERROR 22:10:41,741 SparkWorker: at > org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39) > ERROR 22:10:41,741 SparkWorker: at > org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39) > ERROR 22:10:41,742 SparkWorker: at > org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39) > ERROR 22:10:41,742 SparkWorker: at > org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311) > ERROR 22:10:41,742 SparkWorker: at > org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38) > INFO 22:10:41,838 SparkMaster: Connected to Cassandra cluster: 4299 > INFO 22:10:41,839 SparkMaster: Adding host 172.31.11.204 (Analytics) > INFO 22:10:41,840 SparkMaster: New Cassandra host /172.31.11.204:9042 added > INFO 22:10:41,841 SparkMaster: Adding host 172.31.11.204 (Analytics) > INFO 22:10:41,842 SparkMaster: Adding host 172.31.11.204 (Analytics) > INFO 22:10:41,852 SparkMaster: > akka.tcp://sparkdri...@ip-172-31-11-204.us-west-1.compute.internal:57641 got > disassociated, removing it. > INFO 22:10:41,853 SparkMaster: > akka.tcp://sparkdri...@ip-172-31-11-204.us-west-1.compute.internal:57641 got > disassociated, removing it. 
> INFO 22:10:41,853 SparkMaster: > akka.tcp://sparkdri...@ip-172-31-11-204.us-west-1.compute.internal:57641 got > disassociated, removing it. > INFO 22:10:41,857 SparkMaster: > akka.tcp://sparkdri...@ip-172-31-11-204.us-west-1.compute.internal:57641 got > disassociated, removing it. > INFO 22:10:41,862 SparkMaster: Adding host 172.31.11.204 (Analytics) > WARN 22:10:42,200 SparkMaster: Got status update for unknown executor > app-20141106221014-/0 > INFO 22:10:42,211 SparkWorker: Executor app-20141106221014-/0 finished > with state KILLED exitStatus 143 > {code} > /var/lib/spark/work/app-20141106221014-/0/stdout is on the disk. It is > trying to write to a closed IO stream. > The Spark worker shuts down via: {code} > private def killProcess(message: Option[String]) { > var exitCode: Option[Int] = None > logInfo("Killing process!") > process.destroy() > process.waitFor() > if (stdoutAppender != null) { > stdoutAppender.stop() > } > if (stderrAppender != null) { > stderrAppender.stop() > } > if (process != null) { > exitCode = Some(process.waitFor()) > } > worker ! ExecutorStateChanged(appId, execId, state, message, exitCode) > } > {code} > But stdoutAppender concurrently writes to the output log file, which creates a race > condition. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-794) Remove sleep() in ClusterScheduler.stop
[ https://issues.apache.org/jira/browse/SPARK-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-794: Target Version/s: (was: 1.2.1) Fix Version/s: 1.2.2 Assignee: Brennon York Labels: (was: backport-needed) Backported to 1.2 > Remove sleep() in ClusterScheduler.stop > --- > > Key: SPARK-794 > URL: https://issues.apache.org/jira/browse/SPARK-794 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 0.9.0 >Reporter: Matei Zaharia >Assignee: Brennon York > Fix For: 1.3.0, 1.2.2 > > > This temporary change made a while back slows down the unit tests quite a bit. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-794) Remove sleep() in ClusterScheduler.stop
[ https://issues.apache.org/jira/browse/SPARK-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-794. - Resolution: Fixed -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-4300) Race condition during SparkWorker shutdown
[ https://issues.apache.org/jira/browse/SPARK-4300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-4300. Resolution: Fixed Fix Version/s: 1.4.0 1.2.2 Assignee: Sean Owen Target Version/s: 1.3.0, 1.2.2, 1.4.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Reopened] (SPARK-4300) Race condition during SparkWorker shutdown
[ https://issues.apache.org/jira/browse/SPARK-4300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or reopened SPARK-4300: -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5546) Improve path to Kafka assembly when trying Kafka Python API
[ https://issues.apache.org/jira/browse/SPARK-5546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5546: - Affects Version/s: 1.3.0 > Improve path to Kafka assembly when trying Kafka Python API > --- > > Key: SPARK-5546 > URL: https://issues.apache.org/jira/browse/SPARK-5546 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.3.0 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Blocker > Fix For: 1.3.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6018) NoSuchMethodError in Spark app is swallowed by YARN AM
[ https://issues.apache.org/jira/browse/SPARK-6018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-6018: - Assignee: Cheolsoo Park > NoSuchMethodError in Spark app is swallowed by YARN AM > -- > > Key: SPARK-6018 > URL: https://issues.apache.org/jira/browse/SPARK-6018 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.2.0 >Reporter: Cheolsoo Park >Assignee: Cheolsoo Park >Priority: Minor > Labels: yarn > Fix For: 1.3.0, 1.2.2 > > > I discovered this bug while testing the 1.3 RC with an old 1.2 Spark job. > Due to changes in DF and SchemaRDD, my app failed with > {{java.lang.NoSuchMethodError}}. However, the AM was marked as succeeded, and the > error was silently swallowed. > The problem is that the pattern matching in the Spark AM fails to catch > NoSuchMethodError: > {code} > 15/02/25 20:13:27 INFO cluster.YarnClusterScheduler: > YarnClusterScheduler.postStartHook done > Exception in thread "Driver" scala.MatchError: java.lang.NoSuchMethodError: > org.apache.spark.sql.hive.HiveContext.table(Ljava/lang/String;)Lorg/apache/spark/sql/SchemaRDD; > (of class java.lang.NoSuchMethodError) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:485) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
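The pitfall is general: in Scala, `case e: Exception` does not match `java.lang.Error` subclasses such as NoSuchMethodError, so the AM's handler itself fails with a MatchError. A rough Python analogue of the same too-narrow-handler mistake (illustrative only; all names here are invented for the example):

```python
# Illustrative analogue only: Python's `except Exception` misses
# BaseException subclasses (SystemExit, KeyboardInterrupt) in the same way
# Scala's `case e: Exception` misses java.lang.Error subclasses like
# NoSuchMethodError.
def run_user_main(user_main):
    try:
        user_main()
        return "SUCCEEDED"
    except Exception as exc:  # too narrow: some failures slip through
        return "FAILED: " + repr(exc)

def bad_app():
    raise SystemExit(1)  # not an Exception; escapes the handler above
```

Here a `SystemExit` escapes the handler entirely, so the supervising code never records the failure; the Spark fix was to broaden what the AM's handler matches.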
[jira] [Updated] (SPARK-6018) NoSuchMethodError in Spark app is swallowed by YARN AM
[ https://issues.apache.org/jira/browse/SPARK-6018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-6018: - Affects Version/s: 1.2.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-6018) NoSuchMethodError in Spark app is swallowed by YARN AM
[ https://issues.apache.org/jira/browse/SPARK-6018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-6018. Resolution: Fixed Fix Version/s: 1.2.2 1.3.0 Target Version/s: 1.3.0, 1.2.2 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org