Re: debug jsonRDD problem?
On Wed, May 27, 2015 at 02:06:16PM -0700, Ted Yu wrote:
> Looks like the exception was caused by resolved.get(prefix ++ a) returning None:
>
>     a => StructField(a.head, resolved.get(prefix ++ a).get, nullable = true)
>
> There are three occurrences of resolved.get() in createSchema() - None should
> be better handled in these places. My two cents.

Here's the simplest test case I've come up with:

    sqlContext.jsonRDD(sc.parallelize(Array("{\"'```'\":\"\"}"))).count()

Mike Stone

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
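Ted's diagnosis can be reproduced outside Spark: the failure is a bare `.get` on an `Option` that is `None`. A minimal sketch of the pattern and a safer alternative — the map contents below are hypothetical, standing in for JsonRDD's resolved field-type map, not the library's actual data:

```scala
// Hypothetical stand-in for JsonRDD's `resolved` map of field paths to types.
val resolved: Map[Seq[String], String] = Map(Seq("a") -> "StringType")

// The failing pattern: Option.get throws when the key is absent.
// resolved.get(Seq("missing")).get  // => java.util.NoSuchElementException: None.get

// A safer pattern: handle the missing key explicitly and name the field path,
// so the error points at the offending field instead of a bare None.get.
val key = Seq("missing")
val fieldType = resolved.getOrElse(key, s"unresolved field path: ${key.mkString(".")}")
println(fieldType)
```

The point is only that the three `resolved.get(...)` call sites in createSchema() could report *which* key was unresolved rather than throwing an opaque NoSuchElementException.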
Re: debug jsonRDD problem?
On Wed, May 27, 2015 at 01:13:43PM -0700, Ted Yu wrote:
> Can you tell us a bit more about (schema of) your JSON ?

It's fairly simple, consisting of 22 fields with values that are mostly strings or integers, except that some of the fields are objects with http header/value pairs. I'd guess it's something in those latter fields that is causing the problems. The data is 800M rows that I didn't create in the first place, and I'm in the process of making a simpler test case. What I was mostly wondering is whether there is an obvious mechanism I'm just missing to get jsonRDD to spit out more information about which specific rows it's having problems with.

> You can find sample JSON in
> sql/core/src/test/scala/org/apache/spark/sql/json/TestJsonData.scala

I know jsonRDD works in general; I've used it before without problems. It even works on subsets of this data.

Mike Stone
debug jsonRDD problem?
Can anyone provide some suggestions on how to debug this? Using Spark 1.3.1. The JSON itself seems to be valid (other programs can parse it), and the problem seems to lie in jsonRDD trying to describe & use a schema.

    scala> sqlContext.jsonRDD(rdd).count()
    java.util.NoSuchElementException: None.get
      at scala.None$.get(Option.scala:313)
      at scala.None$.get(Option.scala:311)
      at org.apache.spark.sql.json.JsonRDD$$anonfun$14.apply(JsonRDD.scala:105)
      at org.apache.spark.sql.json.JsonRDD$$anonfun$14.apply(JsonRDD.scala:101)
      at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
      at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
      at scala.collection.immutable.Map$Map1.foreach(Map.scala:109)
      at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
      at scala.collection.AbstractTraversable.map(Traversable.scala:105)
      at org.apache.spark.sql.json.JsonRDD$.org$apache$spark$sql$json$JsonRDD$$makeStruct$1(JsonRDD.scala:101)
      at org.apache.spark.sql.json.JsonRDD$$anonfun$14.apply(JsonRDD.scala:104)
      at org.apache.spark.sql.json.JsonRDD$$anonfun$14.apply(JsonRDD.scala:101)
      at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
      at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
      at scala.collection.immutable.Map$Map2.foreach(Map.scala:130)
      at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
      at scala.collection.AbstractTraversable.map(Traversable.scala:105)
      at org.apache.spark.sql.json.JsonRDD$.org$apache$spark$sql$json$JsonRDD$$makeStruct$1(JsonRDD.scala:101)
      at org.apache.spark.sql.json.JsonRDD$.createSchema(JsonRDD.scala:132)
      at org.apache.spark.sql.json.JsonRDD$.inferSchema(JsonRDD.scala:56)
      at org.apache.spark.sql.SQLContext.jsonRDD(SQLContext.scala:635)
      at org.apache.spark.sql.SQLContext.jsonRDD(SQLContext.scala:581)
    [...]
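Since jsonRDD doesn't report which row tripped the schema inference, one workaround is to bisect the input until the slice that still throws is small enough to inspect by hand. A rough spark-shell sketch — it assumes `sqlContext` and `sc` from the session above and a candidate subset already pulled into a local `Seq[String]` (not practical over all 800M rows, but usable once a failing subset is known):

```scala
import scala.util.Try

// Hedged sketch: return a minimal-ish slice of `lines` on which
// jsonRDD schema inference still throws. Each probe re-runs inference
// on half the data; the half that still fails contains an offending row.
def failingSlice(lines: Seq[String]): Seq[String] = {
  def throws(s: Seq[String]): Boolean =
    Try(sqlContext.jsonRDD(sc.parallelize(s)).count()).isFailure
  if (lines.size <= 1) lines
  else {
    val (left, right) = lines.splitAt(lines.size / 2)
    if (throws(left)) failingSlice(left)
    else if (throws(right)) failingSlice(right)
    else lines  // neither half fails alone: the conflict spans both halves
  }
}
```

Note the last case: if the exception comes from two rows whose inferred schemas conflict, neither half fails on its own, so the bisection stops at the smallest slice that still reproduces the error rather than a single row.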
dynamicAllocation & spark-shell
If I enable dynamicAllocation and then use spark-shell or pyspark, things start out working as expected: running simple commands causes new executors to start and complete tasks. If the shell is left idle for a while, executors start getting killed off:

    15/04/23 10:52:43 INFO cluster.YarnClientSchedulerBackend: Requesting to kill executor(s) 368
    15/04/23 10:52:43 INFO spark.ExecutorAllocationManager: Removing executor 368 because it has been idle for 600 seconds (new desired total will be 665)

That makes sense. But the action also results in error messages:

    15/04/23 10:52:47 ERROR cluster.YarnScheduler: Lost executor 368 on hostname: remote Akka client disassociated
    15/04/23 10:52:47 INFO scheduler.DAGScheduler: Executor lost: 368 (epoch 0)
    15/04/23 10:52:47 INFO spark.ExecutorAllocationManager: Existing executor 368 has been removed (new total is 665)
    15/04/23 10:52:47 INFO storage.BlockManagerMasterActor: Trying to remove executor 368 from BlockManagerMaster.
    15/04/23 10:52:47 INFO storage.BlockManagerMasterActor: Removing block manager BlockManagerId(368, hostname, 35877)
    15/04/23 10:52:47 INFO storage.BlockManagerMaster: Removed 368 successfully in removeExecutor

After that, trying to run a simple command results in:

    15/04/23 10:13:30 ERROR util.Utils: Uncaught exception in thread spark-dynamic-executor-allocation-0
    java.lang.IllegalArgumentException: Attempted to request a negative number of executor(s) -663 from the cluster manager. Please specify a positive number!

And then only the single remaining executor attempts to complete the new tasks. Am I missing some simple configuration item, is this a bug that other people are also seeing, or is this actually expected behavior?

Mike Stone
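For comparison, a minimal dynamic-allocation setup for the shell on YARN looks roughly like the following in spark-defaults.conf. The bounds here are placeholder values, not ones taken from this thread; the external shuffle service line is required so executors can be released without losing shuffle data:

```
spark.dynamicAllocation.enabled              true
spark.shuffle.service.enabled                true
spark.dynamicAllocation.minExecutors         1
spark.dynamicAllocation.maxExecutors         100
spark.dynamicAllocation.executorIdleTimeout  600
```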
Re: spark.dynamicAllocation.minExecutors
On Thu, Apr 16, 2015 at 12:16:13PM -0700, Marcelo Vanzin wrote:
> I think Michael is referring to this:
>
>     Exception in thread "main" java.lang.IllegalArgumentException: You must specify at least 1 executor!
>     Usage: org.apache.spark.deploy.yarn.Client [options]

Yes, sorry, there were too many mins and maxs and I copied the wrong line.

Mike Stone
Re: spark.dynamicAllocation.minExecutors
On Thu, Apr 16, 2015 at 08:10:54PM +0100, Sean Owen wrote:
> Yes, look what it was before -- it would also reject a minimum of 0. That's the
> case you are hitting. 0 is a fine minimum.

How can 0 be a fine minimum if it's rejected? Changing the value is easy enough, but in general it's nice for defaults to make sense.

Mike Stone
Re: spark.dynamicAllocation.minExecutors
On Thu, Apr 16, 2015 at 07:47:51PM +0100, Sean Owen wrote:
> IIRC that was fixed already in 1.3
> https://github.com/apache/spark/commit/b2047b55c5fc85de6b63276d8ab9610d2496e08b

From that commit:

    + private val minNumExecutors = conf.getInt("spark.dynamicAllocation.minExecutors", 0)
    ...
    + if (maxNumExecutors == 0) {
    +   throw new SparkException("spark.dynamicAllocation.maxExecutors cannot be 0!")
spark.dynamicAllocation.minExecutors
The default for spark.dynamicAllocation.minExecutors is 0, but that value causes a runtime error and a message that the minimum is 1. Perhaps the default should be changed to 1?

Mike Stone
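Until the default changes, the obvious workaround is to set the minimum explicitly, e.g. in spark-defaults.conf:

```
spark.dynamicAllocation.minExecutors  1
```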
Re: HDP 2.2 AM abort : Unable to find ExecutorLauncher class
I've also been having trouble running 1.3.0 on HDP. The

    spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.0.0-2041

configuration directive seems to work with pyspark, but does not propagate when using spark-shell. (That is, everything works fine with pyspark, and spark-shell fails with the "bad substitution" message.)

Mike Stone
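For what it's worth, a sketch of spark-defaults.conf entries covering both launch paths; the hdp.version value is the one from the message above, and whether the driver-side option is the piece spark-shell is missing is an assumption on my part, not something confirmed in this thread:

```
spark.driver.extraJavaOptions   -Dhdp.version=2.2.0.0-2041
spark.yarn.am.extraJavaOptions  -Dhdp.version=2.2.0.0-2041
```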