Problem with changing the spark.akka.frameSize parameter
I am trying to run a Spark application with -Dspark.executor.memory=30g -Dspark.kryoserializer.buffer.max.mb=2000 -Dspark.akka.frameSize=1, and the job fails because one or more of the Akka frames are larger than 1 MB (12000 ish). When I change -Dspark.akka.frameSize from 1 to 12000, 15000 and 2 and run:

./spark/bin/spark-submit --driver-memory 30g --executor-memory 30g mySparkCode.py

I get an error at startup:

ERROR OneForOneStrategy: Cannot instantiate transport [akka.remote.transport.netty.NettyTransport]. Make sure it extends [akka.remote.transport.Transport] and has constructor with [akka.actor.ExtendedActorSystem] and [com.typesafe.config.Config] parameters
java.lang.IllegalArgumentException: Cannot instantiate transport [akka.remote.transport.netty.NettyTransport]. Make sure it extends [akka.remote.transport.Transport] and has constructor with [akka.actor.ExtendedActorSystem] and [com.typesafe.config.Config] parameters
        at akka.remote.EndpointManager$$anonfun$8$$anonfun$3.applyOrElse(Remoting.scala:620)
        at akka.remote.EndpointManager$$anonfun$8$$anonfun$3.applyOrElse(Remoting.scala:618)
        at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
        at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:185)
        at scala.util.Try$.apply(Try.scala:161)
        at scala.util.Failure.recover(Try.scala:185)
        at akka.remote.EndpointManager$$anonfun$8.apply(Remoting.scala:618)
        at akka.remote.EndpointManager$$anonfun$8.apply(Remoting.scala:610)
        at scala.collection.TraversableLike$WithFilter$$anonfun$map$2.apply(TraversableLike.scala:722)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
        at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
        at scala.collection.TraversableLike$WithFilter.map(TraversableLike.scala:721)
        at akka.remote.EndpointManager.akka$remote$EndpointManager$$listens(Remoting.scala:610)
        at akka.remote.EndpointManager$$anonfun$receive$2.applyOrElse(Remoting.scala:450)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
        at akka.actor.ActorCell.invoke(ActorCell.scala:456)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.IllegalArgumentException: requirement failed: Setting 'maximum-frame-size' must be at least 32000 bytes
        at scala.Predef$.require(Predef.scala:233)
        at akka.util.Helpers$Requiring$.requiring$extension1(Helpers.scala:104)
        at org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:121)
        at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:54)
        at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:53)
        at org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1446)
        at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
        at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1442)
        at org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:56)
        at org.apache.spark.SparkEnv$.create(SparkEnv.scala:153)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:203)
        at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:53)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
        at py4j.Gateway.invoke(Gateway.java:214)
        at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
        at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
        at py4j.GatewayConnection.run(GatewayConnection.java:207)
        at java.lang.Thread.run(Thread.java:745)

Can anyone give me a clue about what is going wrong here? I am running Spark 1.1.0 on r3.2xlarge EC2 instances.
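A possible explanation, sketched under the assumption that Spark 1.x reads spark.akka.frameSize as a whole number of MB and converts it to a 32-bit byte count internally: values as large as 12000 or 15000 likely overflow that integer, which is what trips the "must be at least 32000 bytes" check. One way to set a larger (but safe) frame size is through SparkConf in the driver program, or equivalently --conf spark.akka.frameSize=... on spark-submit. The value 1000 below is a placeholder, not a recommendation from the thread.

from pyspark import SparkConf, SparkContext

# Frame size is given in MB; keeping it well under ~2047 means the internal
# byte count still fits in a signed 32-bit integer.
conf = (SparkConf()
        .set("spark.akka.frameSize", "1000")
        .set("spark.kryoserializer.buffer.max.mb", "2000"))
sc = SparkContext(conf=conf)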
Re: Parquet compression codecs not applied
Hi Ayoub,

You could try using the SQL interface to set the compression type:

sc = SparkContext()
sqc = SQLContext(sc)
sqc.sql("SET spark.sql.parquet.compression.codec=gzip")

You get a notification on screen while running the Spark job when you set the compression codec like this. I haven't compared it with the other compression methods; please let the mailing list know if this works for you.

Best,
Sahan
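An expanded sketch of the same suggestion, with assumptions noted: the schema, the sample rows and the output path below are placeholders, and the accepted codec values for this setting should include uncompressed, snappy, gzip and lzo.

from pyspark import SparkContext
from pyspark.sql import SQLContext, StructType, StructField, StringType, IntegerType

sc = SparkContext()
sqc = SQLContext(sc)

# Choose the Parquet codec before writing anything.
sqc.sql("SET spark.sql.parquet.compression.codec=gzip")

schema = StructType([StructField("name", StringType(), True),
                     StructField("value", IntegerType(), True)])
rows = sc.parallelize([("a", 1), ("b", 2)])
srdd = sqc.applySchema(rows, schema)

# Any SchemaRDD written afterwards should pick up the codec; placeholder path.
srdd.saveAsParquetFile("hdfs:///tmp/output_gzip.parquet")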
Error when Applying schema to a dictionary with a Tuple as key
Hi Guys,

I'm running a Spark 1.1.0 cluster on AWS EC2. I am trying to convert an RDD of tuples of the form (u'string', int, {(int, int): int, (int, int): int}) to a SchemaRDD using the schema:

fields = [StructField('field1', StringType(), True),
          StructField('field2', IntegerType(), True),
          StructField('field3',
                      MapType(StructType([StructField('field31', IntegerType(), True),
                                          StructField('field32', IntegerType(), True)]),
                              IntegerType(), True),
                      True)]
schema = StructType(fields)

# generate the schemaRDD with the defined schema
schemaRDD = sqc.applySchema(RDD, schema)

But when I add field3 to the schema, it throws an exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/spark/python/pyspark/rdd.py", line 1153, in take
    res = self.context.runJob(self, takeUpToNumLeft, p, True)
  File "/root/spark/python/pyspark/context.py", line 770, in runJob
    it = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, javaPartitions, allowLocal)
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 28.0 failed 4 times, most recent failure: Lost task 0.3 in stage 28.0 (TID 710, ip-172-31-29-120.ec2.internal): net.razorvine.pickle.PickleException: couldn't introspect javabean: java.lang.IllegalArgumentException: wrong number of arguments
        net.razorvine.pickle.Pickler.put_javabean(Pickler.java:603)
        net.razorvine.pickle.Pickler.dispatch(Pickler.java:299)
        net.razorvine.pickle.Pickler.save(Pickler.java:125)
        net.razorvine.pickle.Pickler.put_map(Pickler.java:321)
        net.razorvine.pickle.Pickler.dispatch(Pickler.java:286)
        net.razorvine.pickle.Pickler.save(Pickler.java:125)
        net.razorvine.pickle.Pickler.put_arrayOfObjects(Pickler.java:412)
        net.razorvine.pickle.Pickler.dispatch(Pickler.java:195)
        net.razorvine.pickle.Pickler.save(Pickler.java:125)
        net.razorvine.pickle.Pickler.put_arrayOfObjects(Pickler.java:412)
        net.razorvine.pickle.Pickler.dispatch(Pickler.java:195)
        net.razorvine.pickle.Pickler.save(Pickler.java:125)
        net.razorvine.pickle.Pickler.dump(Pickler.java:95)
        net.razorvine.pickle.Pickler.dumps(Pickler.java:80)
        org.apache.spark.sql.SchemaRDD$$anonfun$javaToPython$1$$anonfun$apply$2.apply(SchemaRDD.scala:417)
        org.apache.spark.sql.SchemaRDD$$anonfun$javaToPython$1$$anonfun$apply$2.apply(SchemaRDD.scala:417)
        scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:331)
        org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:209)
        org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
        org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
        org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
        org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:183)
Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
        at akka.actor.ActorCell.invoke(ActorCell.scala:456)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at
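A hedged workaround sketch, not a confirmed fix from the list: the pickler appears to choke on MapType keys that are themselves StructTypes, so one option is to flatten the (int, int) tuple key into a single string key before applying the schema. The field names and the sqc and RDD variables are the placeholders used in the question above.

from pyspark.sql import StructType, StructField, StringType, IntegerType, MapType

def flatten_keys(record):
    # (u'string', int, {(int, int): int, ...}) -> same record with "i_j" string keys
    field1, field2, field3 = record
    flat = dict(('%d_%d' % key, value) for key, value in field3.items())
    return (field1, field2, flat)

fields = [StructField('field1', StringType(), True),
          StructField('field2', IntegerType(), True),
          StructField('field3', MapType(StringType(), IntegerType(), True), True)]
schema = StructType(fields)

schemaRDD = sqc.applySchema(RDD.map(flatten_keys), schema)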
Error when mapping a schema RDD when converting lists
Hi Guys,

I used applySchema to store a set of nested dictionaries and lists in a parquet file: http://apache-spark-user-list.1001560.n3.nabble.com/Using-sparkSQL-to-convert-a-collection-of-python-dictionary-of-dictionaries-to-schma-RDD-td20228.html#a20461

That was successful and I could load the data back as well. Now I'm trying to convert this SchemaRDD to an RDD of dictionaries so that I can run some reduces on them. The schema of my RDD is as follows:

 |-- field1: string (nullable = true)
 |-- field2: integer (nullable = true)
 |-- field3: map (nullable = true)
 |    |-- key: integer
 |    |-- value: integer (valueContainsNull = true)
 |-- field4: map (nullable = true)
 |    |-- key: string
 |    |-- value: integer (valueContainsNull = true)
 |-- field5: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- field6: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- field61: string (nullable = true)
 |    |    |-- field62: string (nullable = true)
 |    |    |-- field63: integer (nullable = true)

And I'm using the following mapper to map these fields to an RDD that I can reduce later:

def generateRecords(line):
    # input : the row stored in the parquet file
    # output : a python dictionary with all the key-value pairs
    field1 = line.field1
    summary = {}
    summary['field2'] = line.field2
    summary['field3'] = line.field3
    summary['field4'] = line.field4
    summary['field5'] = line.field5
    summary['field6'] = line.field6
    return (field1, summary)

profiles = sqc.parquetFile(path)
profileRecords = profiles.map(lambda line: generateRecords(line))

This code works perfectly well when field6 is not mapped, i.e. when you comment out the line that maps field6 in generateRecords, the RDD gets generated perfectly. Even field5 gets mapped. The key difference between field5 and field6 is that field5 is a list of strings and field6 is a list of tuples in the format (String, String, Int). But when you try to map field6, it throws:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/spark/python/pyspark/rdd.py", line 847, in count
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "/root/spark/python/pyspark/rdd.py", line 838, in sum
    return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
  File "/root/spark/python/pyspark/rdd.py", line 759, in reduce
    vals = self.mapPartitions(func).collect()
  File "/root/spark/python/pyspark/rdd.py", line 723, in collect
    bytesInJava = self._jrdd.collect().iterator()
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o88.collect.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 32 in stage 3.0 failed 4 times, most recent failure: Lost task 32.3 in stage 3.0 (TID 1829, ip-172-31-18-36.ec2.internal): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/root/spark/python/pyspark/worker.py", line 79, in main
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/root/spark/python/pyspark/serializers.py", line 196, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/root/spark/python/pyspark/serializers.py", line 128, in dump_stream
    self._write_with_length(obj, stream)
  File "/root/spark/python/pyspark/serializers.py", line 138, in _write_with_length
    serialized = self.dumps(obj)
  File "/root/spark/python/pyspark/serializers.py", line 356, in dumps
    return cPickle.dumps(obj, 2)
PicklingError: Can't pickle <class 'pyspark.sql.List'>: attribute lookup pyspark.sql.List failed

Can someone help me understand what is going wrong here?

Many thanks,
SahanB
Re: Error when mapping a schema RDD when converting lists
As a temporary fix, it works when I convert field6 to a list manually. That is:

def generateRecords(line):
    # input : the row stored in the parquet file
    # output : a python dictionary with all the key-value pairs
    field1 = line.field1
    summary = {}
    summary['field2'] = line.field2
    summary['field3'] = line.field3
    summary['field4'] = line.field4
    summary['field5'] = line.field5
    summary['field6'] = list(line.field6)   # the changed line
    return (field1, summary)

profiles = sqc.parquetFile(path)
profileRecords = profiles.map(lambda line: generateRecords(line))

Works! But I am not convinced this is the best I could do :)

Cheers,
SahanB
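A hedged variant of the fix above, not from the thread: list() only unwraps the outer pyspark.sql.List, so if the struct elements inside field6 also need to be plain Python values for later pickling or reduces, they can be converted as well. This assumes the generated Row classes behave like tuples, as they do in this Spark line; path and sqc are the placeholders from the original post.

def generateRecords(line):
    # Convert the nested SQL wrapper types to plain Python containers so the
    # resulting RDD pickles cleanly in later stages.
    summary = {
        'field2': line.field2,
        'field3': dict(line.field3),
        'field4': dict(line.field4),
        'field5': list(line.field5),
        'field6': [tuple(item) for item in line.field6],  # struct rows -> tuples
    }
    return (line.field1, summary)

profileRecords = sqc.parquetFile(path).map(generateRecords)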
Re: Using sparkSQL to convert a collection of python dictionary of dictionaries to schema RDD
It worked, man. Thanks a lot :)
Re: Using sparkSQL to convert a collection of python dictionary of dictionaries to schema RDD
Hi Davies,

Thanks for the reply. The problem is that I have empty dictionaries in my field3 as well. It gives me an error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/spark/python/pyspark/sql.py", line 1042, in inferSchema
    schema = _infer_schema(first)
  File "/root/spark/python/pyspark/sql.py", line 495, in _infer_schema
    fields = [StructField(k, _infer_type(v), True) for k, v in items]
  File "/root/spark/python/pyspark/sql.py", line 460, in _infer_type
    raise ValueError("Can not infer type for empty dict")
ValueError: Can not infer type for empty dict

When I remove the empty dictionary items from each record (that is, when mapping to the main dictionary, if field3 is an empty dict I do not include it), a record converts from

{field1: 5, field2: 'string', field3: {}}

to

{field1: 5, field2: 'string'}

At this point, I get:

ERROR TaskSetManager: Task 0 in stage 14.0 failed 4 times; aborting job
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/spark/python/pyspark/sql.py", line 1044, in inferSchema
    return self.applySchema(rdd, schema)
  File "/root/spark/python/pyspark/sql.py", line 1117, in applySchema
    rows = rdd.take(10)
  File "/root/spark/python/pyspark/rdd.py", line 1153, in take
    res = self.context.runJob(self, takeUpToNumLeft, p, True)
  File "/root/spark/python/pyspark/context.py", line 770, in runJob
    it = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, javaPartitions, allowLocal)
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 14.0 failed 4 times, most recent failure: Lost task 0.3 in stage 14.0 (TID 22628, ip-172-31-30-89.ec2.internal): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/root/spark/python/pyspark/worker.py", line 79, in main
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/root/spark/python/pyspark/serializers.py", line 196, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/root/spark/python/pyspark/serializers.py", line 127, in dump_stream
    for obj in iterator:
  File "/root/spark/python/pyspark/serializers.py", line 185, in _batched
    for item in iterator:
  File "/root/spark/python/pyspark/rdd.py", line 1148, in takeUpToNumLeft
    yield next(iterator)
  File "/root/spark/python/pyspark/sql.py", line 552, in _drop_schema
    yield converter(i)
  File "/root/spark/python/pyspark/sql.py", line 540, in nested_conv
    return tuple(f(v) for f, v in zip(convs, conv(row)))
  File "/root/spark/python/pyspark/sql.py", line 540, in <genexpr>
    return tuple(f(v) for f, v in zip(convs, conv(row)))
  File "/root/spark/python/pyspark/sql.py", line 508, in <lambda>
    return lambda row: dict((k, conv(v)) for k, v in row.iteritems())
AttributeError: 'int' object has no attribute 'iteritems'

I am clueless about what to do about this. Hope you can help :)

Many thanks,
SahanB
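A hedged sketch, not Davies' actual reply: dropping the field3 key changes the shape of those records, which would explain why the converter then finds an int where it expects a map. Keeping every record in the same shape (an empty map stays {}) and declaring the schema explicitly instead of inferring it sidesteps both errors; see the applySchema sketch under the original question below. Field names mirror the example records in this thread, and RDD1 is the variable from the original post.

def normalize(record):
    # Keep the field3 key even when the map is empty so every record matches
    # the declared schema.
    return {'field1': record['field1'],
            'field2': record['field2'],
            'field3': record.get('field3') or {}}

normalized = RDD1.map(normalize)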
Using sparkSQL to convert a collection of python dictionary of dictionaries to schema RDD
Hi Guys,

I am trying to use SparkSQL to convert an RDD to a SchemaRDD so that I can save it in parquet format. A record in my RDD (RDD1) has the following format:

{field1: 5, field2: 'string', field3: {'a': 1, 'c': 2}}

I am using field3 to represent a sparse vector; it can have keys 'a', 'b' or 'c', each with any int value. The current approach I am using is:

schemaRDD1 = sqc.jsonRDD(RDD1.map(lambda x: simplejson.dumps(x)))

But when I do this, the dictionary in field3 also gets converted to a SparkSQL Row. This turns field3 into a dense data structure that holds the value None for every key that is not present in field3 for that record. When I do

test = RDD1.map(lambda x: simplejson.dumps(x))
test.first()

my output is:

{"field1": 5, "field2": "string", "field3": {"a": 1, "c": 2}}

But then when I do

schemaRDD = sqc.jsonRDD(test)
schemaRDD.first()

my output is:

Row(field1=5, field2='string', field3=Row(a=1, b=None, c=2))

In reality, I have thousands of possible keys in field3 and only 2 to 3 of them occur per record, so when it converts to a Row it generates thousands of None fields per record. Is there any way for me to store field3 as a dictionary instead of converting it into a Row in the SchemaRDD?
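A hedged sketch of one way to keep field3 as a map rather than a Row, assuming Spark 1.1's Python API: skip jsonRDD and apply an explicit schema in which field3 is a MapType, so only the keys actually present in each record are stored. RDD1 and sqc are the variables from the question; the conversion of each record to a tuple, and the parquet path, are illustrative assumptions.

from pyspark.sql import StructType, StructField, StringType, IntegerType, MapType

schema = StructType([
    StructField('field1', IntegerType(), True),
    StructField('field2', StringType(), True),
    StructField('field3', MapType(StringType(), IntegerType(), True), True),
])

# applySchema takes rows as tuples in schema order; the sparse dict is kept as-is.
rows = RDD1.map(lambda r: (r['field1'], r['field2'], r.get('field3', {})))
schemaRDD1 = sqc.applySchema(rows, schema)
schemaRDD1.saveAsParquetFile(parquetPath)  # parquetPath is a placeholder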
Using a compression codec in saveAsSequenceFile in Pyspark (Python API)
Hi,

I am trying to save an RDD to an S3 bucket using the RDD.saveAsSequenceFile(path, compressionCodecClass) function in Python, and I need the output to be gzip-compressed. Can anyone tell me how to pass the gzip codec class as a parameter to this function? I tried

RDD.saveAsSequenceFile('{0}{1}'.format(outputFolder, datePath), compressionCodecClass=gzip.GzipFile)

but it hits me with:

AttributeError: type object 'GzipFile' has no attribute '_get_object_id'

which I think is because the JVM can't find a Scala/Java mapping for Python's gzip class. If you can suggest any method to write the RDD to disk as gzip (.gz), that would be very much appreciated.

Many thanks,
SahanB
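A hedged sketch of how this is usually resolved, not a confirmed answer from the list: compressionCodecClass expects the fully qualified name of a Hadoop codec class as a plain string, not a Python class, so for gzip output the Hadoop GzipCodec can be named directly. outputFolder and datePath are the placeholders from the question.

# Pass the Hadoop codec by its fully qualified class name; the JVM resolves
# the class on the executor side when writing the sequence file.
codec = "org.apache.hadoop.io.compress.GzipCodec"
RDD.saveAsSequenceFile('{0}{1}'.format(outputFolder, datePath),
                       compressionCodecClass=codec)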