Problem with changing the spark.akka.frameSize parameter
I am trying to run a Spark application with -Dspark.executor.memory=30g -Dspark.kryoserializer.buffer.max.mb=2000 -Dspark.akka.frameSize=1, and the job fails because one or more of the Akka frames are larger than 1 MB (12000 ish). When I change -Dspark.akka.frameSize from 1 to 12000, 15000 and 2 and run:

./spark/bin/spark-submit --driver-memory 30g --executor-memory 30g mySparkCode.py

I get an error at startup:

ERROR OneForOneStrategy: Cannot instantiate transport [akka.remote.transport.netty.NettyTransport]. Make sure it extends [akka.remote.transport.Transport] and has constructor with [akka.actor.ExtendedActorSystem] and [com.typesafe.config.Config] parameters
java.lang.IllegalArgumentException: Cannot instantiate transport [akka.remote.transport.netty.NettyTransport]. Make sure it extends [akka.remote.transport.Transport] and has constructor with [akka.actor.ExtendedActorSystem] and [com.typesafe.config.Config] parameters
        at akka.remote.EndpointManager$$anonfun$8$$anonfun$3.applyOrElse(Remoting.scala:620)
        at akka.remote.EndpointManager$$anonfun$8$$anonfun$3.applyOrElse(Remoting.scala:618)
        at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
        at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:185)
        at scala.util.Try$.apply(Try.scala:161)
        at scala.util.Failure.recover(Try.scala:185)
        at akka.remote.EndpointManager$$anonfun$8.apply(Remoting.scala:618)
        at akka.remote.EndpointManager$$anonfun$8.apply(Remoting.scala:610)
        at scala.collection.TraversableLike$WithFilter$$anonfun$map$2.apply(TraversableLike.scala:722)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
        at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
        at scala.collection.TraversableLike$WithFilter.map(TraversableLike.scala:721)
        at akka.remote.EndpointManager.akka$remote$EndpointManager$$listens(Remoting.scala:610)
        at akka.remote.EndpointManager$$anonfun$receive$2.applyOrElse(Remoting.scala:450)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
        at akka.actor.ActorCell.invoke(ActorCell.scala:456)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.IllegalArgumentException: requirement failed: Setting 'maximum-frame-size' must be at least 32000 bytes
        at scala.Predef$.require(Predef.scala:233)
        at akka.util.Helpers$Requiring$.requiring$extension1(Helpers.scala:104)
        at org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:121)
        at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:54)
        at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:53)
        at org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1446)
        at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
        at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1442)
        at org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:56)
        at org.apache.spark.SparkEnv$.create(SparkEnv.scala:153)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:203)
        at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:53)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
        at py4j.Gateway.invoke(Gateway.java:214)
        at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
        at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
        at py4j.GatewayConnection.run(GatewayConnection.java:207)
        at java.lang.Thread.run(Thread.java:745)

Can anyone give me a clue about what is going wrong here? I am running Spark 1.1.0 on r3.2xlarge EC2 instances.
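A possible explanation, sketched under the assumption that Spark 1.x reads spark.akka.frameSize as a whole number of MB and converts it to a 32-bit byte count internally: values as large as 12000 or 15000 likely overflow that integer, which is what trips the "must be at least 32000 bytes" check. One way to set a larger (but safe) frame size is through SparkConf in the driver program, or equivalently --conf spark.akka.frameSize=... on spark-submit. The value 1000 below is a placeholder, not a recommendation from the thread.

from pyspark import SparkConf, SparkContext

# Frame size is given in MB; keeping it well under ~2047 means the internal
# byte count still fits in a signed 32-bit integer.
conf = (SparkConf()
        .set("spark.akka.frameSize", "1000")
        .set("spark.kryoserializer.buffer.max.mb", "2000"))
sc = SparkContext(conf=conf)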
Re: Parquet compression codecs not applied
Hi Ayoub,

You could try using the SQL interface to set the compression type:

sc = SparkContext()
sqc = SQLContext(sc)
sqc.sql("SET spark.sql.parquet.compression.codec=gzip")

You get a notification on screen while running the Spark job when you set the compression codec like this. I haven't compared it with the other compression methods; please let the mailing list know if this works for you.

Best,
Sahan
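An expanded sketch of the same suggestion, with assumptions noted: the schema, the sample rows and the output path below are placeholders, and the accepted codec values for this setting should include uncompressed, snappy, gzip and lzo.

from pyspark import SparkContext
from pyspark.sql import SQLContext, StructType, StructField, StringType, IntegerType

sc = SparkContext()
sqc = SQLContext(sc)

# Choose the Parquet codec before writing anything.
sqc.sql("SET spark.sql.parquet.compression.codec=gzip")

schema = StructType([StructField("name", StringType(), True),
                     StructField("value", IntegerType(), True)])
rows = sc.parallelize([("a", 1), ("b", 2)])
srdd = sqc.applySchema(rows, schema)

# Any SchemaRDD written afterwards should pick up the codec; placeholder path.
srdd.saveAsParquetFile("hdfs:///tmp/output_gzip.parquet")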
Error when Applying schema to a dictionary with a Tuple as key
Hi Guys,

I'm running a Spark 1.1.0 cluster on AWS EC2. I am trying to convert an RDD of tuples of the form (u'string', int, {(int, int): int, (int, int): int}) to a SchemaRDD using the schema:

fields = [StructField('field1', StringType(), True),
          StructField('field2', IntegerType(), True),
          StructField('field3',
                      MapType(StructType([StructField('field31', IntegerType(), True),
                                          StructField('field32', IntegerType(), True)]),
                              IntegerType(), True),
                      True)]
schema = StructType(fields)

# generate the schemaRDD with the defined schema
schemaRDD = sqc.applySchema(RDD, schema)

But when I add field3 to the schema, it throws an exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/spark/python/pyspark/rdd.py", line 1153, in take
    res = self.context.runJob(self, takeUpToNumLeft, p, True)
  File "/root/spark/python/pyspark/context.py", line 770, in runJob
    it = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, javaPartitions, allowLocal)
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 28.0 failed 4 times, most recent failure: Lost task 0.3 in stage 28.0 (TID 710, ip-172-31-29-120.ec2.internal): net.razorvine.pickle.PickleException: couldn't introspect javabean: java.lang.IllegalArgumentException: wrong number of arguments
        net.razorvine.pickle.Pickler.put_javabean(Pickler.java:603)
        net.razorvine.pickle.Pickler.dispatch(Pickler.java:299)
        net.razorvine.pickle.Pickler.save(Pickler.java:125)
        net.razorvine.pickle.Pickler.put_map(Pickler.java:321)
        net.razorvine.pickle.Pickler.dispatch(Pickler.java:286)
        net.razorvine.pickle.Pickler.save(Pickler.java:125)
        net.razorvine.pickle.Pickler.put_arrayOfObjects(Pickler.java:412)
        net.razorvine.pickle.Pickler.dispatch(Pickler.java:195)
        net.razorvine.pickle.Pickler.save(Pickler.java:125)
        net.razorvine.pickle.Pickler.put_arrayOfObjects(Pickler.java:412)
        net.razorvine.pickle.Pickler.dispatch(Pickler.java:195)
        net.razorvine.pickle.Pickler.save(Pickler.java:125)
        net.razorvine.pickle.Pickler.dump(Pickler.java:95)
        net.razorvine.pickle.Pickler.dumps(Pickler.java:80)
        org.apache.spark.sql.SchemaRDD$$anonfun$javaToPython$1$$anonfun$apply$2.apply(SchemaRDD.scala:417)
        org.apache.spark.sql.SchemaRDD$$anonfun$javaToPython$1$$anonfun$apply$2.apply(SchemaRDD.scala:417)
        scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:331)
        org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:209)
        org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
        org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
        org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
        org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:183)
Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
        at akka.actor.ActorCell.invoke(ActorCell.scala:456)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at
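A hedged workaround sketch, not a confirmed fix from the list: the pickler appears to choke on MapType keys that are themselves StructTypes, so one option is to flatten the (int, int) tuple key into a single string key before applying the schema. The field names and the sqc and RDD variables are the placeholders used in the question above.

from pyspark.sql import StructType, StructField, StringType, IntegerType, MapType

def flatten_keys(record):
    # (u'string', int, {(int, int): int, ...}) -> same record with "i_j" string keys
    field1, field2, field3 = record
    flat = dict(('%d_%d' % key, value) for key, value in field3.items())
    return (field1, field2, flat)

fields = [StructField('field1', StringType(), True),
          StructField('field2', IntegerType(), True),
          StructField('field3', MapType(StringType(), IntegerType(), True), True)]
schema = StructType(fields)

schemaRDD = sqc.applySchema(RDD.map(flatten_keys), schema)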
Error when mapping a schema RDD when converting lists
Hi Guys,

I used applySchema to store a set of nested dictionaries and lists in a parquet file: http://apache-spark-user-list.1001560.n3.nabble.com/Using-sparkSQL-to-convert-a-collection-of-python-dictionary-of-dictionaries-to-schma-RDD-td20228.html#a20461

That was successful and I could load the data back as well. Now I'm trying to convert this SchemaRDD to an RDD of dictionaries so that I can run some reduces on them. The schema of my RDD is as follows:

 |-- field1: string (nullable = true)
 |-- field2: integer (nullable = true)
 |-- field3: map (nullable = true)
 |    |-- key: integer
 |    |-- value: integer (valueContainsNull = true)
 |-- field4: map (nullable = true)
 |    |-- key: string
 |    |-- value: integer (valueContainsNull = true)
 |-- field5: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- field6: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- field61: string (nullable = true)
 |    |    |-- field62: string (nullable = true)
 |    |    |-- field63: integer (nullable = true)

And I'm using the following mapper to map these fields to an RDD that I can reduce later:

def generateRecords(line):
    # input : the row stored in the parquet file
    # output : a python dictionary with all the key-value pairs
    field1 = line.field1
    summary = {}
    summary['field2'] = line.field2
    summary['field3'] = line.field3
    summary['field4'] = line.field4
    summary['field5'] = line.field5
    summary['field6'] = line.field6
    return (field1, summary)

profiles = sqc.parquetFile(path)
profileRecords = profiles.map(lambda line: generateRecords(line))

This code works perfectly well when field6 is not mapped, i.e. when you comment out the line that maps field6 in generateRecords, the RDD gets generated perfectly. Even field5 gets mapped. The key difference between field5 and field6 is that field5 is a list of strings and field6 is a list of tuples in the format (String, String, Int). But when you try to map field6, it throws:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/spark/python/pyspark/rdd.py", line 847, in count
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "/root/spark/python/pyspark/rdd.py", line 838, in sum
    return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
  File "/root/spark/python/pyspark/rdd.py", line 759, in reduce
    vals = self.mapPartitions(func).collect()
  File "/root/spark/python/pyspark/rdd.py", line 723, in collect
    bytesInJava = self._jrdd.collect().iterator()
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o88.collect.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 32 in stage 3.0 failed 4 times, most recent failure: Lost task 32.3 in stage 3.0 (TID 1829, ip-172-31-18-36.ec2.internal): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/root/spark/python/pyspark/worker.py", line 79, in main
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/root/spark/python/pyspark/serializers.py", line 196, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/root/spark/python/pyspark/serializers.py", line 128, in dump_stream
    self._write_with_length(obj, stream)
  File "/root/spark/python/pyspark/serializers.py", line 138, in _write_with_length
    serialized = self.dumps(obj)
  File "/root/spark/python/pyspark/serializers.py", line 356, in dumps
    return cPickle.dumps(obj, 2)
PicklingError: Can't pickle <class 'pyspark.sql.List'>: attribute lookup pyspark.sql.List failed

Can someone help me understand what is going wrong here?

Many thanks,
SahanB
Re: Error when mapping a schema RDD when converting lists
As a temporary fix, it works when I convert field6 to a list manually. That is:

def generateRecords(line):
    # input : the row stored in the parquet file
    # output : a python dictionary with all the key-value pairs
    field1 = line.field1
    summary = {}
    summary['field2'] = line.field2
    summary['field3'] = line.field3
    summary['field4'] = line.field4
    summary['field5'] = line.field5
    summary['field6'] = list(line.field6)   # the changed line
    return (field1, summary)

profiles = sqc.parquetFile(path)
profileRecords = profiles.map(lambda line: generateRecords(line))

Works! But I am not convinced this is the best I could do :)

Cheers,
SahanB
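A hedged variant of the fix above, not from the thread: list() only unwraps the outer pyspark.sql.List, so if the struct elements inside field6 also need to be plain Python values for later pickling or reduces, they can be converted as well. This assumes the generated Row classes behave like tuples, as they do in this Spark line; path and sqc are the placeholders from the original post.

def generateRecords(line):
    # Convert the nested SQL wrapper types to plain Python containers so the
    # resulting RDD pickles cleanly in later stages.
    summary = {
        'field2': line.field2,
        'field3': dict(line.field3),
        'field4': dict(line.field4),
        'field5': list(line.field5),
        'field6': [tuple(item) for item in line.field6],  # struct rows -> tuples
    }
    return (line.field1, summary)

profileRecords = sqc.parquetFile(path).map(generateRecords)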
Re: Using sparkSQL to convert a collection of python dictionary of dictionaries to schema RDD
It worked, man. Thanks a lot :)
Re: Using sparkSQL to convert a collection of python dictionary of dictionaries to schema RDD
Hi Davies,

Thanks for the reply. The problem is that I have empty dictionaries in my field3 as well. It gives me an error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/spark/python/pyspark/sql.py", line 1042, in inferSchema
    schema = _infer_schema(first)
  File "/root/spark/python/pyspark/sql.py", line 495, in _infer_schema
    fields = [StructField(k, _infer_type(v), True) for k, v in items]
  File "/root/spark/python/pyspark/sql.py", line 460, in _infer_type
    raise ValueError("Can not infer type for empty dict")
ValueError: Can not infer type for empty dict

When I remove the empty dictionary items from each record (that is, when mapping to the main dictionary, if field3 is an empty dict I do not include it), a record converts from

{field1: 5, field2: 'string', field3: {}}

to

{field1: 5, field2: 'string'}

At this point, I get:

ERROR TaskSetManager: Task 0 in stage 14.0 failed 4 times; aborting job
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/spark/python/pyspark/sql.py", line 1044, in inferSchema
    return self.applySchema(rdd, schema)
  File "/root/spark/python/pyspark/sql.py", line 1117, in applySchema
    rows = rdd.take(10)
  File "/root/spark/python/pyspark/rdd.py", line 1153, in take
    res = self.context.runJob(self, takeUpToNumLeft, p, True)
  File "/root/spark/python/pyspark/context.py", line 770, in runJob
    it = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, javaPartitions, allowLocal)
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 14.0 failed 4 times, most recent failure: Lost task 0.3 in stage 14.0 (TID 22628, ip-172-31-30-89.ec2.internal): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/root/spark/python/pyspark/worker.py", line 79, in main
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/root/spark/python/pyspark/serializers.py", line 196, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/root/spark/python/pyspark/serializers.py", line 127, in dump_stream
    for obj in iterator:
  File "/root/spark/python/pyspark/serializers.py", line 185, in _batched
    for item in iterator:
  File "/root/spark/python/pyspark/rdd.py", line 1148, in takeUpToNumLeft
    yield next(iterator)
  File "/root/spark/python/pyspark/sql.py", line 552, in _drop_schema
    yield converter(i)
  File "/root/spark/python/pyspark/sql.py", line 540, in nested_conv
    return tuple(f(v) for f, v in zip(convs, conv(row)))
  File "/root/spark/python/pyspark/sql.py", line 540, in <genexpr>
    return tuple(f(v) for f, v in zip(convs, conv(row)))
  File "/root/spark/python/pyspark/sql.py", line 508, in <lambda>
    return lambda row: dict((k, conv(v)) for k, v in row.iteritems())
AttributeError: 'int' object has no attribute 'iteritems'

I am clueless about what to do about this. Hope you can help :)

Many thanks,
SahanB
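A hedged sketch, not Davies' actual reply: dropping the field3 key changes the shape of those records, which would explain why the converter then finds an int where it expects a map. Keeping every record in the same shape (an empty map stays {}) and declaring the schema explicitly instead of inferring it sidesteps both errors; see the applySchema sketch under the original question below. Field names mirror the example records in this thread, and RDD1 is the variable from the original post.

def normalize(record):
    # Keep the field3 key even when the map is empty so every record matches
    # the declared schema.
    return {'field1': record['field1'],
            'field2': record['field2'],
            'field3': record.get('field3') or {}}

normalized = RDD1.map(normalize)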
Using sparkSQL to convert a collection of python dictionary of dictionaries to schema RDD
Hi Guys,

I am trying to use SparkSQL to convert an RDD to a SchemaRDD so that I can save it in parquet format. A record in my RDD (RDD1) has the following format:

{field1: 5, field2: 'string', field3: {'a': 1, 'c': 2}}

I am using field3 to represent a sparse vector; it can have keys 'a', 'b' or 'c', each with any int value. The current approach I am using is:

schemaRDD1 = sqc.jsonRDD(RDD1.map(lambda x: simplejson.dumps(x)))

But when I do this, the dictionary in field3 also gets converted to a SparkSQL Row. This turns field3 into a dense data structure that holds the value None for every key that is not present in field3 for that record. When I do

test = RDD1.map(lambda x: simplejson.dumps(x))
test.first()

my output is:

{"field1": 5, "field2": "string", "field3": {"a": 1, "c": 2}}

But then when I do

schemaRDD = sqc.jsonRDD(test)
schemaRDD.first()

my output is:

Row(field1=5, field2='string', field3=Row(a=1, b=None, c=2))

In reality, I have thousands of possible keys in field3 and only 2 to 3 of them occur per record, so when it converts to a Row it generates thousands of None fields per record. Is there any way for me to store field3 as a dictionary instead of converting it into a Row in the SchemaRDD?
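A hedged sketch of one way to keep field3 as a map rather than a Row, assuming Spark 1.1's Python API: skip jsonRDD and apply an explicit schema in which field3 is a MapType, so only the keys actually present in each record are stored. RDD1 and sqc are the variables from the question; the conversion of each record to a tuple, and the parquet path, are illustrative assumptions.

from pyspark.sql import StructType, StructField, StringType, IntegerType, MapType

schema = StructType([
    StructField('field1', IntegerType(), True),
    StructField('field2', StringType(), True),
    StructField('field3', MapType(StringType(), IntegerType(), True), True),
])

# applySchema takes rows as tuples in schema order; the sparse dict is kept as-is.
rows = RDD1.map(lambda r: (r['field1'], r['field2'], r.get('field3', {})))
schemaRDD1 = sqc.applySchema(rows, schema)
schemaRDD1.saveAsParquetFile(parquetPath)  # parquetPath is a placeholder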
Using a compression codec in saveAsSequenceFile in Pyspark (Python API)
Hi,

I am trying to save an RDD to an S3 bucket using the RDD.saveAsSequenceFile(path, compressionCodecClass) function in Python, and I need the output to be gzip-compressed. Can anyone tell me how to pass the gzip codec class as a parameter to this function? I tried

RDD.saveAsSequenceFile('{0}{1}'.format(outputFolder, datePath), compressionCodecClass=gzip.GzipFile)

but it hits me with:

AttributeError: type object 'GzipFile' has no attribute '_get_object_id'

which I think is because the JVM can't find a Scala/Java mapping for Python's gzip class. If you can suggest any method to write the RDD to disk as gzip (.gz), that would be very much appreciated.

Many thanks,
SahanB
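A hedged sketch of how this is usually resolved, not a confirmed answer from the list: compressionCodecClass expects the fully qualified name of a Hadoop codec class as a plain string, not a Python class, so for gzip output the Hadoop GzipCodec can be named directly. outputFolder and datePath are the placeholders from the question.

# Pass the Hadoop codec by its fully qualified class name; the JVM resolves
# the class on the executor side when writing the sequence file.
codec = "org.apache.hadoop.io.compress.GzipCodec"
RDD.saveAsSequenceFile('{0}{1}'.format(outputFolder, datePath),
                       compressionCodecClass=codec)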