interested to know what needs to be in
the AMI if I wanted to build one from scratch (last resort :( ).
And isn't it time to have a ticket in the Spark project to build a new suite
of AMIs for the EC2 script? https://issues.apache.org/jira/browse/SPARK-922
Many thanks
in4maniac
Can someone please explain to me how I could
achieve my objective?
Thanks in advance!
in4maniac
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/listening-to-recursive-folder-structures-in-s3-using-pyspark-streaming-textFileStream-tp26247.html
As far as I know, that is not possible. If the file is too big to load onto one
node, what I would do instead is use an RDD.map() function to load the
file into distributed memory and then filter the lines that are relevant to
me.
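
A minimal sketch of that idea, under some assumptions: it uses sc.textFile()
followed by RDD.filter() rather than RDD.map(), which is one reading of the
suggestion above and not necessarily what was meant. The names is_relevant,
relevant_lines and keyword are hypothetical, and sc is assumed to be an
already-created SparkContext.

```python
def is_relevant(line, keyword):
    """Return True when the line contains the keyword we care about."""
    return keyword in line


def relevant_lines(sc, path, keyword):
    """Lazily build an RDD holding only the matching lines of `path`.

    `sc` is assumed to be an existing pyspark SparkContext. For splittable
    inputs, textFile() partitions the file across the cluster, so no single
    node has to hold all of it; filter() then drops the unwanted lines.
    """
    return sc.textFile(path).filter(lambda line: is_relevant(line, keyword))
```

Nothing is read until an action runs on the result, for example calling
.take(10) on the RDD returned by relevant_lines(sc, some_path, "ERROR"),
where some_path is a placeholder for the real file location.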
I am not sure how to just read part of a single file. Sorry I'm
Hi guys, I realised that a bug in my code was what caused it to
break: I was running the filter on a SchemaRDD when I was supposed to be
running it on an RDD.
But I still don't understand why the stderr was about an S3 request rather than
a type-checking error such as No tuple position
Hi Guys,
I think this problem is related to:
http://apache-spark-user-list.1001560.n3.nabble.com/AWS-Credentials-for-private-S3-reads-td8689.html
I am running PySpark 1.2.1 on AWS with my AWS credentials exported to the master
node as environment variables.
Halfway through my application, I
Hi V,
I am assuming that each of the three .parquet paths you mentioned has
multiple partitions in it.
For example: [/dataset/city=London/data.parquet/part-r-0.parquet,
/dataset/city=London/data.parquet/part-r-1.parquet]
I haven't personally used this with hdfs, but I've worked with a similar