New Amazon AMIs for EC2 script

2017-02-23 Thread in4maniac
interested to know what things need to be in the AMI if I wanted to build an AMI from scratch (last resort :( ). And isn't it time to have a ticket in the Spark project to build a new suite of AMIs for the EC2 script? https://issues.apache.org/jira/browse/SPARK-922 Many thanks, in4maniac

listening to recursive folder structures in s3 using pyspark streaming (textFileStream)

2016-02-17 Thread in4maniac
. Can someone please explain to me how I could achieve my objective? Thanks in advance!!! in4maniac -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/listening-to-recursive-folder-structures-in-s3-using-pyspark-streaming-textFileStream-tp26247.html
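
For context, a minimal sketch of the workaround usually suggested for this situation: the DStream file source (textFileStream) watches a single directory and does not descend into sub-folders, so one stream can be created per known prefix and the streams unioned. The bucket name and prefixes below are hypothetical placeholders, not taken from the original thread.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="s3-folder-listener")
ssc = StreamingContext(sc, 60)  # 60-second batches

# One stream per known prefix, then union them into a single DStream.
# Bucket and prefixes are hypothetical placeholders.
prefixes = [
    "s3n://my-bucket/events/2016/01/",
    "s3n://my-bucket/events/2016/02/",
]
streams = [ssc.textFileStream(p) for p in prefixes]
lines = ssc.union(*streams)
lines.pprint()

ssc.start()
ssc.awaitTermination()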

Re: Loading file content based on offsets into the memory

2015-05-10 Thread in4maniac
As far as I know, that is not possible. If the file is too big to load onto one node, what I would do is use an RDD.map() function instead to load the file into distributed memory and then filter the lines that are relevant to me. I am not sure how to read just part of a single file. Sorry I'm
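
A minimal sketch of the approach described above, assuming a line-oriented file on S3: load the whole file as a distributed RDD of lines and filter, rather than seeking to byte offsets on a single node. The path and the "ERROR" predicate are hypothetical placeholders.

from pyspark import SparkContext

sc = SparkContext(appName="offset-filter")

# Load the file into distributed memory as an RDD of lines, then keep
# only the relevant ones.  Path and predicate are hypothetical.
lines = sc.textFile("s3n://my-bucket/big-file.log")
relevant = lines.filter(lambda line: "ERROR" in line)
print(relevant.take(10))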

Re: AWS-Credentials fails with org.apache.hadoop.fs.s3.S3Exception: FORBIDDEN

2015-05-08 Thread in4maniac
Hi guys... I realised that it was a bug in my code that caused it to break. I was running the filter on a SchemaRDD when I was supposed to be running it on an RDD. But I still don't understand why the stderr was about an S3 request rather than a type-checking error such as No tuple position
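
A small illustration of the distinction in question, assuming Spark 1.2-era PySpark: a SchemaRDD yields Row objects rather than plain tuples, so a position-style filter written for an RDD of tuples only makes sense after mapping the rows out. The path and field names below are hypothetical.

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="schemardd-vs-rdd")
sqlContext = SQLContext(sc)

# parquetFile returns a SchemaRDD of Row objects (Spark 1.2).
srdd = sqlContext.parquetFile("s3n://my-bucket/data.parquet")

# Map the rows to plain tuples first, then filter by position.
# Field names (id, score) are hypothetical.
rdd = srdd.map(lambda row: (row.id, row.score))
filtered = rdd.filter(lambda t: t[1] > 0)
print(filtered.take(5))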

AWS-Credentials fails with org.apache.hadoop.fs.s3.S3Exception: FORBIDDEN

2015-05-07 Thread in4maniac
Hi guys, I think this problem is related to: http://apache-spark-user-list.1001560.n3.nabble.com/AWS-Credentials-for-private-S3-reads-td8689.html I am running PySpark 1.2.1 in AWS with my AWS credentials exported to the master node as environment variables. Halfway through my application, I
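
A hedged sketch of one commonly suggested alternative for this era of Spark (1.2.x with the s3n:// scheme): set the credentials on the Hadoop configuration so every executor sees them, instead of relying on environment variables exported only on the master. The key names are the standard Hadoop s3n properties; the placeholder values and bucket path are hypothetical.

from pyspark import SparkContext

sc = SparkContext(appName="s3-read")

# Standard Hadoop s3n credential properties; values are placeholders.
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3n.awsAccessKeyId", "<ACCESS_KEY>")
hadoop_conf.set("fs.s3n.awsSecretAccessKey", "<SECRET_KEY>")

# Hypothetical private bucket path.
rdd = sc.textFile("s3n://my-bucket/private/part-*")
print(rdd.count())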

Re: Spark 1.3.1 and Parquet Partitions

2015-05-07 Thread in4maniac
Hi V, I am assuming that each of the three .parquet paths you mentioned has multiple partitions in it. For example: [/dataset/city=London/data.parquet/part-r-0.parquet, /dataset/city=London/data.parquet/part-r-1.parquet]. I haven't personally used this with HDFS, but I've worked with a similar
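
A minimal sketch of reading such a layout in Spark 1.3: pointing parquetFile at the directory (not an individual part file) picks up every part-r-*.parquet inside it, and several directories can be passed in one call. The hdfs:/// paths mirror the hypothetical layout from the thread, and the second city is made up for illustration.

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="parquet-partitions")
sqlContext = SQLContext(sc)

# Reading one directory reads all of its part files.
london = sqlContext.parquetFile("hdfs:///dataset/city=London/data.parquet")
print(london.count())

# parquetFile accepts multiple paths in Spark 1.3.
both = sqlContext.parquetFile(
    "hdfs:///dataset/city=London/data.parquet",
    "hdfs:///dataset/city=Paris/data.parquet",
)
print(both.count())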