Re: Get S3 Parquet File

2017-02-23 Thread Femi Anthony
Have you tried reading with s3n, which is a slightly older protocol? I'm not sure how compatible s3a is with older versions of Spark. Femi On Fri, Feb 24, 2017 at 2:18 AM, Benjamin Kim wrote: > Hi Gourav, > > My answers are below. > > Cheers, > Ben > > > On Feb 23, 2017,
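For what it's worth, switching to s3n amounts to changing the URL scheme and the credential property names, which differ from s3a's; a quick sketch with placeholder keys and paths:

    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "<access-key>")
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "<secret-key>")
    val df = sqlContext.read.parquet("s3n://<bucket>/<path>/file.parquet")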

Re: Get S3 Parquet File

2017-02-23 Thread Benjamin Kim
Hi Gourav, My answers are below. Cheers, Ben > On Feb 23, 2017, at 10:57 PM, Gourav Sengupta > wrote: > > Can I ask where are you running your CDH? Is it on premise or have you > created a cluster for yourself in AWS? Our cluster is on premise in our data >

Fwd: Duplicate Rank for within same partitions

2017-02-23 Thread Dana Ram Meghwal
-- Forwarded message -- From: Dana Ram Meghwal Date: Thu, Feb 23, 2017 at 10:40 PM Subject: Duplicate Rank for within same partitions To: user-h...@spark.apache.org Hey guys, I am new to Spark. I am trying to write a Spark script which involves finding rank
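The message is truncated here, but for the common case of ranking rows within partitions, a minimal Spark SQL sketch (assuming a DataFrame df with hypothetical columns "group" and "score") might look like:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, rank, row_number}

    // Rank rows within each group, highest score first.
    val w = Window.partitionBy("group").orderBy(col("score").desc)

    // rank() gives ties the same rank (hence "duplicate" ranks);
    // row_number() gives a unique, gapless sequence instead.
    val ranked = df.withColumn("rank", rank().over(w))
                   .withColumn("row_num", row_number().over(w))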

Re: Get S3 Parquet File

2017-02-23 Thread Gourav Sengupta
Can I ask where are you running your CDH? Is it on premise or have you created a cluster for yourself in AWS? Also, I have never really seen s3a used before; that was used long ago, when writing S3 files took a long time, but I think that you are reading it. Any ideas why you are not

Apache Spark MLIB

2017-02-23 Thread Mina Aslani
Hi, I am going to start working on anomaly detection using Spark MLlib. Please note that I have not used Spark so far. I would like to read some data, and if a user logged in from an IP address that is not common, consider it an anomaly, similar to what Apple/Google do. My preferred
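The message is cut off, but the IP-frequency idea described above can be sketched with plain DataFrames before reaching for MLlib; the DataFrame name, columns, and threshold below are all assumptions:

    import org.apache.spark.sql.functions.col

    // logins: DataFrame with assumed columns "user" and "ip".
    val ipCounts = logins.groupBy("user", "ip").count()

    // Flag user/IP pairs seen fewer than, say, 3 times as potential anomalies.
    val anomalies = ipCounts.filter(col("count") < 3)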

Spark: Continuously reading data from Cassandra

2017-02-23 Thread Tech Id
Hi, Can anyone help with http://stackoverflow.com/questions/42428080/spark-continuously-reading-data-from-cassandra ? Thanks TI

Spark executor on Docker runs as root

2017-02-23 Thread Ji Yan
Dear spark users, When running Spark on Docker, the spark executors by default always run as root. Is there a way to change this to other users? Thanks Ji

Re: Get S3 Parquet File

2017-02-23 Thread Benjamin Kim
Aakash, Here is a code snippet for the keys.

    val accessKey = "---"
    val secretKey = "---"
    val hadoopConf = sc.hadoopConfiguration
    hadoopConf.set("fs.s3a.access.key", accessKey)
    hadoopConf.set("fs.s3a.secret.key", secretKey)
    hadoopConf.set("spark.hadoop.fs.s3a.access.key", accessKey)

Re: Get S3 Parquet File

2017-02-23 Thread Aakash Basu
Hey, please recheck the access key and secret key being used to fetch the Parquet file. It seems to be a credential error: either a mismatch or a loading problem. If it is a loading problem, first use the keys directly in code and see if the issue resolves; afterwards they can be hidden and read from input params. Thanks, Aakash. On

Get S3 Parquet File

2017-02-23 Thread Benjamin Kim
We are trying to use Spark 1.6 within CDH 5.7.1 to retrieve a 1.3GB Parquet file from AWS S3. We can read the schema and show some data when the file is loaded into a DataFrame, but when we try to do some operations, such as count, we get this error below.
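The error itself is truncated in this digest, but for context, a minimal sketch of the kind of read being attempted (paths and keys are placeholders) would be:

    // Configure the s3a connector on the existing SparkContext.
    sc.hadoopConfiguration.set("fs.s3a.access.key", "<access-key>")
    sc.hadoopConfiguration.set("fs.s3a.secret.key", "<secret-key>")

    val df = sqlContext.read.parquet("s3a://<bucket>/<path>/file.parquet")
    df.printSchema()   // reading the schema works
    df.show(5)         // showing some data works
    df.count()         // a full scan such as count() is where the error appears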

Re: Spark SQL : Join operation failure

2017-02-23 Thread neil90
It might be a memory issue. Try adding .persist(StorageLevel.MEMORY_AND_DISK) so that if the RDD can't fit into memory, Spark will spill parts of it to disk.

    cm_go.registerTempTable("x")
    ko.registerTempTable("y")
    joined_df = sqlCtx.sql("select * from x FULL OUTER JOIN y ON field1=field2")

Re: New Amazon AMIs for EC2 script

2017-02-23 Thread neil90
You should look into AWS EMR instead, adding pip install steps to the launch process. They have a pretty nice Jupyter notebook script that sets up Jupyter and lets you choose which packages you want to install -

Re: Practical configuration to run LSH in Spark 2.1.0

2017-02-23 Thread nguyen duc Tuan
I do a self-join. I tried to cache the transformed dataset before joining, but it didn't help either. 2017-02-23 13:25 GMT+07:00 Nick Pentreath : > And to be clear, are you doing a self-join for approx similarity? Or > joining to another dataset? > > > > On Thu, 23 Feb
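For reference, a self-join with the Spark 2.1 LSH API, caching the transformed dataset first as described above, would look roughly like this (the estimator choice, parameters, and column names are assumptions):

    import org.apache.spark.ml.feature.BucketedRandomProjectionLSH

    // Assumed: df has a vector column named "features".
    val lsh = new BucketedRandomProjectionLSH()
      .setInputCol("features")
      .setOutputCol("hashes")
      .setBucketLength(2.0)
      .setNumHashTables(3)

    val model = lsh.fit(df)
    val transformed = model.transform(df).cache()

    // Approximate similarity self-join within a distance threshold.
    val pairs = model.approxSimilarityJoin(transformed, transformed, 1.0)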

Re: New Amazon AMIs for EC2 script

2017-02-23 Thread Nicholas Chammas
spark-ec2 has moved to GitHub and is no longer part of the Spark project. A related issue from the current issue tracker that you may want to follow/comment on is this one: https://github.com/amplab/spark-ec2/issues/74 As I said there, I think requiring custom AMIs is one of the major maintenance

Re: Structured Streaming: How to handle bad input

2017-02-23 Thread Sam Elamin
Hi Jayesh, so you have two problems here: 1) data was loaded in the wrong format; 2) once you have handled the wrong data, the Spark job will continually retry the failed batch. For 2, it's very easy to go into the checkpoint directory, delete that offset manually, and make it seem like it never happened.

Structured Streaming: How to handle bad input

2017-02-23 Thread JayeshLalwani
What is a good way to make a Structured Streaming application deal with bad input? Right now, the problem is that bad input kills the Structured Streaming application. This is highly undesirable, because a Structured Streaming application has to be always on. For example, here is a very simple
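The example in the original message is truncated, but one common pattern for keeping a streaming query alive on bad input is to parse defensively and route unparseable records aside instead of letting them throw. A sketch against a Kafka source, where the session (spark), topic, and schema are all assumptions:

    import org.apache.spark.sql.functions.{col, from_json}
    import org.apache.spark.sql.types._

    val schema = new StructType().add("id", LongType).add("value", StringType)

    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host:9092")
      .option("subscribe", "events")
      .load()

    // from_json yields null for records that do not match the schema,
    // so bad input can be filtered out rather than failing the batch.
    val parsed = raw.selectExpr("CAST(value AS STRING) AS json")
      .select(from_json(col("json"), schema).as("data"), col("json"))

    val good = parsed.filter(col("data").isNotNull).select("data.*")
    val bad  = parsed.filter(col("data").isNull)  // e.g. send to a dead-letter sink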

[Spark Streaming] Batch versus streaming

2017-02-23 Thread Charles O. Bajomo
Hello, I am reading data from a JMS queue and I need to prevent any data loss, so I have a custom Java receiver that only acks messages once they have been stored. Sometimes my program crashes because I can't control the flow rate from the queue; it overwhelms the job and I end up losing
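The message breaks off here, but for the flow-control half of the question, Spark Streaming exposes rate-limiting settings that keep a receiver from being overwhelmed; a minimal sketch (the rate value is arbitrary):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("jms-receiver")
      // Let Spark adapt the ingestion rate to the processing rate.
      .set("spark.streaming.backpressure.enabled", "true")
      // Hard cap on records per second per receiver, as a safety net.
      .set("spark.streaming.receiver.maxRate", "1000")

    val ssc = new StreamingContext(conf, Seconds(5))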

Re: unsubscribe

2017-02-23 Thread Ganesh
Thank you for cat facts. "A group of cats is called a clowder" MEEOW To unsubscribe please enter your credit card details followed by your pin. CAT-FACTS On 24/02/17 00:04, Donam Kim wrote: catunsub 2017-02-23 20:28 GMT+11:00 Ganesh Krishnan

Re: quick question: best to use cluster mode or client mode for production?

2017-02-23 Thread Sam Elamin
I personally use spark-submit as it's agnostic to which platform your Spark clusters are running on, e.g. EMR, Dataproc, Databricks, etc. On Thu, 23 Feb 2017 at 08:53, nancy henry wrote: > Hi Team, > > I have a set of hc.sql("hivequery") kind of scripts which I am running

New Amazon AMIs for EC2 script

2017-02-23 Thread in4maniac
Hi all, I have been using the EC2 script to launch R/pyspark clusters for a while now. We use a lot of packages such as numpy and scipy with OpenBLAS, scikit-learn, bokeh, vowpal wabbit, pystan, etc. All this time, we have been building AMIs on top of the standard Spark AMIs at

Spark join over sorted columns of dataset.

2017-02-23 Thread Rohit Verma
Hi, while joining two columns of different datasets, how can the join be optimized if both columns are pre-sorted within their datasets, so that when Spark does a sort-merge join the sorting phase can be skipped? Regards, Rohit
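One way to approach this in Spark 2.x is to write both sides as bucketed, sorted tables so the sort-merge join can reuse the layout; a hedged sketch, where the table names, bucket count, and key column are made up and the actual sort elimination depends on the Spark version:

    // Write both sides bucketed and sorted on the join key (requires saveAsTable).
    dfA.write.bucketBy(64, "key").sortBy("key").saveAsTable("a_bucketed")
    dfB.write.bucketBy(64, "key").sortBy("key").saveAsTable("b_bucketed")

    val joined = spark.table("a_bucketed")
      .join(spark.table("b_bucketed"), "key")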

Scala functions for dataframes

2017-02-23 Thread Advait Mohan Raut
Hi Team, I am using Scala Spark DataFrames for data operations over CSV files. There is common transformation code being used by multiple process flows, hence I wish to create Scala functions for it [with def fn_name()]. All process flows will use the functionality implemented inside these
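The message is truncated, but a common shape for such shared transformations is a function from DataFrame to DataFrame, which composes cleanly via Dataset.transform; a small sketch in which the column names and steps are invented:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, trim}

    // Each reusable step is a DataFrame => DataFrame function.
    def cleanNames(df: DataFrame): DataFrame =
      df.withColumn("name", trim(col("name")))

    def dropBadRows(df: DataFrame): DataFrame =
      df.na.drop(Seq("id", "name"))

    // Process flows can then chain the shared steps:
    val result = rawDf.transform(cleanNames).transform(dropBadRows)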

Support for decimal separator (comma or period) in spark 2.1

2017-02-23 Thread Arkadiusz Bicz
Hi Team, I would like to know if it is possible to specify decimal localization for the DataFrameReader for CSV. I have CSV files from a locale where the decimal separator is a comma, like 0,32, instead of the US way, like 0.32. Is there a way in the current version of Spark to provide the localization:
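As far as I know, the Spark 2.1 CSV reader has no locale option for this, so one workaround is to read the column as a string and normalize the separator before casting; a sketch with assumed file and column names:

    import org.apache.spark.sql.functions.{col, regexp_replace}

    val raw = spark.read
      .option("header", "true")
      .csv("data.csv")   // columns come in as strings

    // Replace the decimal comma with a period, then cast to double.
    val fixed = raw.withColumn(
      "amount",
      regexp_replace(col("amount"), ",", ".").cast("double"))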

Re: unsubscribe

2017-02-23 Thread Ganesh Krishnan
Thank you for subscribing to "cat facts". Did you know that a cat's whiskers are used to determine if it can wiggle through a hole? To unsubscribe reply with keyword "catunsub" Thank you On Feb 23, 2017 8:25 PM, "Donam Kim" wrote: > unsubscribe >

unsubscribe

2017-02-23 Thread Donam Kim
unsubscribe

quick question: best to use cluster mode or client mode for production?

2017-02-23 Thread nancy henry
Hi Team, I have a set of hc.sql("hivequery") kind of scripts which I am running right now in spark-shell. How should I schedule them in production: keep it as spark-shell -i script.scala, or package it in a jar file through Eclipse and use spark-submit with deploy mode cluster? Which is advisable?
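For the jar option, the spark-shell script body moves into a main object that creates its own contexts (spark-shell provides them automatically) and is then launched with spark-submit in cluster mode; a minimal Spark 1.x-style sketch in which the object name and query are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    object HiveQueryJob {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("HiveQueryJob"))
        val hc = new HiveContext(sc)

        hc.sql("hivequery").show()  // placeholder for the real queries

        sc.stop()
      }
    }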