Have you tried reading using s3n, which is a slightly older protocol? I'm
not sure how compatible s3a is with older versions of Spark.
Femi
Hi Gourav,
My answers are below.
Cheers,
Ben
> On Feb 23, 2017, at 10:57 PM, Gourav Sengupta wrote:
>
> Can I ask where you are running your CDH? Is it on premise, or have you
> created a cluster for yourself in AWS?

Our cluster is on premises in our data center.
-- Forwarded message --
From: Dana Ram Meghwal
Date: Thu, Feb 23, 2017 at 10:40 PM
Subject: Duplicate rank within the same partition
To: user-h...@spark.apache.org
Hey guys,
I am new to Spark. I am trying to write a Spark script which involves
finding rank
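(The script itself is truncated above. As a hedged illustration of ranking
within partitions, with hypothetical names `df`, `group_id`, and `score`:)

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{rank, desc}

// rank rows within each partition of a DataFrame
val w = Window.partitionBy("group_id").orderBy(desc("score"))
val ranked = df.withColumn("rank", rank().over(w))
// note: rank() gives tied rows the same ("duplicate") rank;
// row_number() would give each row a unique position instead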
Can I ask where you are running your CDH? Is it on premise, or have you
created a cluster for yourself in AWS?
Also, I have really never seen s3a used before; that was used way back when
writing s3 files took a long time, but I think that you are reading it.
Any ideas why you are not
Hi,
I am going to start working on anomaly detection using Spark MLlib. Please
note that I have not used Spark so far.
I would like to read some data, and if a user logged in from an IP
address which is not common for them, consider it an anomaly, similar to
what Apple/Google do.
My preferred
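(Not from the original message: a minimal hedged sketch of the flagging
idea, assuming a `logins` DataFrame with hypothetical `user` and `ip`
columns:)

import org.apache.spark.sql.functions.col

// count how often each user has logged in from each IP
val ipCounts = logins.groupBy("user", "ip").count()
// a login from an IP the user has (almost) never used is a candidate anomaly
val anomalies = logins.join(ipCounts, Seq("user", "ip"))
  .where(col("count") <= 1)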
Hi,
Can anyone help with
http://stackoverflow.com/questions/42428080/spark-continuously-reading-data-from-cassandra
?
Thanks
TI
Dear Spark users,
When running Spark on Docker, the Spark executors by default always run as
root. Is there a way to change this to another user?
Thanks
Ji
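(Not from the original thread: one hedged approach is to bake a non-root
user into the executor image; the image and user names below are
hypothetical.)

# hedged Dockerfile sketch: run the container process as a non-root user
FROM your-spark-base-image
RUN useradd --create-home spark
USER spark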
Aakash,
Here is a code snippet for the keys.
val accessKey = "---" // redacted
val secretKey = "---" // redacted
val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3a.access.key", accessKey)
hadoopConf.set("fs.s3a.secret.key", secretKey)
// note: "spark.hadoop."-prefixed keys are normally set on SparkConf, which
// copies them into the Hadoop configuration; setting one here is redundant
hadoopConf.set("spark.hadoop.fs.s3a.access.key", accessKey)
Hey,
Please recheck the access key and secret key being used to fetch the
Parquet file. It seems to be a credential error: either a mismatch or a
failure to load them. If it's a load issue, first use the keys directly in
the code and see whether that resolves it; then they can be hidden and read
from input params.
Thanks,
Aakash.
On
We are trying to use Spark 1.6 within CDH 5.7.1 to retrieve a 1.3GB Parquet
file from AWS S3. We can read the schema and show some data when the file is
loaded into a DataFrame, but when we try to do some operations, such as count,
we get the error below.
It might be a memory issue. Try adding .persist(StorageLevel.MEMORY_AND_DISK)
so that if the RDD can't fit into memory, Spark will spill parts of it to disk.
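(A minimal hedged sketch of that suggestion; the bucket path is
hypothetical:)

import org.apache.spark.storage.StorageLevel

val df = sqlContext.read.parquet("s3a://some-bucket/some-file.parquet")
df.persist(StorageLevel.MEMORY_AND_DISK) // spill to disk if memory is tight
df.count()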
cm_go.registerTempTable("x")
ko.registerTempTable("y")
joined_df = sqlCtx.sql("select * from x FULL OUTER JOIN y ON x.field1 = y.field2")
You should look into AWS EMR instead, adding pip install steps to the
launch process. They have a pretty nice Jupyter notebook script that sets
up Jupyter and lets you choose which packages you want to install -
I do a self-join. I tried to cache the transformed dataset before joining,
but that didn't help either.
2017-02-23 13:25 GMT+07:00 Nick Pentreath :
> And to be clear, are you doing a self-join for approx similarity? Or
> joining to another dataset?
>
>
>
> On Thu, 23 Feb
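(Not from the original thread: a hedged sketch of an approx-similarity
self-join using Spark ML's LSH API, here MinHashLSH as one of the 2.1
implementations; column names and the 0.6 threshold are hypothetical, and
"features" is assumed to be a Vector column:)

import org.apache.spark.ml.feature.MinHashLSH

val mh = new MinHashLSH()
  .setNumHashTables(3)
  .setInputCol("features")
  .setOutputCol("hashes")

val model = mh.fit(df)
val hashed = model.transform(df).cache() // cache before the self-join
val pairs = model.approxSimilarityJoin(hashed, hashed, 0.6)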
spark-ec2 has moved to GitHub and is no longer part of the Spark project. A
related issue from the current issue tracker that you may want to
follow/comment on is this one: https://github.com/amplab/spark-ec2/issues/74
As I said there, I think requiring custom AMIs is one of the major
maintenance
Hi Jayesh,
So you have two problems here:
1) Data was loaded in the wrong format.
2) Once you have handled the bad data, the Spark job will continually retry
the failed batch.
For 2, it's very easy to go into the checkpoint directory, delete that
offset manually, and make it seem like it never happened.
What is a good way to make a Structured Streaming application deal with bad
input? Right now, the problem is that bad input kills the Structured
Streaming application. This is highly undesirable, because a Structured
Streaming application has to be always on.
For example, here is a very simple
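(The example above is truncated. Separately, a hedged sketch of one way to
keep malformed JSON from killing the query; the schema and socket source
are hypothetical:)

import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{LongType, StringType, StructType}
import spark.implicits._

// parse leniently and drop unparseable rows instead of failing the query
val schema = new StructType().add("id", LongType).add("value", StringType)
val raw = spark.readStream.format("socket")
  .option("host", "localhost").option("port", 9999).load()
val parsed = raw
  .select(from_json($"value", schema).as("data")) // null for malformed records
  .where($"data".isNotNull)
  .select("data.*")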
Hello,
I am reading data from a JMS queue and I need to prevent any data loss, so I
have a custom Java receiver that only acks messages once they have been
stored. Sometimes my program crashes because I can't control the flow rate
from the queue; it overwhelms the job and I end up losing
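(Not from the original message: hedged, but Spark Streaming's rate-limit
and backpressure settings are one common way to stop a receiver from being
overwhelmed; the rate value is hypothetical:)

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.streaming.receiver.maxRate", "1000")     // max records/sec per receiver
  .set("spark.streaming.backpressure.enabled", "true") // adapt rate to processing speed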
Thank you for cat facts.
"A group of cats is called a clowder"
MEEOW
To unsubscribe please enter your credit card details followed by your pin.
CAT-FACTS
On 24/02/17 00:04, Donam Kim wrote:
catunsub
2017-02-23 20:28 GMT+11:00 Ganesh Krishnan
I personally use spark-submit as it's agnostic to which platform your Spark
clusters are running on, e.g. EMR, Dataproc, Databricks, etc.
On Thu, 23 Feb 2017 at 08:53, nancy henry wrote:
> Hi Team,
>
> I have a set of hc.sql("hivequery")-style scripts which I am running
Hi all,
I have been using the EC2 script to launch R/pyspark clusters for a while
now. As we use a lot of packages such as numpy and scipy with OpenBLAS,
scikit-learn, bokeh, vowpal wabbit, pystan, etc., all this time we have
been building AMIs on top of the standard Spark AMIs at
Hi,
When joining two columns of different datasets, how can the join be
optimized if both columns are pre-sorted within their datasets, so that
when Spark does a sort-merge join, the sorting phase can be skipped?
Regards,
Rohit
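(Not from the original message: a hedged sketch of one Spark 2.x approach,
bucketing and sorting on the join key at write time so a later sort-merge
join can skip the shuffle and sort; table, column, and bucket-count names
are hypothetical:)

dfA.write.bucketBy(8, "key").sortBy("key").saveAsTable("a_bucketed")
dfB.write.bucketBy(8, "key").sortBy("key").saveAsTable("b_bucketed")
val joined = spark.table("a_bucketed").join(spark.table("b_bucketed"), "key")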
Hi Team,
I am using Scala Spark DataFrames for data operations over CSV files.
There is common transformation code used by multiple process flows,
so I wish to create Scala functions for it [with def fn_name()].
All process flows will use the functionality implemented inside these
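(Not from the original message: a minimal hedged sketch of the pattern; the
function name and body are hypothetical:)

import org.apache.spark.sql.DataFrame

// a reusable transformation as a plain Scala function
def withCleanedNames(df: DataFrame): DataFrame =
  df.columns.foldLeft(df)((d, c) => d.withColumnRenamed(c, c.trim.toLowerCase))

// each flow can then apply it, e.g. via Dataset.transform:
// val out = input.transform(withCleanedNames)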
Hi Team,
I would like to know if it is possible to specify decimal localization for
DataFrameReader for CSV?
I have CSV files from a locale where the decimal separator is a comma, like
0,32, instead of the US way, like 0.32.
Is there a way in the current version of Spark to specify the localization:
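(Not from the original message: I'm not aware of a locale option in the CSV
reader at this version; a hedged workaround is to read the column as a
string and normalize it. The column and file names are hypothetical:)

import org.apache.spark.sql.functions.{regexp_replace, col}

val df = spark.read.option("header", "true").csv("data.csv")
// swap the comma decimal separator for a dot, then cast to double
val fixed = df.withColumn("value",
  regexp_replace(col("value"), ",", ".").cast("double"))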
Thank you for subscribing to "cat facts"
Did you know that a cat's whiskers are used to determine if it can wiggle
through a hole?
To unsubscribe reply with keyword "catunsub"
Thank you
On Feb 23, 2017 8:25 PM, "Donam Kim" wrote:
> unsubscribe
>
unsubscribe
Hi Team,
I have a set of hc.sql("hivequery")-style scripts which I am running right
now in spark-shell.
How should I schedule them in production:
by making it spark-shell -i script.scala,
or by keeping it in a jar file built through Eclipse and using spark-submit
with deploy mode cluster?
Which is advisable?
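(Not from the original message: for production, the jar + spark-submit
route usually looks roughly like this hedged sketch; the class name, jar
name, and YARN master are hypothetical:)

spark-submit \
  --class com.example.HiveQueries \
  --master yarn \
  --deploy-mode cluster \
  hive-queries.jar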