in this endeavor.
Warm regards,
Jon Rodríguez Aranguren.
On Sat, Sep 30, 2023 at 11:19 PM Jayabindu Singh ()
wrote:
> Hi Jon,
>
> Using IAM as suggested by Jorn is the best approach.
> We recently moved our Spark workload from HDP to Spark on K8s and are
> utilizing IAM.
> It will sa
the
experiences and knowledge present within this community.
Warm regards,
Jon
ing-guide.html#broadcast-variables>
that
would allow you to put flagged IP ranges into an array and make that
available on every node. Then you can use filters to detect users who've
logged in from a flagged IP range.
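A minimal sketch of that pattern (the HDFS path and the logins RDD of
(user, ip) pairs are hypothetical; assumes a SparkContext named sc):

    // Load the flagged ranges (here, IP prefixes) onto the driver, then
    // broadcast the local Set so every node gets one read-only copy.
    val flaggedPrefixes = sc.textFile("hdfs:///security/flagged_ranges.csv")
      .collect()
      .toSet
    val flaggedBC = sc.broadcast(flaggedPrefixes)

    // Filter on every node using the broadcast copy.
    val suspiciousLogins = logins.filter { case (_, ip) =>
      flaggedBC.value.exists(prefix => ip.startsWith(prefix))
    }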
Jon Gregg
On Thu, Feb 23, 2017 at 9:19 PM, Mina Aslani <aslanim...@gmail.c
A fix might be as simple as switching to the direct approach
<https://spark.apache.org/docs/2.1.0/streaming-kafka-0-8-integration.html#approach-2-direct-approach-no-receivers>
?
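For reference, a minimal sketch of the Kafka 0.8 direct API (broker
addresses, app name, and topic are placeholders):

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(
      new SparkConf().setAppName("direct-demo"), Seconds(10))
    // No receivers: each batch reads its offset range straight from the brokers.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("mytopic"))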
Jon Gregg
On Wed, Feb 22, 2017 at 12:37 AM, satishl <satish.la...@gmail.com> wrote:
> I am reading fro
ories, and then use Spark SQL to
query that Hive table. There might be a cleaner way to do this in Spark
2.0+ but that is a common pattern for me in Spark 1.6 when I know the
directory structure but don't have "=" signs in the paths.
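A rough sketch of that pattern, assuming a Hive-enabled sqlContext; the
table name, schema, and paths here are made up:

    // Map plain date-named directories onto partitions explicitly,
    // since the paths carry no key=value segments for auto-discovery.
    sqlContext.sql(
      "CREATE EXTERNAL TABLE IF NOT EXISTS logs (line STRING) " +
      "PARTITIONED BY (dt STRING)")
    sqlContext.sql(
      "ALTER TABLE logs ADD IF NOT EXISTS PARTITION (dt='2017-02-01') " +
      "LOCATION '/data/logs/2017-02-01'")
    val counts = sqlContext.sql("SELECT dt, count(*) FROM logs GROUP BY dt")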
Jon Gregg
On Fri, Feb 17, 2017 at 7:02 PM, 颜
It depends how you salt it. See slide 40 and onwards from a spark summit
talk here: http://www.slideshare.net/cloudera/top-5-mistakes-to-avoid-when-writing-apache-spark-applications
The speakers use a mod8 integer salt appended to the end of the key; the
salt that works best for
you might be
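A minimal sketch of the mod-8 salting idea (pairs is a hypothetical
RDD[(String, Long)] with one very hot key):

    import scala.util.Random

    // Append a mod-8 salt so the hot key spreads across 8 reduce tasks.
    val salted = pairs.map { case (k, v) => (s"${k}_${Random.nextInt(8)}", v) }
    val partial = salted.reduceByKey(_ + _)
    // Strip the salt and reduce once more for the true per-key totals.
    val totals = partial
      .map { case (sk, v) => (sk.substring(0, sk.lastIndexOf('_')), v) }
      .reduceByKey(_ + _)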
of your cluster size,
and make it so users can manually ask for more if they need to. It doesn't
take a whole lot of workers/memory to build most of your Spark code off a
sample.
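For example (fullRdd is a placeholder for your input):

    // Develop against a 1% sample; the fixed seed keeps runs reproducible.
    val dev = fullRdd.sample(withReplacement = false, fraction = 0.01, seed = 42L)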
Jon
On Wed, Feb 15, 2017 at 6:41 AM, Sachin Aggarwal <different.sac...@gmail.com
> wrote:
> Hi,
>
> I am t
Spark has a zipWithIndex function for RDDs (
http://stackoverflow.com/a/26081548) that adds an index column right after
you create an RDD, and I believe it preserves order. Then you can sort it
by the index after the cache step.
I haven't tried this with a Dataframe but this answer seems
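A sketch of the RDD approach (rdd is a placeholder for your RDD):

    // zipWithIndex pairs each element with its position in the RDD's order.
    val indexed = rdd.zipWithIndex().map { case (value, idx) => (idx, value) }
    indexed.cache()
    // ...transformations that may scramble partition order...
    val restored = indexed.sortByKey().values  // original order back via the index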
If it's possible to
filter your tables down before the join (keeping just the rows/columns you
need), that may be a better solution.
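For example, with hypothetical users and events DataFrames:

    // Prune rows and columns before the join so far less data shuffles.
    val usSmall = users
      .filter(users("country") === "US")
      .select("user_id", "name")
    val joined = events.join(usSmall, "user_id")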
Jon
On Mon, Feb 13, 2017 at 5:27 AM, nancy henry <nancyhenry6...@gmail.com>
wrote:
> Hi All,,
>
> I am getting below error while I am trying to join 3
some processing to create and I
might want to use rdd6 for more analyses in the future.
Jon
On Wed, Feb 8, 2017 at 1:40 AM, Jörn Franke <jornfra...@gmail.com> wrote:
> Depends on the use case, but a persist before checkpointing can make sense
> after some of the map steps.
>
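A sketch of that ordering (rdd5 and expensiveStep are placeholders):

    import org.apache.spark.storage.StorageLevel

    sc.setCheckpointDir("hdfs:///tmp/checkpoints")
    val rdd6 = rdd5.map(expensiveStep)
    rdd6.persist(StorageLevel.MEMORY_AND_DISK)  // cache first...
    rdd6.checkpoint()  // ...so the checkpoint job doesn't recompute the lineage
    rdd6.count()       // first action materializes both cache and checkpoint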
workaround that might
work as well:
http://michaelryanbell.com/processing-whole-files-spark-s3.html
Jon
On Mon, Feb 6, 2017 at 6:38 PM, Paul Tremblay <paulhtremb...@gmail.com>
wrote:
> I've actually been able to trace the problem to the files being read in.
> If I change to a differe
Confirming that Spark can read newly created views - I just created a test
view in HDFS and I was able to query it in Spark 1.5 immediately after
without a refresh. Possibly an issue with your Spark-Hive connection?
Jon
On Sun, Feb 5, 2017 at 9:31 PM, KhajaAsmath Mohammed <
mdkhaja
I don't use MapR but I use pyspark with jupyter, and this MapR blogpost
looks similar to what I do to setup:
https://community.mapr.com/docs/DOC-1874-how-to-use-jupyter-pyspark-on-mapr
On Thu, Jan 5, 2017 at 3:05 AM, neil90 wrote:
> Assuming you don't have your
Spark is written in Scala, so yes it's still the strongest option. You
also get the Dataset type with Scala (compile time type-safety), and that's
not an available feature with Python.
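For example (a sketch; use spark.implicits._ on 2.x, sqlContext.implicits._
on 1.6):

    case class Person(name: String, age: Int)
    import sqlContext.implicits._
    val ds = Seq(Person("Ann", 34), Person("Bob", 29)).toDS()
    val adults = ds.filter(_.age >= 30)  // lambda checked at compile time
    // ds.filter(_.aeg >= 30) would fail to compile,
    // unlike a misspelled string expression against a DataFrame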
That said, I think the Python API is a viable candidate if you use Pandas
for Data Science. There are
In these cases it might help to just flatten the DataFrame. Here's a
helper function from the tutorial (scroll down to the "Flattening" header):
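I don't have the tutorial's exact code handy, but the idea looks roughly
like this (a sketch of the same pattern, not the tutorial's helper):

    import org.apache.spark.sql.{Column, DataFrame}
    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types.StructType

    // Turn nested struct fields like a.b.c into top-level columns named a_b_c.
    def flattenSchema(schema: StructType, prefix: String = null): Array[Column] =
      schema.fields.flatMap { f =>
        val name = if (prefix == null) f.name else s"$prefix.${f.name}"
        f.dataType match {
          case st: StructType => flattenSchema(st, name)
          case _              => Array(col(name).alias(name.replace(".", "_")))
        }
      }

    def flatten(df: DataFrame): DataFrame = df.select(flattenSchema(df.schema): _*)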
Making a guess here: you need to add s3:ListBucket?
http://stackoverflow.com/questions/35803808/spark-saveastextfile-to-s3-fails
On Thu, Nov 17, 2016 at 2:11 PM, Jain, Nishit
wrote:
> When I read a specific file it works:
>
> val filePath=
Since you're completely new to Kafka, I would start with the Kafka docs (
https://kafka.apache.org/documentation). You should be able to get through
the Getting Started part easily and there are some examples for setting up
a basic Kafka server.
You don't need Kafka to start working with Spark
Piggybacking off this - how are you guys teaching DataFrames and Datasets
to new users? I haven't taken the edx courses but I don't see Spark SQL
covered heavily in the syllabus. I've dug through the Databricks
documentation but it's a lot of information for a new user, I think - hoping
there is
Cool, learn something new every day. Thanks again.
On Tue, Aug 9, 2016 at 4:08 PM ayan guha <guha.a...@gmail.com> wrote:
> Thanks for reporting back. Glad it worked for you. Actually, sum with
> partitioning behaviour is the same in Oracle too.
> On 10 Aug 2016 03:01, "Jon Ba
:)
Thank you both for your help,
Jon
On Tue, Aug 9, 2016 at 3:01 AM Santoshakhilesh <santosh.akhil...@huawei.com>
wrote:
> You could check following link.
>
>
> http://stackoverflow.com/questions/35154267/how-to-compute-cumulative-sum-using-spark
>
>
>
> *From:* Jo
Does that make sense? Naturally, if ordering a sum turns it into a
cumulative sum, I'll gladly use that :)
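For the record, the DataFrame form looks roughly like this (df, key, ts,
and amount are placeholders):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.sum

    // The ORDER BY inside the window turns a grouped sum into a running sum.
    val w = Window.partitionBy("key").orderBy("ts")
    val result = df.withColumn("cum_sum", sum(df("amount")).over(w))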
Jon
On Mon, Aug 8, 2016 at 4:55 PM ayan guha <guha.a...@gmail.com> wrote:
> You mean you are not able to use sum(col) over (partition by key order by
> some_col) ?
>
> On Tue, Aug
time reading through the code, so I don't know how easy or hard that would
be.
TL;DR: What's the best way to write a function that returns a value for every
row, but has mutable state, and gets rows in a specific order?
Does anyone have any ideas, or examples?
Thanks,
Jon
I'm running into an issue with a pyspark job where I'm sometimes seeing
extremely variable job times (20min to 2hr) and very long shuffle times
(e.g. ~2 minutes for 18KB/86 records).
Cluster set up is Amazon EMR 4.4.0, Spark 1.6.0, an m4.2xl driver and a
single m4.10xlarge (40 vCPU, 160GB)
That link points to hadoop2.6.tgz. I tried changing the URL to
https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.7.tgz
and I get a NoSuchKey error.
Should I just go with it even though it says hadoop2.6?
On Sat, Apr 16, 2016 at 5:37 PM, Ted Yu wrote:
https://issues.apache.org/jira/browse/YARN-4714 for further info
/ Jon
/ jokja
Venlig hilsen/Best regards
Jon Kjær Amundsen
Information Architect & Product Owner
Phone: +45 7023 9080
Direct: +45 8882 1331
E-mail: j...@udbudsvagten.dk
Web: www.udbudsvagten.dk
Nitivej 10 | DK - 2000 Frederiks
I'm working with a Hadoop distribution that doesn't support 1.5 yet, we'll
be able to upgrade in probably two months. For now I'm seeing the same
issue with spark not recognizing an existing column name in many
hive-table-to-dataframe situations:
Py4JJavaError: An error occurred while calling
Here's my code:
my_data = sqlCtx.sql("SELECT * FROM raw.site_activity_data LIMIT 2")
my_data.collect()
raw.site_activity_data is a Hive external table atop daily-partitioned
.gzip data. When I execute the command I start seeing many of these pop up
in the logs (below is a small subset)
1.3 on cdh 5.4.4 ... I'll take the responses to mean that the fix will
probably be a few months away for us. Not a huge problem but something I've
run into a number of times.
On Tue, Oct 20, 2015 at 3:01 PM, Yin Huai wrote:
> btw, what version of Spark did you use?
>
> On
I'm seeing a similar (same?) problem on Spark 1.4.1 running on Yarn (Amazon
EMR, Java 8). I'm running a Spark Streaming app 24/7 and system memory
eventually gets exhausted after about 3 days and the JVM process dies with:
#
# There is insufficient memory for the Java Runtime Environment to
We did have 2.7 on the driver, 2.6 on the edge nodes and figured that was
the issue, so we've tried many combinations since then with all three of
2.6.6, 2.7.5, and Anaconda's 2.7.10 on each node with different PATHs and
PYTHONPATHs each time. Every combination has produced the same error.
We
I should note that the amount of data in each batch is very small, so I'm
not concerned with performance implications of grouping into a single RDD.
On Wed, Jul 15, 2015 at 9:58 PM, Jon Chase jon.ch...@gmail.com wrote:
I'm currently doing something like this in my Spark Streaming program
(Java
I'm currently doing something like this in my Spark Streaming program
(Java):
dStream.foreachRDD((rdd, batchTime) -> {
    log.info("processing RDD from batch {}", batchTime);
    // my rdd processing code
});
Instead of having my
individual queries against the large fact table and union
the results. Does this sound like a worthwhile approach?
Thank you,
Jon
tables? Is this even feasible? Do table broadcasts wind up in
the heap or in dedicated storage space?
Thanks for your help,
Jon
Zsolt - what version of Java are you running?
On Mon, Mar 30, 2015 at 7:12 AM, Zsolt Tóth toth.zsolt@gmail.com
wrote:
Thanks for your answer!
I don't call .collect to trigger the execution. I call it
because I need the RDD on the driver. This is not a huge RDD and it's not
(), which always assumes the
underlying SQL array is represented by a Scala Seq. Would you mind opening
a JIRA ticket for this? Thanks!
Cheng
On 3/27/15 7:00 PM, Jon Chase wrote:
Spark 1.3.0
Two issues:
a) I'm unable to get a lateral view explode query to work on an array
type
b) I'm
Spark 1.3.0
Two issues:
a) I'm unable to get a lateral view explode query to work on an array type
b) I'm unable to save an array type to a Parquet file
I keep running into this:
java.lang.ClassCastException: [I cannot be cast to
scala.collection.Seq
Here's a stack trace from the
you mind also providing the full stack
trace of the exception thrown in the saveAsParquetFile call? Thanks!
Cheng
On 3/27/15 7:35 PM, Jon Chase wrote:
https://issues.apache.org/jira/browse/SPARK-6570
I also left in the call to saveAsParquetFile(), as it produced a similar
exception
library
for your platform... using builtin-java classes where applicable
INFO org.apache.spark.SecurityManager Changing view acls to: jon
INFO org.apache.spark.SecurityManager Changing modify acls to: jon
INFO org.apache.spark.SecurityManager SecurityManager: authentication
disabled; ui acls
I've filed this as https://issues.apache.org/jira/browse/SPARK-6554
On Thu, Mar 26, 2015 at 6:29 AM, Jon Chase jon.ch...@gmail.com wrote:
Spark 1.3.0, Parquet
I'm having trouble referencing partition columns in my queries.
In the following example, 'probeTypeId' is a partition column
Spark 1.3.0 on YARN (Amazon EMR), cluster of 10 m3.2xlarge (8cpu, 30GB),
executor memory 20GB, driver memory 10GB
I'm using Spark SQL, mainly via spark-shell, to query 15GB of data spread
out over roughly 2,000 Parquet files and my queries frequently hang. Simple
queries like select count(*) from
Shahab -
This should do the trick until Hao's changes are out:
sqlContext.sql("create temporary function foobar as 'com.myco.FoobarUDAF'");
sqlContext.sql("select foobar(some_column) from some_table");
This works without requiring you to 'deploy' a JAR with the UDAF in it - just
make sure the UDAF
ideas?
Jon
15/02/10 12:06:16 INFO MemoryStore: MemoryStore started with capacity
1177.8 MB.
15/02/10 12:06:16 INFO ConnectionManager: Bound socket to port 30129 with
id = ConnectionManagerId(phd40010008.na.com,30129)
15/02/10 12:06:16 INFO BlockManagerMaster: Trying to register BlockManager
15/02
:
Is the SparkContext you're using the same one that the StreamingContext
wraps? If not, I don't think using two is supported.
-Sandy
On Tue, Feb 10, 2015 at 9:58 AM, Jon Gregg jonrgr...@gmail.com wrote:
I'm still getting an error. Here's my code, which works successfully
when tested using
wrote:
You should be able to replace that second line with
val sc = ssc.sparkContext
On Tue, Feb 10, 2015 at 10:04 AM, Jon Gregg jonrgr...@gmail.com wrote:
They're separate in my code, how can I combine them? Here's what I have:
val sparkConf = new SparkConf()
val ssc = new
OK I tried that, but how do I convert an RDD to a Set that I can then
broadcast and cache?
val badIPs = sc.textFile("hdfs:///user/jon/" + "badfullIPs.csv")
val badIPsLines = badIPs.getLines
val badIpSet = badIPsLines.toSet
val badIPsBC = sc.broadcast(badIpSet)
produces
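For reference, getLines isn't a method on RDDs; the usual fix is to
materialize the RDD on the driver and build the Set there. A sketch:

    val badIPs = sc.textFile("hdfs:///user/jon/badfullIPs.csv")
    val badIpSet = badIPs.collect().toSet   // collect() brings the lines to the driver
    val badIPsBC = sc.broadcast(badIpSet)   // broadcast the local Set to every node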
I've had a lot of difficulties with using the s3:// prefix. s3n:// seems
to work much better. Can't find the link ATM, but I seem to recall that
s3:// (Hadoop's original block format for S3) is no longer recommended for
use. Amazon's EMR goes so far as to remap the s3:// to s3n:// behind the
Have a look at RDD.groupBy(...) and reduceByKey(...)
On Tue, Dec 23, 2014 at 4:47 AM, sachin Singh sachin.sha...@gmail.com
wrote:
Hi,
I have a csv file with fields a, b, c.
I want to do aggregation (sum, average, ...) based on any field (a, b, or c) as per
user input,
using Apache Spark Java
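A minimal sketch of the reduceByKey route (the file name and field
positions are assumptions):

    // Hypothetical CSV with numeric field c, grouped by field a.
    val rows = sc.textFile("data.csv").map(_.split(","))
    val sums = rows.map(r => (r(0), r(2).toDouble)).reduceByKey(_ + _)
    // For averages, reduce (sum, count) pairs and divide at the end.
    val avgs = rows
      .map(r => (r(0), (r(2).toDouble, 1L)))
      .reduceByKey { case ((s1, n1), (s2, n2)) => (s1 + s2, n1 + n2) }
      .mapValues { case (s, n) => s / n }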
http://www.jets3t.org/toolkit/configuration.html
Put the following properties in a file named jets3t.properties and make
sure it is available during the running of your Spark job (just place it in
~/ and pass a reference to it when calling spark-submit with --files
~/jets3t.properties)
/
On Thu, Dec 18, 2014 at 5:56 AM, Jon Chase jon.ch...@gmail.com wrote:
I'm running a very simple Spark application that downloads files from S3,
does a bit of mapping, then uploads new files. Each file is roughly 2MB
and is gzip'd. I was running the same code on Amazon's EMR w/Spark
I'm getting the same error (ExecutorLostFailure) - input RDD is 100k
small files (~2MB each). I do a simple map, then keyBy(), and then
rdd.saveAsHadoopDataset(...). Depending on the memory settings given to
spark-submit, the time before the first ExecutorLostFailure varies (more
memory ==
time ago
/Raf
sandy.r...@cloudera.com wrote:
Hi Jon,
The fix for this is to increase spark.yarn.executor.memoryOverhead to
something greater than its default of 384.
This will increase the gap between the executor's heap size and what it
requests from YARN. It's required because JVMs
7392K, reserved
1048576K
On Fri, Dec 19, 2014 at 11:16 AM, Jon Chase jon.ch...@gmail.com wrote:
I'm actually already running 1.1.1.
I also just tried --conf spark.yarn.executor.memoryOverhead=4096, but no
luck. Still getting ExecutorLostFailure (executor lost).
On Fri, Dec 19, 2014
Yes, same problem.
On Fri, Dec 19, 2014 at 11:29 AM, Sandy Ryza sandy.r...@cloudera.com
wrote:
Do you hit the same errors? Is it now saying your containers are exceeding
~10 GB?
On Fri, Dec 19, 2014 at 11:16 AM, Jon Chase jon.ch...@gmail.com wrote:
I'm actually already running 1.1.1.
I
Running on Amazon EMR w/Yarn and Spark 1.1.1, I have trouble getting Yarn
to use the number of executors that I specify in spark-submit:
--num-executors 2
in a cluster with two core nodes. This typically results in only one executor
running at a time. I can play with the memory settings and
I'm running a very simple Spark application that downloads files from S3,
does a bit of mapping, then uploads new files. Each file is roughly 2MB
and is gzip'd. I was running the same code on Amazon's EMR w/Spark and not
having any download speed issues (Amazon's EMR provides a custom
I'm trying to use the spark-ec2 command to launch a Spark cluster that runs
Java 8, but so far I haven't been able to get the Spark processes to use
the right JVM at start up.
Here's the command I use for launching the cluster. Note I'm using the
user-data feature to install Java 8:
./spark-ec2
Wow, it really was that easy! The implicit joining works a treat.
Many thanks,
Jon
On 13 October 2014 22:58, Stephen Boesch java...@gmail.com wrote:
is the following what you are looking for?
scala> sc.parallelize(myMap.map { case (k, v) => (k, v) }.toSeq)
res2: org.apache.spark.rdd.RDD
there is a foreachRDD() method that allows you to do things like:
msgConvertedToDStream.foreachRDD(rdd => println("The count is: " +
rdd.count().toInt))
The syntax for the casting should be changed for Java and probably the
function argument syntax is wrong too, but hopefully there's enough there
to help.
Jon
On Tue
in yarn.application.classpath or should the
spark client take care of ensuring the necessary JARs are added during job
submission?
Any tips would be greatly appreciated!
Cheers,
Jon
\
--master-memory 4g \
--worker-memory 2g \
--worker-cores 1
I'll make sure to use spark-submit from here on out.
Thanks very much!
Jon
On Thu, May 22, 2014 at 12:40 PM, Andrew Or and...@databricks.com wrote:
Hi Jon,
Your configuration looks largely correct. I have very