Re: Why is spark running multiple stages with the same code line?

2022-04-21 Thread Joe
line of code running on different executors. Or is this assumption wrong? Thanks, Joe On Thu, 2022-04-21 at 09:14 -0500, Sean Owen wrote: > A job can have multiple stages for sure. One action triggers a job. > This seems normal. On Thu, Apr 21, 2022, 9:10 AM Joe wrote:

Why is spark running multiple stages with the same code line?

2022-04-21 Thread Joe
are persisted. I'm not sure if the EMR UI display is wrong or if Spark stages are not what I thought they were. Thanks, Joe

Apache Spark Meetup - Wednesday 1st July

2020-06-30 Thread Joe Davies
the presentation. Here is the link - https://www.meetup.com/OrbisConnect/events/271400656/ Thanks, Joe Davies Senior Consultant Direct: +44 (0)203 854 0015 Mobile: +44 (0)7391 650 347

subscribe

2020-02-05 Thread Cool Joe
subscribe

Low-level behavior of Exchange

2019-10-30 Thread Joe Naegele
, d: Long), where 'a'-'d' are randomly-generated values. 1.5+ TB total size, Parquet format * The job is as simple as `spark.read.parquet("input.dat").repartition(N, "a", "b", "c", "d").write.parquet("output.dat")`, where N is roughly (input_data_size / 128 MB). Thanks! --- Joe Naegele Grier
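
For readers skimming the archive, a minimal sketch of that repartition-by-columns job (the session setup and byte count are assumptions, not details from the thread):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().appName("repartition-by-key").getOrCreate()

    // Target roughly 128 MB per output partition; total input size assumed ~1.5 TB.
    val inputBytes = 1.5e12
    val n = math.max(1, (inputBytes / (128L << 20)).toInt)

    // repartition by columns inserts an Exchange (hash partitioning on a..d)
    spark.read.parquet("input.dat")
      .repartition(n, col("a"), col("b"), col("c"), col("d"))
      .write.parquet("output.dat")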

Re: Spark structured streaming leftOuter join not working as I expect

2019-06-10 Thread Joe Ammann
sy, maybe you could share the logical/physical plan on any batch: the "detail" view in the SQL tab shows the plan as a string. Plans from sequential batches would be very helpful, and the streaming query status in these batches (especially the watermark) should be h

Re: Spark structured streaming leftOuter join not working as I expect

2019-06-05 Thread Joe Ammann
inner join with 2 left outer only produces results where both outers have a match; inner join with 1 left outer followed by aggregation only produces the messages with a match. Of course, all are stream-stream joins. CU, Joe On Wednesday, June 5, 2019 09:17 CEST, Jungtaek Lim wrote: > I

Spark structured streaming leftOuter join not working as I expect

2019-06-04 Thread Joe Ammann
l join on records on topic C. The others are not shown in the output. I already tried to * make sure that the optional FK on topic B is never null, by using an NVL2(C_FK, C_FK, '') * widen the time window join on the leftOuter to "B_LAST_MOD < C_LAST_LAST_MOD - interval 5
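
For context, the general shape of a watermarked stream-stream left outer join, as a hedged sketch (dataframe, key, and timestamp names are assumed, not the thread's actual schema): both sides need a watermark and the join condition needs a time bound, otherwise outer rows can never be emitted.

    import org.apache.spark.sql.functions.expr

    val left  = dfB.withWatermark("B_LAST_MOD", "10 minutes")
    val right = dfC.withWatermark("C_LAST_MOD", "10 minutes")

    val joined = left.join(
      right,
      expr("""B_FK = C_ID AND
              C_LAST_MOD BETWEEN B_LAST_MOD - interval 5 minutes
                             AND B_LAST_MOD + interval 5 minutes"""),
      "leftOuter")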

Watermark handling on initial query start (Structured Streaming)

2019-05-20 Thread Joe Ammann
e comes in, after the start of the query, then the watermark is moved ahead and all the messages are produced. Is this the expected behaviour, and is my assumption wrong? Am I doing something wrong during query setup? -- CU, Joe

Re: Handling of watermark in structured streaming

2019-05-14 Thread Joe Ammann
t;sharedId", window("20 seconds", "10 seconds") > > // ProcessingTime trigger with two-seconds micro-batch interval > > |df.writeStream .format("console") .trigger(Trigger.ProcessingTime("2 > seconds")) .start()| > >

Re: Handling of watermark in structured streaming

2019-05-14 Thread Joe Ammann
Hi Anastasios On 5/14/19 4:15 PM, Anastasios Zouzias wrote: > Hi Joe, > > How often do you trigger your mini-batch? Maybe you can specify the trigger > time explicitly to a low value or even better set it off. > > See: > https://spark.apache.org/docs/latest/structured-

Re: Handling of watermark in structured streaming

2019-05-14 Thread Joe Ammann
Hi Suket Sorry, this was a typo in the pseudo-code I sent. Of course, what you suggested (using the same event-time attribute for the watermark and the window) is what my code does in reality. Sorry to confuse people. On 5/14/19 4:14 PM, suket arora wrote: > Hi Joe, > As per the

Handling of watermark in structured streaming

2019-05-14 Thread Joe Ammann
comes in. This is not what I intended. Is my understanding more or less correct? And is there any way of bringing "the real time" into the calculation of the watermark (short of producing regular dummy messages which are then filtered out again)? -- CU, Joe

Re: Spark structured streaming watermarks on nested attributes

2019-05-07 Thread Joe Ammann
Hi Yuanjian On 5/7/19 4:55 AM, Yuanjian Li wrote: > Hi Joe > > I think you met this issue: https://issues.apache.org/jira/browse/SPARK-27340 > You can check the description in Jira and PR. We also met this in our > production env and fixed by the providing PR. > > The PR i

Re: Spark structured streaming watermarks on nested attributes

2019-05-06 Thread Joe Ammann
hen I nest the attributes and use "entityX.LAST_MODIFICATION" (note the dot for the nesting), the joins fail. I have a feeling that the Spark execution plan gets somewhat confused, because in the latter case there are multiple fields called "LAST_MODIFICATION" with differing nes

Spark structured streaming watermarks on nested attributes

2019-05-06 Thread Joe Ammann
;).alias("entity1_LAST_MODIFICATION")).withWatermark("entity1_LAST_MODIFICATION") Before I hunt this down any further, is this kind of a known limitation? Or am I doing something fundamentally wrong? -- CU, Joe
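
The workaround being described, spelled out as a minimal sketch (assuming a struct column entity1 with a timestamp field LAST_MODIFICATION; the 5-minute delay is illustrative):

    import org.apache.spark.sql.functions.col

    // Flatten the nested field into a top-level column first, then declare
    // the watermark on the flattened name.
    val flattened = df
      .withColumn("entity1_LAST_MODIFICATION", col("entity1.LAST_MODIFICATION"))
      .withWatermark("entity1_LAST_MODIFICATION", "5 minutes")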

question about barrier execution mode in Spark 2.4.0

2018-11-12 Thread Joe
a lot, Joe

Re: writing to local files on a worker

2018-11-11 Thread Joe
ine all the data before calling your app then you could do it. Or you could split your job into a Spark -> app -> Spark chain. Good luck, Joe On 11/11/2018 02:13 PM, Steve Lewis wrote: I have a problem where a critical step needs to be performed by a third-party C++ application. I

How does shuffle operation work in Spark?

2018-11-07 Thread Joe
(or a detailed answer). Thanks a lot, Joe

Spark SQL bucket pruning support

2018-01-22 Thread Joe Wang
om/apache/spark/pull/12300>, and the logic in the BucketedReadSuite to verify that pruned buckets are empty is currently commented out <https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/sources/BucketedReadSuite.scala#L114> . Thanks, Joe

share datasets across multiple spark-streaming applications for lookup

2017-10-30 Thread roshan joe
Hi, What is the recommended way to share datasets across multiple spark-streaming applications, so that the incoming data can be looked up against this shared dataset? The shared dataset is also incrementally refreshed and stored on S3. Below is the scenario. Streaming App-1 consumes data from

Re: spark-submit question

2017-02-28 Thread Joe Olson
not found ./test3.sh: line 16: --second-argument=Arg2: command not found From: Marcelo Vanzin <van...@cloudera.com> Sent: Tuesday, February 28, 2017 12:17:49 PM To: Joe Olson Cc: user@spark.apache.org Subject: Re: spark-submit question Everything after the jar path is passe

spark-submit question

2017-02-28 Thread Joe Olson
For spark-submit, I know I can pass application-level command-line parameters to my .jar. However, can I prefix them with switches? My command-line params are processed in my applications using JCommander. I've tried several variations of the below with no success. An example of what I am
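
For the record, spark-submit hands everything after the application jar straight to the application's main method, switches included, so JCommander-style flags should arrive untouched. A hedged example (class, master URL, and flag names invented for illustration):

    spark-submit \
      --class com.example.Main \
      --master spark://master:7077 \
      app.jar \
      --first-argument=Arg1 --second-argument=Arg2

The "--second-argument=Arg2: command not found" error quoted in the reply above is usually a shell problem rather than a spark-submit one: a line-continuation backslash that is not the last character on its line makes bash treat the next line as a new command.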

Re: Fatal error when using broadcast variables and checkpointing in Spark Streaming

2016-07-22 Thread Joe Panciera
    global events
    num = events.value
    print num
    events.unpersist()
    events = sc.broadcast(num + 1)

    alert_stream.foreachRDD(test)  # Comment this line and no error occurs
    ssc.checkpoint('dir')
    ssc.start()
    ssc.awaitTermination()

On Fri, Jul 22, 2016 at 1:50 PM, Joe Panciera <joe.panci...@gm

Fatal error when using broadcast variables and checkpointing in Spark Streaming

2016-07-22 Thread Joe Panciera
Hi, I'm attempting to use broadcast variables to update stateful values used across the cluster for processing. Essentially, I have a function that is executed in .foreachRDD that updates the broadcast variable by calling unpersist() and then rebroadcasting. This works without issues when I
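
A minimal sketch of that unpersist-and-rebroadcast pattern (Scala here, though the thread used PySpark; stream and variable names assumed). Note that broadcast variables are not recreated when a StreamingContext is recovered from a checkpoint, which is the root of the fatal error this thread goes on to describe.

    import org.apache.spark.broadcast.Broadcast

    var state: Broadcast[Int] = ssc.sparkContext.broadcast(0)

    stream.foreachRDD { rdd =>
      val current = state.value                        // read on the driver
      // ... process rdd using current ...
      state.unpersist()                                // drop stale executor copies
      state = ssc.sparkContext.broadcast(current + 1)  // rebroadcast updated value
    }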

Using multiple data sources in one stream

2016-07-20 Thread Joe Panciera
Hi, I have a rather complicated situation that's raised an issue regarding consuming multiple data sources for processing. Unlike the use cases I've found, I have 3 sources of different formats. There's one 'main' stream A that does the processing, and 2 sources B and C that provide elements

Re: Standalone cluster node utilization

2016-07-14 Thread Zhou (Joe) Xing
I have seen similar behavior in my standalone cluster. I tried to increase the number of partitions, and at some point it seems all the executors or worker nodes start to make parallel connections to the remote data store. But it would be nice if someone could point us to some references on how to

Re: Matrix Factorization Model model.save error "NullPointerException"

2016-07-12 Thread Zhou (Joe) Xing
Does anyone have an idea what this NPE issue below is about? Thank you! cheers zhou On Jul 11, 2016, at 11:27 PM, Zhou (Joe) Xing <joe.x...@nextev.com> wrote: Hi Guys, I searched the archive and also googled this problem when saving the ALS tr

Matrix Factorization Model model.save error "NullPointerException"

2016-07-12 Thread Zhou (Joe) Xing
Hi Guys, I searched the archive and also googled this problem when saving the ALS-trained Matrix Factorization Model to the local file system using the Model.save() method. I found some hints, such as partitioning the model before saving, etc., but it does not seem to solve my problem. I'm always

Re: Variable in UpdateStateByKey Not Updating After Restarting Application from Checkpoint

2016-06-09 Thread Joe Panciera
On Wed, Jun 8, 2016 at 1:27 PM, Joe Panciera <joe.panci...@gmail.com> wrote: > I've run into an issue where a global variable used within an > UpdateStateByKey function isn't being assigned after the application > restarts from a checkpoint. Using ForEachRDD I have a

Variable in UpdateStateByKey Not Updating After Restarting Application from Checkpoint

2016-06-08 Thread Joe Panciera
I've run into an issue where a global variable used within an UpdateStateByKey function isn't being assigned after the application restarts from a checkpoint. Using ForEachRDD I have a global variable 'A' that is propagated from a file every time a batch runs, and A is then used in an
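
A likely mechanism, sketched under assumptions (Scala; paths and batch interval invented): with checkpoint recovery via StreamingContext.getOrCreate, the context-creation function is skipped on restart, so any global assigned only inside it is never re-populated.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("checkpointed-app")
      val ssc = new StreamingContext(conf, Seconds(10))
      // Globals assigned only here (e.g. 'A' loaded from a file) run on a fresh
      // start only; on recovery the checkpointed graph is deserialized instead.
      ssc.checkpoint("hdfs://namenode/checkpoints")
      ssc
    }

    val ssc = StreamingContext.getOrCreate("hdfs://namenode/checkpoints", createContext _)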

Choosing an Algorithm in Spark MLib

2016-04-12 Thread Joe San
I'm working on a problem where I have some data sets about some power-generating units. Each of these units has been activated to run in the past

Spark GraphX + TitanDB + Cassandra?

2016-01-26 Thread Joe Bako
/Gremlin even be a consideration? The underlying data model that Titan uses in Cassandra does not seem accessible for direct querying via CQL/Thrift. Any guidance around this nebulous subject is much appreciated! Joe Bako Software Architect Gracenote, Inc. Mobile: 925.818.2230 http

Re: Spark dramatically slow when I add saveAsTextFile

2015-05-24 Thread Joe Wass
of 'is this taking too long', I can't answer that. But your code was HTML-escaped and therefore difficult to read; perhaps you should post a link to a Gist. Joe On 24 May 2015 at 10:36, allanjie allanmcgr...@gmail.com wrote: *Problem Description*: The program running in stand-alone spark cluster (1 master

Re: Is anyone using Amazon EC2?

2015-05-23 Thread Joe Wass
AM, Johan Beisser j...@caustic.org wrote: Yes. We're looking at bootstrapping in EMR... On Sat, May 23, 2015 at 07:21 Joe Wass jw...@crossref.org wrote: I used Spark on EC2 a while ago

Is anyone using Amazon EC2?

2015-05-23 Thread Joe Wass
I used Spark on EC2 a while ago

Running out of space (when there's no shortage)

2015-02-24 Thread Joe Wass
: java.io.IOException: No space left on device These both happen after several hours and several GB of temporary files. Why does Spark think it's run out of space? TIA Joe Stack trace 1: org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0

Re: Unzipping large files and 2GB partition size.

2015-02-19 Thread Joe Wass
. I'll give them a go (although it's not trivial to re-encode 200 GB of data on S3, so if I can get this working reasonably with gzip I'd like to). Any advice about whether this error can be worked round with an early partition? Cheers Joe On 19 February 2015 at 09:51, Sean Owen so

Unzipping large files and 2GB partition size.

2015-02-19 Thread Joe Wass
to over 2GB? Does anyone have any experience with this? Thanks in advance Joe Stack trace: Exception in thread main 15/02/18 20:44:25 INFO scheduler.TaskSetManager: Lost task 5.3 in stage 1.0 (TID 283) on executor: java.lang.IllegalArgumentException (Size exceeds Integer.MAX_VALUE) [duplicate 6
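
Background for archive readers: gzip files are not splittable, so each .gz file arrives as a single partition, and any partition whose serialized form passes 2 GB trips the Integer.MAX_VALUE limit seen above. A minimal sketch of the usual workaround (bucket path and partition count assumed):

    // One partition per .gz file after the read; spread the data out
    // before any heavy transformation so no partition exceeds 2 GB.
    val lines = sc.textFile("s3n://bucket/input/*.gz")
    val spread = lines.repartition(2000)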

Re: Unzipping large files and 2GB partition size.

2015-02-19 Thread Joe Wass
concatenated input files (probably about 800 GB)? In that case should I multiply it by the number of files? Or perhaps I'm barking up completely the wrong tree. Joe On 19 February 2015 at 14:44, Imran Rashid iras...@cloudera.com wrote: Hi Joe, The issue is not that you have input partitions

Has Spark 1.2.0 changed EC2 persistent-hdfs?

2015-02-13 Thread Joe Wass
something must have changed in the last couple of weeks (last time I was using 1.1.0). Is this a bug? Has the behaviour of AWS changed? Am I doing something stupid? How do I fix it? Thanks in advance! Joe

Re: Has Spark 1.2.0 changed EC2 persistent-hdfs?

2015-02-13 Thread Joe Wass
Looks like this is caused by issue SPARK-5008: https://issues.apache.org/jira/browse/SPARK-5008 On 13 February 2015 at 19:04, Joe Wass jw...@crossref.org wrote: I've updated to Spark 1.2.0 and the EC2 and the persistent-hdfs behaviour appears to have changed. My launch script is spark

How do I set spark.local.dirs?

2015-02-06 Thread Joe Wass
have checked that these values are present on the nodes. But it's still creating temp files in the wrong (default) place: /mnt2/spark How do I get my slaves to pick up this value? How can I verify that they have? Thanks! Joe
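
A sketch of the usual knobs, with paths assumed: in standalone mode the worker's SPARK_LOCAL_DIRS environment variable overrides the application's spark.local.dir, which is a common reason the property appears to be ignored.

    # spark-env.sh on every worker (standalone mode)
    export SPARK_LOCAL_DIRS=/vol0/spark,/vol1/spark

    # or, application-side, in spark-defaults.conf
    spark.local.dir /vol0/spark,/vol1/spark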

Re: How many stages in my application?

2015-02-05 Thread Joe Wass
the webui you will be able to see how many operations have happened so far. Thanks Best Regards On Wed, Feb 4, 2015 at 4:33 PM, Joe Wass jw...@crossref.org wrote: I'm sitting here looking at my application crunching gigabytes of data on a cluster and I have no idea if it's an hour away from

How many stages in my application?

2015-02-04 Thread Joe Wass
. Did I miss something obvious? Thanks! Joe

ephemeral-hdfs vs persistent-hdfs - performance

2015-02-03 Thread Joe Wass
?) This seems like a common use-case, so sorry if this has already been covered. Joe

Re: ephemeral-hdfs vs persistent-hdfs - performance

2015-02-03 Thread Joe Wass
of the cluster needed to process the data from the size of the data. DR On 02/03/2015 11:43 AM, Joe Wass wrote: I want to process about 800 GB of data on an Amazon EC2 cluster. So, I need to store the input in HDFS somehow. I currently have a cluster of 5 x m3.xlarge, each of which has 80GB

Kryo serialization and OOM

2015-02-03 Thread Joe Wass
serialization needs as I understand them. Have I misunderstood? Joe Stack trace: INFO org.apache.spark.scheduler.TaskSetManager - Starting task 0.0 in stage 30.0 (TID 116, localhost, PROCESS_LOCAL, 993 bytes) INFO org.apache.spark.executor.Executor - Running task 0.0 in stage 30.0 (TID 116
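
For reference, a minimal Kryo setup sketch (buffer size is an assumption; very old releases spelled the property spark.kryoserializer.buffer.max.mb):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Raise this if a single serialized object overflows the default buffer.
      .set("spark.kryoserializer.buffer.max", "512m")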

Re: ephemeral-hdfs vs persistent-hdfs - performance

2015-02-03 Thread Joe Wass
Thanks very much, that's good to know, I'll certainly give it a look. Can you give me a hint about how you unzip your input files on the fly? I thought that it wasn't possible to parallelize zipped inputs unless they were unzipped before passing to Spark? Joe On 3 February 2015 at 17:48, David

Re: PermGen issues on AWS

2015-01-09 Thread Joe Wass
generation, so this particular type of problem and tuning is not needed. You might consider running on Java 8. On Fri, Jan 9, 2015 at 10:38 AM, Joe Wass jw...@crossref.org wrote: I'm running on an AWS cluster of 10 x m1.large (64 bit, 7.5 GiB RAM). FWIW I'm using the Flambo Clojure wrapper which

Accidental kill in UI

2015-01-09 Thread Joe Wass
So I had a Spark job with various failures, and I decided to kill it and start again. I clicked the 'kill' link in the web console, restarted the job on the command line and headed back to the web console and refreshed to see how my job was doing... the URL at the time was:

PermGen issues on AWS

2015-01-09 Thread Joe Wass
the extraJavaOptions and memoryOverhead? Thanks very much! Joe
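
The knobs in question, as a hedged example (sizes assumed; PermGen exists only on Java 7 and earlier, and the memoryOverhead property shown is the YARN-era name):

    spark-submit \
      --conf spark.driver.extraJavaOptions=-XX:MaxPermSize=256m \
      --conf spark.executor.extraJavaOptions=-XX:MaxPermSize=256m \
      --conf spark.yarn.executor.memoryOverhead=1024 \
      ...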

Are failures normal / to be expected on an AWS cluster?

2014-12-20 Thread Joe Wass
, but CANNOT FIND ADDRESS for half of the executors. Are these numbers normal for AWS? Should a certain number of faults be expected? I know that AWS isn't meant to be perfect, but this doesn't seem right. Cheers Joe

GC problem while filtering large data

2014-12-16 Thread Joe L
Hi, I am trying to filter a large table with 3 columns. Spark SQL might be a good choice, but I want to do it without SQL. The goal is to filter the big table with multiple clauses. I filtered the big table 3 times; the first filtering takes about 50 seconds but the second and third filter transformations took about
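
If the same table feeds several filters, caching it once may explain or remove the asymmetry, since the first action also pays the load cost. A minimal sketch (path and predicates p1..p3 assumed):

    val bigtable = sc.textFile("hdfs://namenode/bigtable").cache()
    bigtable.filter(r => p1(r)).count()  // first action pays read + cache cost
    bigtable.filter(r => p2(r)).count()  // subsequent filters read from memory
    bigtable.filter(r => p3(r)).count()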

classnotfound error due to groupByKey

2014-07-04 Thread Joe L
Hi, when I run the following piece of code, it throws a ClassNotFoundException. Any suggestion would be appreciated. I wanted to group an RDD by key: val t = rdd.groupByKey() Error message: java.lang.ClassNotFoundException: org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKey$

ec2 deployment regions supported

2014-06-07 Thread Joe Mathai
Hi, I am interested in deploying Spark 1.0.0 on EC2 and wanted to know which regions are supported. I was able to deploy the previous version in us-east, but I had a hard time launching the cluster due to a bad connection; the provided script would fail to ssh into a node after a couple of

Need equallyWeightedPartitioner Algorithm

2014-06-03 Thread Joe L
I need to partition my data into equally weighted partitions: suppose I have 20 GB of data and I want 4 partitions, where each partition has 5 GB of the data. Thanks
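
Absent a custom partitioner, a plain hash repartition already gives roughly equal-sized partitions when keys are well distributed; a one-line sketch:

    val balanced = rdd.repartition(4)  // hashes records across 4 roughly 5 GB partitions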

Map failed [dupliacte 1] error

2014-05-27 Thread Joe L
Hi, I am getting the following error but I don't understand what the problem is. 14/05/27 17:44:29 INFO TaskSetManager: Loss was due to java.io.IOException: Map failed [duplicate 15] 14/05/27 17:44:30 INFO TaskSetManager: Starting task 47.0:43 as TID 60281 on executor 0: cm07 (PROCESS_LOCAL)

facebook data mining with Spark

2014-05-19 Thread Joe L
Is there any way to get Facebook data into Spark and filter the content of it?

ClassNotFoundException

2014-05-01 Thread Joe L
Hi, I am getting the following error. How could I fix this problem? Joe 14/05/02 03:51:48 WARN TaskSetManager: Lost TID 12 (task 2.0:1) 14/05/02 03:51:48 INFO TaskSetManager: Loss was due to java.lang.ClassNotFoundException: org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKey$4

help

2014-04-27 Thread Joe L
I am getting this error, please help me to fix it: 14/04/28 02:16:20 INFO SparkDeploySchedulerBackend: Executor app-20140428021620-0007/10 removed: class java.io.IOException: Cannot run program /home/exobrain/install/spark-0.9.1/bin/compute-classpath.sh (in directory .): error=13,

read file from hdfs

2014-04-25 Thread Joe L
I have just two questions: sc.textFile("hdfs://host:port/user/matei/whatever.txt") Is host the master node? What port should we use?
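
For the record: the host:port in an hdfs:// URL is the HDFS NameNode (set by fs.defaultFS, or fs.default.name on older Hadoop, in core-site.xml), not the Spark master; 8020 and 9000 are common defaults. A sketch with an assumed hostname:

    val lines = sc.textFile("hdfs://namenode-host:8020/user/matei/whatever.txt")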

help

2014-04-25 Thread Joe L
I need someone's help, please. I am getting the following error. [error] 14/04/26 03:09:47 INFO cluster.SparkDeploySchedulerBackend: Executor app-20140426030946-0004/8 removed: class java.io.IOException: Cannot run program /home/exobrain/install/spark-0.9.1/bin/compute-classpath.sh (in directory

Re: help

2014-04-25 Thread Joe L
Hi, thank you for your reply, but I could not find it. It says no such file or directory: http://apache-spark-user-list.1001560.n3.nabble.com/file/n4848/Capture.png

help me

2014-04-22 Thread Joe L
I got the following performance; is it normal for Spark to be like this? Sometimes Spark switches into NODE_LOCAL mode from PROCESS_LOCAL and becomes 10x faster. I am very confused.

    scala> val a = sc.textFile("/user/exobrain/batselem/LUBM1000")
    scala> f.count()
    Long = 137805557, took 130.809661618 s

Re: Spark is slow

2014-04-21 Thread Joe L
    g1 = pairs1.groupByKey().count()
    pairs1 = pairs1.groupByKey(g1).cache()
    g2 = triples.groupByKey().count()
    pairs2 = pairs2.groupByKey(g2)
    pairs = pairs2.join(pairs1)

Hi, I want to implement hash-partitioned joining as shown above. But somehow it is taking very long to perform. As I
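
One way to get a genuinely hash-partitioned join is to co-partition both sides once and cache them, so the join itself needs no further shuffle; a Scala sketch (partition count assumed):

    import org.apache.spark.HashPartitioner

    val part = new HashPartitioner(200)
    val left  = pairs1.partitionBy(part).cache()
    val right = pairs2.partitionBy(part).cache()
    val joined = left.join(right)  // both sides share a partitioner: no re-shuffle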

evaluate spark

2014-04-20 Thread Joe L
I want to evaluate Spark performance by measuring the running time of transformation operations such as map and join. To do so, is it enough to materialize them with a count action? As far as I know, transformations are lazy operations and don't do any computation until we call an action on them, but when
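
Yes, transformations are lazy, so timing them means forcing evaluation with an action, and count is the usual choice; a sketch (f assumed):

    val t0 = System.nanoTime()
    rdd.map(f).count()                        // count forces the lazy map to run
    val secs = (System.nanoTime() - t0) / 1e9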

what is a partition? how it works?

2014-04-16 Thread Joe L
do they reside in memory in the cluster? I am sorry for such a simple question, but I couldn't find any specific information about what happens underneath partitioning. Thank you, Joe

groupByKey returns a single partition in a RDD?

2014-04-15 Thread Joe L
I want to apply the following transformations to 60 GB of data on 7 nodes with 10 GB of memory each. And I am wondering if the groupByKey() function returns an RDD with a single partition for each key? If so, what will happen if the size of the partition doesn't fit into that particular node? rdd =
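
For archive readers: groupByKey does not give each key its own partition; it hashes keys into numPartitions partitions (spark.default.parallelism when unspecified), and all values of one key must fit on the executor processing that partition. A one-line sketch (count assumed):

    val grouped = rdd.groupByKey(70)  // 70 partitions across the cluster, not one per key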

what is the difference between element and partition?

2014-04-15 Thread Joe L

groupByKey(None) returns partitions according to the keys?

2014-04-15 Thread Joe L
I was wondering if groupByKey returns 2 partitions in the example below?

    x = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
    sorted(x.groupByKey().mapValues(list).collect())
    # [('a', [1, 1]), ('b', [1])]

Proper caching method

2014-04-14 Thread Joe L
Hi, I am trying to cache 2 GB of data and to implement the following procedure. In order to cache the data I did as follows. Is it necessary to cache rdd2 since rdd1 is already cached?

    rdd1 = textFile("hdfs...").cache()
    rdd2 = rdd1.filter(userDefinedFunc1).cache()
    rdd3 =
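
As a rule of thumb, cache only datasets that are traversed more than once; a hedged sketch (predicates assumed):

    val rdd1 = sc.textFile("hdfs://namenode/data").cache()  // reused by several children
    val rdd2 = rdd1.filter(f1)  // if used once downstream, caching it buys nothing
    val rdd3 = rdd1.filter(f2)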

shuffle vs performance

2014-04-14 Thread Joe L
I was wondering whether partitioning RDDs into fewer partitions could help Spark performance and reduce shuffling. Is that true?

how to use a single filter instead of multiple filters

2014-04-13 Thread Joe L
Hi, I have multiple filters as shown below. Should I use a single combined filter instead of them? Can these filters degrade Spark's performance? http://apache-spark-user-list.1001560.n3.nabble.com/file/n4185/Capture.png
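
On the narrow question: chained filters over one RDD are pipelined into a single stage, so the gap versus one combined predicate is small; they can still be merged (predicates assumed):

    val result = rdd.filter(x => p1(x) && p2(x) && p3(x))  // one pass, one closure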

how to count maps without shuffling too much data?

2014-04-13 Thread Joe L