Re: Seeking Guidance on Spark on Kubernetes Secrets Configuration

2023-10-01 Thread Jon Rodríguez Aranguren
in this endeavor. Warm regards, Jon Rodríguez Aranguren. On Sat, 30 Sept 2023 at 23:19, Jayabindu Singh () wrote: > Hi Jon, > > Using IAM as suggested by Jorn is the best approach. > We recently moved our spark workload from HDP to Spark on K8 and utilizing > IAM. > It will sa

Seeking Guidance on Spark on Kubernetes Secrets Configuration

2023-09-29 Thread Jon Rodríguez Aranguren
the experiences and knowledge present within this community. Warm regards, Jon

Re: Apache Spark MLIB

2017-02-24 Thread Jon Gregg
ing-guide.html#broadcast-variables> that would allow you to put flagged IP ranges into an array and make that available on every node. Then you can use filters to detect users who've logged in from a flagged IP range. Jon Gregg On Thu, Feb 23, 2017 at 9:19 PM, Mina Aslani <aslanim...@gmail.c
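A minimal sketch of the broadcast-plus-filter approach described above, assuming Scala and hypothetical file/variable names (the thread itself shows no code):

    // Collect the flagged IP ranges to the driver and broadcast them to every node.
    val flaggedIPs: Set[String] = sc.textFile("hdfs:///flagged_ips.txt").collect().toSet
    val flaggedBC = sc.broadcast(flaggedIPs)

    // logins: RDD[(user, ip)] -- keep only logins from a flagged IP.
    val suspicious = logins.filter { case (_, ip) => flaggedBC.value.contains(ip) }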

Re: Spark executors in streaming app always uses 2 executors

2017-02-22 Thread Jon Gregg
A fix might be as simple as switching to the direct approach <https://spark.apache.org/docs/2.1.0/streaming-kafka-0-8-integration.html#approach-2-direct-approach-no-receivers> ? Jon Gregg On Wed, Feb 22, 2017 at 12:37 AM, satishl <satish.la...@gmail.com> wrote: > I am reading fro
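For reference, a hedged sketch of the direct (receiver-less) stream the link describes, using the spark-streaming-kafka-0-8 API; broker and topic names are placeholders:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("my-topic"))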

Re: Query data in subdirectories in Hive Partitions using Spark SQL

2017-02-18 Thread Jon Gregg
ories, and then use Spark SQL to query that Hive table. There might be a cleaner way to do this in Spark 2.0+ but that is a common pattern for me in Spark 1.6 when I know the directory structure but don't have "=" signs in the paths. Jon Gregg On Fri, Feb 17, 2017 at 7:02 PM, 颜
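A sketch of that pattern under stated assumptions (Spark 1.6 HiveContext; table name, schema, and paths are hypothetical). The key point is that partition paths lacking "=" signs can be registered explicitly:

    val hc = new org.apache.spark.sql.hive.HiveContext(sc)
    hc.sql("""CREATE EXTERNAL TABLE IF NOT EXISTS events (id BIGINT, payload STRING)
              PARTITIONED BY (dt STRING) STORED AS PARQUET LOCATION '/data/events'""")
    // Map each bare subdirectory to a partition by hand.
    hc.sql("ALTER TABLE events ADD PARTITION (dt='2017-02-18') LOCATION '/data/events/20170218'")
    hc.sql("SELECT count(*) FROM events WHERE dt = '2017-02-18'").show()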

Re: skewed data in join

2017-02-17 Thread Jon Gregg
It depends how you salt it. See slide 40 and onwards from a Spark Summit talk here: http://www.slideshare.net/cloudera/top-5-mistakes-to-avoid-when-writing-apache-spark-applications The speakers use a mod8 integer salt appended to the end of the key, the salt that works best for you might be
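A minimal sketch of that salting technique (mod-8 salt appended to the key, as in the slides; the RDD names are hypothetical):

    import scala.util.Random
    val numSalts = 8
    // Salt the skewed side with a random suffix 0..7 so one hot key spreads over 8 partitions.
    val saltedBig = bigRdd.map { case (k, v) => ((k, Random.nextInt(numSalts)), v) }
    // Replicate the small side once per salt value so every salted key still matches.
    val saltedSmall = smallRdd.flatMap { case (k, v) => (0 until numSalts).map(s => ((k, s), v)) }
    val joined = saltedBig.join(saltedSmall).map { case ((k, _), vs) => (k, vs) }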

Re: notebook connecting Spark On Yarn

2017-02-15 Thread Jon Gregg
of your cluster size, and make it so users can manually ask for more if they need to. It doesn't take a whole lot of workers/memory to build most of your spark code off a sample. Jon On Wed, Feb 15, 2017 at 6:41 AM, Sachin Aggarwal <different.sac...@gmail.com > wrote: > Hi, > > I am t
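The sample-first workflow suggested above might look like this (a sketch; the fraction and names are arbitrary):

    // Develop interactively against ~1% of the data so the notebook needs few executors.
    val sample = fullRdd.sample(withReplacement = false, fraction = 0.01, seed = 42L)
    sample.cache()
    // Iterate on `sample`, then submit the finished job against `fullRdd` with more resources.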

Re: Order of rows not preserved after cache + count + coalesce

2017-02-13 Thread Jon Gregg
Spark has a zipWithIndex function for RDDs ( http://stackoverflow.com/a/26081548) that adds an index column right after you create an RDD, and I believe it preserves order. Then you can sort it by the index after the cache step. I haven't tried this with a Dataframe but this answer seems
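A hedged sketch of that zipWithIndex idea (untested with DataFrames, as the post says):

    // Tag each row with its original position immediately after creating the RDD.
    val indexed = rdd.zipWithIndex().map { case (row, idx) => (idx, row) }
    indexed.cache()
    indexed.count()
    // After cache/count/coalesce, restore the original order via the index.
    val reordered = indexed.coalesce(1).sortByKey().values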

Re: Lost executor 4 Container killed by YARN for exceeding memory limits.

2017-02-13 Thread Jon Gregg
s possible to filter your tables down before the join (keeping just the rows/columns you need), that may be a better solution. Jon On Mon, Feb 13, 2017 at 5:27 AM, nancy henry <nancyhenry6...@gmail.com> wrote: > Hi All,, > > I am getting below error while I am trying to join 3
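A sketch of the filter-before-join advice (DataFrame API; table and column names are hypothetical):

    import org.apache.spark.sql.functions.col
    // Prune rows and columns on each side before joining so less data is shuffled.
    val slimA = tableA.select("id", "amount").where(col("dt") === "2017-02-13")
    val slimB = tableB.select("id", "label")
    val joined = slimA.join(slimB, Seq("id"))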

Re: does persistence required for single action ?

2017-02-08 Thread Jon Gregg
some processing to create and I might want to use rdd6 for more analyses in the future. Jon On Wed, Feb 8, 2017 at 1:40 AM, Jörn Franke <jornfra...@gmail.com> wrote: > Depends on the use case, but a persist before checkpointing can make sense > after some of the map steps. >
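A sketch of the persist-before-checkpoint pattern Jörn describes, reusing the thread's `rdd6`. Checkpointing runs a separate job over the lineage, so persisting first avoids recomputing the expensive map steps:

    rdd6.persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK)
    rdd6.checkpoint()
    rdd6.count()  // materializes the persisted data and writes the checkpoint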

Re: wholeTextFiles fails, but textFile succeeds for same path

2017-02-06 Thread Jon Gregg
workaround that might work as well: http://michaelryanbell.com/processing-whole-files-spark-s3.html Jon On Mon, Feb 6, 2017 at 6:38 PM, Paul Tremblay <paulhtremb...@gmail.com> wrote: > I've actually been able to trace the problem to the files being read in. > If I change to a differe

Re: Cannot read Hive Views in Spark SQL

2017-02-06 Thread Jon Gregg
Confirming that Spark can read newly created views - I just created a test view in HDFS and I was able to query it in Spark 1.5 immediately after without a refresh. Possibly an issue with your Spark-Hive connection? Jon On Sun, Feb 5, 2017 at 9:31 PM, KhajaAsmath Mohammed < mdkhaja

Re: Spark Python in Jupyter Notebook

2017-01-05 Thread Jon G
I don't use MapR but I use pyspark with jupyter, and this MapR blogpost looks similar to what I do to setup: https://community.mapr.com/docs/DOC-1874-how-to-use-jupyter-pyspark-on-mapr On Thu, Jan 5, 2017 at 3:05 AM, neil90 wrote: > Assuming you don't have your
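The usual (non-MapR-specific) wiring is a pair of environment variables; this is a sketch of the common setup rather than the blogpost's exact steps:

    export PYSPARK_DRIVER_PYTHON=jupyter
    export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
    pyspark   # launches Jupyter with `sc` pre-created in new notebooks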

Re: Starting a new Spark codebase, Python or Scala / Java?

2016-11-21 Thread Jon Gregg
Spark is written in Scala, so yes it's still the strongest option. You also get the Dataset type with Scala (compile time type-safety), and that's not an available feature with Python. That said, I think the Python API is a viable candidate if you use Pandas for Data Science. There are

Re: How do I access the nested field in a dataframe, spark Streaming app... Please help.

2016-11-20 Thread Jon Gregg
In these cases it might help to just flatten the DataFrame. Here's a helper function from the tutorial (scroll down to the "Flattening" header:
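A hedged reconstruction of such a flattening helper (the linked tutorial's code is not reproduced here; this is the widely used recursive-schema pattern):

    import org.apache.spark.sql.Column
    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types.StructType

    def flattenSchema(schema: StructType, prefix: String = null): Array[Column] =
      schema.fields.flatMap { f =>
        val name = if (prefix == null) f.name else s"$prefix.${f.name}"
        f.dataType match {
          case st: StructType => flattenSchema(st, name)   // recurse into nested structs
          case _              => Array(col(name))
        }
      }

    // usage: df.select(flattenSchema(df.schema): _*)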

Re: Spark AVRO S3 read not working for partitioned data

2016-11-17 Thread Jon Gregg
Making a guess here: you need to add s3:ListBucket? http://stackoverflow.com/questions/35803808/spark-saveastextfile-to-s3-fails On Thu, Nov 17, 2016 at 2:11 PM, Jain, Nishit wrote: > When I read a specific file it works: > > val filePath=
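If that guess is right, the missing grant would look roughly like this (bucket name hypothetical; ListBucket applies to the bucket itself, GetObject to its keys):

    {
      "Version": "2012-10-17",
      "Statement": [
        { "Effect": "Allow", "Action": ["s3:ListBucket"],
          "Resource": "arn:aws:s3:::your-bucket" },
        { "Effect": "Allow", "Action": ["s3:GetObject"],
          "Resource": "arn:aws:s3:::your-bucket/*" }
      ]
    }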

Re: Need guidelines in Spark Streaming and Kafka integration

2016-11-16 Thread Jon Gregg
Since you're completely new to Kafka, I would start with the Kafka docs ( https://kafka.apache.org/documentation). You should be able to get through the Getting Started part easily and there are some examples for setting up a basic Kafka server. You don't need Kafka to start working with Spark

Re: Newbie question - Best way to bootstrap with Spark

2016-11-14 Thread Jon Gregg
Piggybacking off this - how are you guys teaching DataFrames and Datasets to new users? I haven't taken the edx courses but I don't see Spark SQL covered heavily in the syllabus. I've dug through the Databricks documentation but it's a lot of information for a new user I think - hoping there is

Re: Cumulative Sum function using Dataset API

2016-08-09 Thread Jon Barksdale
Cool, learn something new every day. Thanks again. On Tue, Aug 9, 2016 at 4:08 PM ayan guha <guha.a...@gmail.com> wrote: > Thanks for reporting back. Glad it worked for you. Actually sum with > partitioning behaviour is same in oracle too. > On 10 Aug 2016 03:01, "Jon Ba

Re: Cumulative Sum function using Dataset API

2016-08-09 Thread Jon Barksdale
:) Thank you both for your help, Jon On Tue, Aug 9, 2016 at 3:01 AM Santoshakhilesh <santosh.akhil...@huawei.com> wrote: > You could check following link. > > > http://stackoverflow.com/questions/35154267/how-to-compute-cumulative-sum-using-spark > > > > *From:* Jo

Re: Cumulative Sum function using Dataset API

2016-08-08 Thread Jon Barksdale
that make sense? Naturally, if ordering a sum turns it into a cumulative sum, I'll gladly use that :) Jon On Mon, Aug 8, 2016 at 4:55 PM ayan guha <guha.a...@gmail.com> wrote: > You mean you are not able to use sum(col) over (partition by key order by > some_col) ? > > On Tue, Aug
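The windowed sum the thread converges on, sketched with the column names from ayan's reply (with an ORDER BY and no explicit frame, the window defaults to unbounded-preceding-to-current-row, i.e. a running sum):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.sum

    val w = Window.partitionBy("key").orderBy("some_col")
    val result = df.withColumn("cum_sum", sum("col").over(w))
    // SQL form: SELECT sum(col) OVER (PARTITION BY key ORDER BY some_col) FROM t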

Cumulative Sum function using Dataset API

2016-08-08 Thread jon
time reading through the code, so I don't know how easy or hard that would be. TLDR; What's the best way to write a function that returns a value for every row, but has mutable state, and gets row in a specific order? Does anyone have any ideas, or examples? Thanks, Jon -- View this message

Extremely slow shuffle writes and large job time fluctuations

2016-07-19 Thread Jon Chase
I'm running into an issue with a pyspark job where I'm sometimes seeing extremely variable job times (20min to 2hr) and very long shuffle times (e.g. ~2 minutes for 18KB/86 records). Cluster set up is Amazon EMR 4.4.0, Spark 1.6.0, an m4.2xl driver and a single m4.10xlarge (40 vCPU, 160GB)

Re: A number of issues when running spark-ec2

2016-04-16 Thread Jon Gregg
That link points to hadoop2.6.tgz. I tried changing the URL to https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.7.tgz and I get a NoSuchKey error. Should I just go with it even though it says hadoop2.6? On Sat, Apr 16, 2016 at 5:37 PM, Ted Yu wrote:

Re: Running Spark on Yarn-Client/Cluster mode

2016-04-12 Thread Jon Kjær Amundsen
https://issues.apache.org/jira/browse/YARN-4714 for further info / Jon / jokja Venlig hilsen/Best regards Jon Kjær Amundsen Information Architect & Product Owner Phone: +45 7023 9080 Direct: +45 8882 1331 E-mail: j...@udbudsvagten.dk Web: www.udbudsvagten.dk Nitivej 10 | DK - 2000 Frederiks

Re: Spark SQL 1.3 not finding attribute in DF

2015-12-07 Thread Jon Gregg
I'm working with a Hadoop distribution that doesn't support 1.5 yet, we'll be able to upgrade in probably two months. For now I'm seeing the same issue with spark not recognizing an existing column name in many hive-table-to-dataframe situations: Py4JJavaError: An error occurred while calling

Re: Spark SQL "SELECT ... LIMIT" scans the entire Hive table?

2015-11-05 Thread Jon Gregg
Here's my code: my_data = sqlCtx.sql("SELECT * FROM raw.site_activity_data LIMIT 2") my_data.collect() raw.site_activity_data is a Hive external table atop daily-partitioned .gzip data. When I execute the command I start seeing many of these pop up in the logs (below is a small subset)

Re: Spark SQL Exception: Conf non-local session path expected to be non-null

2015-10-20 Thread Jon Gregg
1.3 on cdh 5.4.4 ... I'll take the responses to mean that the fix will be probably a few months away for us. Not a huge problem but something I've run into a number of times. On Tue, Oct 20, 2015 at 3:01 PM, Yin Huai wrote: > btw, what version of Spark did you use? > > On

Re: About memory leak in spark 1.4.1

2015-09-28 Thread Jon Chase
I'm seeing a similar (same?) problem on Spark 1.4.1 running on Yarn (Amazon EMR, Java 8). I'm running a Spark Streaming app 24/7 and system memory eventually gets exhausted after about 3 days and the JVM process dies with: # # There is insufficient memory for the Java Runtime Environment to

Re: collect() works, take() returns ImportError: No module named iter

2015-08-10 Thread Jon Gregg
We did have 2.7 on the driver, 2.6 on the edge nodes and figured that was the issue, so we've tried many combinations since then with all three of 2.6.6, 2.7.5, and Anaconda's 2.7.10 on each node with different PATHs and PYTHONPATHs each time. Every combination has produced the same error. We

Re: Possible to combine all RDDs from a DStream batch into one?

2015-07-15 Thread Jon Chase
I should note that the amount of data in each batch is very small, so I'm not concerned with performance implications of grouping into a single RDD. On Wed, Jul 15, 2015 at 9:58 PM, Jon Chase jon.ch...@gmail.com wrote: I'm currently doing something like this in my Spark Streaming program (Java

Possible to combine all RDDs from a DStream batch into one?

2015-07-15 Thread Jon Chase
I'm currently doing something like this in my Spark Streaming program (Java): dStream.foreachRDD((rdd, batchTime) -> { log.info("processing RDD from batch {}", batchTime); // my rdd processing code }); Instead of having my

Re: Spark SQL and Skewed Joins

2015-06-16 Thread Jon Walton
individual queries against the large fact table and union the results. Does this sound like a worthwhile approach? Thank you, Jon

Spark SQL and Skewed Joins

2015-06-12 Thread Jon Walton
tables? Is this even feasible? Do table broadcasts wind up in the heap or in dedicated storage space? Thanks for your help, Jon

Re: RDD collect hangs on large input data

2015-04-07 Thread Jon Chase
Zsolt - what version of Java are you running? On Mon, Mar 30, 2015 at 7:12 AM, Zsolt Tóth toth.zsolt@gmail.com wrote: Thanks for your answer! I don't call .collect because I want to trigger the execution. I call it because I need the rdd on the driver. This is not a huge RDD and it's not

Re: Spark SQL lateral view explode doesn't work, and unable to save array types to Parquet

2015-03-27 Thread Jon Chase
(), which always assumes the underlying SQL array is represented by a Scala Seq. Would you mind to open a JIRA ticket for this? Thanks! Cheng On 3/27/15 7:00 PM, Jon Chase wrote: Spark 1.3.0 Two issues: a) I'm unable to get a lateral view explode query to work on an array type b) I'm

Spark SQL lateral view explode doesn't work, and unable to save array types to Parquet

2015-03-27 Thread Jon Chase
Spark 1.3.0 Two issues: a) I'm unable to get a lateral view explode query to work on an array type b) I'm unable to save an array type to a Parquet file I keep running into this: java.lang.ClassCastException: [I cannot be cast to scala.collection.Seq Here's a stack trace from the

Re: Spark SQL lateral view explode doesn't work, and unable to save array types to Parquet

2015-03-27 Thread Jon Chase
you mind to also provide the full stack trace of the exception thrown in the saveAsParquetFile call? Thanks! Cheng On 3/27/15 7:35 PM, Jon Chase wrote: https://issues.apache.org/jira/browse/SPARK-6570 I also left in the call to saveAsParquetFile(), as it produced a similar exception

Column not found in schema when querying partitioned table

2015-03-26 Thread Jon Chase
library for your platform... using builtin-java classes where applicable INFO org.apache.spark.SecurityManager Changing view acls to: jon INFO org.apache.spark.SecurityManager Changing modify acls to: jon INFO org.apache.spark.SecurityManager SecurityManager: authentication disabled; ui acls

Re: Column not found in schema when querying partitioned table

2015-03-26 Thread Jon Chase
I've filed this as https://issues.apache.org/jira/browse/SPARK-6554 On Thu, Mar 26, 2015 at 6:29 AM, Jon Chase jon.ch...@gmail.com wrote: Spark 1.3.0, Parquet I'm having trouble referencing partition columns in my queries. In the following example, 'probeTypeId' is a partition column

Spark SQL queries hang forever

2015-03-26 Thread Jon Chase
Spark 1.3.0 on YARN (Amazon EMR), cluster of 10 m3.2xlarge (8cpu, 30GB), executor memory 20GB, driver memory 10GB I'm using Spark SQL, mainly via spark-shell, to query 15GB of data spread out over roughly 2,000 Parquet files and my queries frequently hang. Simple queries like select count(*) from

Re: Registering custom UDAFs with HiveConetxt in SparkSQL, how?

2015-03-24 Thread Jon Chase
Shahab - This should do the trick until Hao's changes are out: sqlContext.sql("create temporary function foobar as 'com.myco.FoobarUDAF'"); sqlContext.sql("select foobar(some_column) from some_table"); This works without requiring you to 'deploy' a JAR with the UDAF in it - just make sure the UDAF

Re: How to broadcast a variable read from a file in yarn-cluster mode?

2015-02-10 Thread Jon Gregg
ideas? Jon 15/02/10 12:06:16 INFO MemoryStore: MemoryStore started with capacity 1177.8 MB. 15/02/10 12:06:16 INFO ConnectionManager: Bound socket to port 30129 with id = ConnectionManagerId(phd40010008.na.com,30129) 15/02/10 12:06:16 INFO BlockManagerMaster: Trying to register BlockManager 15/02

Re: How to broadcast a variable read from a file in yarn-cluster mode?

2015-02-10 Thread Jon Gregg
: Is the SparkContext you're using the same one that the StreamingContext wraps? If not, I don't think using two is supported. -Sandy On Tue, Feb 10, 2015 at 9:58 AM, Jon Gregg jonrgr...@gmail.com wrote: I'm still getting an error. Here's my code, which works successfully when tested using

Re: How to broadcast a variable read from a file in yarn-cluster mode?

2015-02-10 Thread Jon Gregg
wrote: You should be able to replace that second line with val sc = ssc.sparkContext On Tue, Feb 10, 2015 at 10:04 AM, Jon Gregg jonrgr...@gmail.com wrote: They're separate in my code, how can I combine them? Here's what I have: val sparkConf = new SparkConf() val ssc = new

Re: How to broadcast a variable read from a file in yarn-cluster mode?

2015-02-06 Thread Jon Gregg
OK I tried that, but how do I convert an RDD to a Set that I can then broadcast and cache? val badIPs = sc.textFile("hdfs:///user/jon/" + "badfullIPs.csv") val badIPsLines = badIPs.getLines val badIpSet = badIPsLines.toSet val badIPsBC = sc.broadcast(badIpSet) produces
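The fix the thread works toward: materialize the small file on the driver first, then broadcast (a sketch, reusing the names above):

    // RDDs have no getLines; collect the lines to the driver, then broadcast the Set.
    val badIpSet = sc.textFile("hdfs:///user/jon/badfullIPs.csv").collect().toSet
    val badIPsBC = sc.broadcast(badIpSet)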

Re: ReceiverInputDStream#saveAsTextFiles with a S3 URL results in double forward slash key names in S3

2014-12-23 Thread Jon Chase
I've had a lot of difficulties with using the s3:// prefix. s3n:// seems to work much better. Can't find the link ATM, but seems I recall that s3:// (Hadoop's original block format for s3) is no longer recommended for use. Amazon's EMR goes so far as to remap the s3:// to s3n:// behind the

Re: JavaRDD (Data Aggregation) based on key

2014-12-23 Thread Jon Chase
Have a look at RDD.groupBy(...) and reduceByKey(...) On Tue, Dec 23, 2014 at 4:47 AM, sachin Singh sachin.sha...@gmail.com wrote: Hi, I have a csv file having fields as a,b,c . I want to do aggregation(sum,average..) based on any field(a,b or c) as per user input, using Apache Spark Java
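A sketch of the reduceByKey route for that CSV, keying on field a and summing field c (the field choice is illustrative):

    val sums = sc.textFile("data.csv")
      .map(_.split(","))
      .map(f => (f(0), f(2).toDouble))   // (a, c)
      .reduceByKey(_ + _)                // sum of c per distinct a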

Re: S3 files , Spark job hungsup

2014-12-23 Thread Jon Chase
http://www.jets3t.org/toolkit/configuration.html Put the following properties in a file named jets3t.properties and make sure it is available during the running of your Spark job (just place it in ~/ and pass a reference to it when calling spark-submit with --files ~/jets3t.properties)
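The snippet cuts off before the property list; purely as an illustration (key names per the JetS3t configuration page linked above, values arbitrary), such a file might contain retry/timeout tuning like:

    httpclient.retry-max=10
    httpclient.connection-timeout-ms=60000
    httpclient.socket-timeout-ms=60000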

Re: Downloads from S3 exceedingly slow when running on spark-ec2

2014-12-23 Thread Jon Chase
/ On Thu, Dec 18, 2014 at 5:56 AM, Jon Chase jon.ch...@gmail.com wrote: I'm running a very simple Spark application that downloads files from S3, does a bit of mapping, then uploads new files. Each file is roughly 2MB and is gzip'd. I was running the same code on Amazon's EMR w/Spark

Re: Fetch Failure

2014-12-19 Thread Jon Chase
I'm getting the same error (ExecutorLostFailure) - input RDD is 100k small files (~2MB each). I do a simple map, then keyBy(), and then rdd.saveAsHadoopDataset(...). Depending on the memory settings given to spark-submit, the time before the first ExecutorLostFailure varies (more memory ==

Re: Fetch Failure

2014-12-19 Thread Jon Chase
time ago /Raf sandy.r...@cloudera.com wrote: Hi Jon, The fix for this is to increase spark.yarn.executor.memoryOverhead to something greater than its default of 384. This will increase the gap between the executor's heap size and what it requests from yarn. It's required because jvms

Re: Fetch Failure

2014-12-19 Thread Jon Chase
7392K, reserved 1048576K On Fri, Dec 19, 2014 at 11:16 AM, Jon Chase jon.ch...@gmail.com wrote: I'm actually already running 1.1.1. I also just tried --conf spark.yarn.executor.memoryOverhead=4096, but no luck. Still getting ExecutorLostFailure (executor lost). On Fri, Dec 19, 2014

Re: Fetch Failure

2014-12-19 Thread Jon Chase
Yes, same problem. On Fri, Dec 19, 2014 at 11:29 AM, Sandy Ryza sandy.r...@cloudera.com wrote: Do you hit the same errors? Is it now saying your containers are exceed ~10 GB? On Fri, Dec 19, 2014 at 11:16 AM, Jon Chase jon.ch...@gmail.com wrote: I'm actually already running 1.1.1. I

Yarn not running as many executors as I'd like

2014-12-19 Thread Jon Chase
Running on Amazon EMR w/Yarn and Spark 1.1.1, I have trouble getting Yarn to use the number of executors that I specify in spark-submit: --num-executors 2 in a cluster with two core nodes will typically only result in one executor running at a time. I can play with the memory settings and

Downloads from S3 exceedingly slow when running on spark-ec2

2014-12-18 Thread Jon Chase
I'm running a very simple Spark application that downloads files from S3, does a bit of mapping, then uploads new files. Each file is roughly 2MB and is gzip'd. I was running the same code on Amazon's EMR w/Spark and not having any download speed issues (Amazon's EMR provides a custom

Spark cluster with Java 8 using ./spark-ec2

2014-11-25 Thread Jon Chase
I'm trying to use the spark-ec2 command to launch a Spark cluster that runs Java 8, but so far I haven't been able to get the Spark processes to use the right JVM at start up. Here's the command I use for launching the cluster. Note I'm using the user-data feature to install Java 8: ./spark-ec2

Re: distributing Scala Map datatypes to RDD

2014-10-16 Thread Jon Massey
Wow, it really was that easy! The implicit joining works a treat. Many thanks, Jon On 13 October 2014 22:58, Stephen Boesch java...@gmail.com wrote: is the following what you are looking for? scala> sc.parallelize(myMap.map{ case (k,v) => (k,v) }.toSeq) res2: org.apache.spark.rdd.RDD

Re: how to get actual count from as long from JavaDStream ?

2014-09-30 Thread Jon Gregg
there is a foreachRDD() method that allows you to do things like: msgConvertedToDStream.foreachRDD(rdd => println("The count is: " + rdd.count().toInt)) The syntax for the casting should be changed for Java and probably the function argument syntax is wrong too, but hopefully there's enough there to help. Jon On Tue

Spark / YARN classpath issues

2014-05-22 Thread Jon Bender
in yarn.application.classpath or should the spark client take care of ensuring the necessary JARs are added during job submission? Any tips would be greatly appreciated! Cheers, Jon

Re: Spark / YARN classpath issues

2014-05-22 Thread Jon Bender
\ --master-memory 4g \ --worker-memory 2g \ --worker-cores 1 I'll make sure to use spark-submit from here on out. Thanks very much! Jon On Thu, May 22, 2014 at 12:40 PM, Andrew Or and...@databricks.com wrote: Hi Jon, Your configuration looks largely correct. I have very