Re: How to get db related metrics when use spark jdbc to read db table?

2024-04-08 Thread Femi Anthony
If you're using just Spark, you could try turning on the history server and gleaning statistics from there, but there is no single location or log file that stores them all. Databricks, which is a managed Spark solution, provides such
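A minimal PySpark sketch of turning on event logging so the history server has something to read; the log directory is an assumption:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("jdbc-read-with-metrics")
        .config("spark.eventLog.enabled", "true")            # record events for the history server
        .config("spark.eventLog.dir", "hdfs:///spark-logs")   # assumed log directory
        .getOrCreate()
    )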

Re: Kube estimate for Spark

2021-06-03 Thread Femi Anthony
I think he’s running Spark on Kubernetes, not YARN, as the cluster manager. Sent from my iPhone > On Jun 3, 2021, at 6:05 AM, Mich Talebzadeh wrote: > >  > Please provide the spark version, the environment you are running (on-prem, > cloud etc), state if you are running in YARN etc and your

Re: [External Sender] Memory issues in 3.0.2 but works well on 2.4.4

2021-05-21 Thread Femi Anthony
Post the stack trace and provide some more details about your configuration. On Fri, May 21, 2021 at 7:52 AM Praneeth Shishtla wrote: > Hi, > I have a simple DecisionForest model and was able to train the model on > pyspark==2.4.4 without any issues. > However, when I upgraded to pyspark==3.0.2,

Re: error , saving dataframe , LEGACY_PASS_PARTITION_BY_AS_OPTIONS

2019-11-13 Thread Femi Anthony
Can you post the line of code that’s resulting in that error along with the stack trace? Sent from my iPhone > On Nov 13, 2019, at 9:53 AM, asma zgolli wrote: > >  > Hello , > > I'm using spark 2.4.4 and i keep receiving this error message. Can you please > help me identify the problem?

PySpark with custom transformer project organization

2019-09-23 Thread Femi Anthony
I have a PySpark project that requires a custom ML Pipeline Transformer written in Scala. What is the best practice regarding project organization? Should I include the Scala files in the general Python project, or should they be in a separate repo? Opinions and suggestions welcome. Sent

Re: [External Sender] Execute Spark model without Spark

2019-08-22 Thread Femi Anthony
Hi, you can check out MLeap - https://github.com/combust/mleap But I must warn you - their support is minimal at best. Femi Sent from my iPhone On Aug 22, 2019, at 1:13 PM, Yeikel wrote: Hi , I have a GBTClassificationModel <

Pass row to UDF and select column based on pattern match

2019-07-09 Thread Femi Anthony
How can I achieve the following by passing a row to a UDF? val df1 = df.withColumn("col_Z", when($"col_x" === "a", $"col_A") .when($"col_x" === "b", $"col_B") .when($"col_x" === "c", $"col_C") .when($"col_x" === "d",
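For what it's worth, a PySpark sketch of the underlying idea (pass the whole row to the UDF as a struct and pick the column by name); the frame and the name mapping are assumptions mirroring the snippet above:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("row-to-udf").getOrCreate()
    df = spark.createDataFrame(
        [("a", "A1", "B1", "C1"), ("b", "A2", "B2", "C2")],
        ["col_x", "col_A", "col_B", "col_C"],
    )

    # The UDF receives the whole row as a struct and selects col_A/col_B/... by name.
    pick = F.udf(lambda row: row["col_" + row["col_x"].upper()], StringType())
    df1 = df.withColumn("col_Z", pick(F.struct(*df.columns)))
    df1.show()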

AWS EMR slow write to HDFS

2019-06-11 Thread Femi Anthony
I'm writing a large dataset in Parquet format to HDFS using Spark, and it runs rather slowly on EMR vs., say, Databricks. I realize that if I were able to use Hadoop 3.1, it would be much more performant because it has a high-performance output committer. Is this the case, and if so - when will

Re: Writing to multiple Kafka partitions from Spark

2019-05-28 Thread Femi Anthony
> Regards, > Snehasish > > >> On Fri, May 24, 2019 at 9:05 PM Femi Anthony wrote: >> >> >> I have Spark code that writes a batch to Kafka as specified here: >> >> https://spark.apache.org/docs/2.4.0/structured-streaming-kafka-integration.html

Writing to multiple Kafka partitions from Spark

2019-05-24 Thread Femi Anthony
I have Spark code that writes a batch to Kafka as specified here: https://spark.apache.org/docs/2.4.0/structured-streaming-kafka-integration.html The code looks like the following: df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") \ .write \ .format("kafka") \
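For reference, a self-contained sketch of that batch-to-Kafka pattern from the linked guide; the broker address and topic name are placeholders, and it assumes the spark-sql-kafka package is on the classpath:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka-batch-write").getOrCreate()
    df = spark.createDataFrame([("k1", "v1"), ("k2", "v2")], ["key", "value"])

    # Kafka expects string or binary key/value columns.
    df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") \
        .write \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "broker1:9092") \
        .option("topic", "my_topic") \
        .save()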

Re: [External Sender] How to use same SparkSession in another app?

2019-04-16 Thread Femi Anthony
Why not save the data frame to persistent storage (S3/HDFS) in the first application and read it back in the second? On Tue, Apr 16, 2019 at 8:58 PM Rishikesh Gawade wrote: > Hi. > I wish to use a SparkSession created by one app in another app so that i > can use the dataframes belonging to that
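A rough sketch of that hand-off; the shared path and the tiny frame are assumptions:

    from pyspark.sql import SparkSession

    # In the first application: persist the frame somewhere both apps can reach.
    spark1 = SparkSession.builder.appName("producer-app").getOrCreate()
    df = spark1.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])
    df.write.mode("overwrite").parquet("hdfs:///shared/handoff_table")

    # In the second application (its own SparkSession): read it back.
    spark2 = SparkSession.builder.appName("consumer-app").getOrCreate()
    df_back = spark2.read.parquet("hdfs:///shared/handoff_table")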

Spark Stateful Streaming - add counter column

2019-01-23 Thread Femi Anthony
I have a Spark Streaming process that consumes records off a Kafka topic, processes them, and sends them to a producer to publish on another topic. I would like to add a sequence number column that can be used to identify records that have the same key and be incremented for each duplicate
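A batch-style sketch of one way to number records that share a key with a window function; streaming state handling would differ, and the column names here are assumptions:

    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("dedup-counter").getOrCreate()
    df = spark.createDataFrame(
        [("k1", "2019-01-23 10:00"), ("k1", "2019-01-23 10:05"), ("k2", "2019-01-23 10:01")],
        ["key", "event_time"],
    )
    # Each duplicate of the same key gets an incrementing sequence number.
    w = Window.partitionBy("key").orderBy("event_time")
    df.withColumn("seq_no", F.row_number().over(w)).show()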

Re: [External Sender] Having access to spark results

2018-10-25 Thread Femi Anthony
What sort of environment are you running Spark on - in the cloud, on-premises? Is it a real-time or batch-oriented application? Please provide more details. Femi On Thu, Oct 25, 2018 at 3:29 AM Affan Syed wrote: > Spark users, > We really would want to get an input here about how the results

Re: [External Sender] Writing dataframe to vertica

2018-10-16 Thread Femi Anthony
How are you trying to write to Vertica? Can you provide some snippets of code? Femi On Tue, Oct 16, 2018 at 7:24 PM Nikhil Goyal wrote: > Hi guys, > > I am trying to write dataframe to vertica using spark. It seems like spark > is creating a temp table under public schema. I don't have

Re: [External Sender] Pyspark Window orderBy

2018-10-16 Thread Femi Anthony
I think that’s how it should behave. Did you try it out and see? On Tue, Oct 16, 2018 at 5:11 AM mhussain wrote: > Hi, > > I have a dataframe which looks like > > ++---+--++ > |group_id| id| text|type| > ++---+--++ > | 1| 1| one| a| > | 1| 1|
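A quick way to "try it out and see", reconstructing a few rows along the lines of the frame above; the extra rows and the ranking column are assumptions:

    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("window-orderby-check").getOrCreate()
    df = spark.createDataFrame(
        [(1, 1, "one", "a"), (1, 2, "two", "t"), (2, 2, "three", "a")],
        ["group_id", "id", "text", "type"],
    )
    # Order within each group and inspect what the window produces.
    w = Window.partitionBy("group_id").orderBy("id")
    df.withColumn("rn", F.row_number().over(w)).show()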

Re: [External Sender] How to debug Spark job

2018-09-07 Thread Femi Anthony
One way I would go about this would be to try running new_df.show(num_rows, truncate=False) on a few columns before you try writing to Parquet, to force computation of new_df and see whether the hanging is occurring at that point or during the write. You may also try a new_df.count().
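A hedged sketch of that debugging approach; the frame here is only a stand-in for the real new_df and the output path is an assumption:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("debug-before-write").getOrCreate()
    new_df = spark.range(1000).withColumnRenamed("id", "col_a")  # stand-in for the real frame

    # Force computation before the write to see whether the hang happens in the
    # transformations or only during the Parquet write itself.
    new_df.show(20, truncate=False)
    print(new_df.count())
    new_df.write.mode("overwrite").parquet("/tmp/debug_out")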

Re: [External Sender] Re: How to make pyspark use custom python?

2018-09-06 Thread Femi Anthony
Are you sure that pyarrow is deployed on your slave hosts? If not, you will either have to get it installed or ship it along when you call spark-submit by zipping it up and specifying the zipfile to be shipped using the --py-files zipfile.zip option. A quick check would be to ssh to a slave host,
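A sketch of the "ship it with the job" route done programmatically; addPyFile is the in-code analogue of --py-files, the zip path is an assumption, and packages with native extensions such as pyarrow may still need a real install on each host:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("ship-python-deps").getOrCreate()
    spark.sparkContext.addPyFile("/path/to/deps.zip")  # shipped to executors and made importable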

Re: Sparklyr and idle executors

2018-03-16 Thread Femi Anthony
I assume you're setting these values in spark-defaults.conf. What happens if you specify them directly to spark-submit, as in --conf spark.dynamicAllocation.enabled=true? On Thu, Mar 15, 2018 at 1:47 PM, Florian Dewes wrote: > Hi all, > > I am currently trying to enable
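The same settings expressed programmatically, only as an illustration since the thread is about sparklyr; the values are assumptions:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("dynamic-allocation-demo")
        .config("spark.dynamicAllocation.enabled", "true")
        .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
        .config("spark.shuffle.service.enabled", "true")  # typically required for dynamic allocation
        .getOrCreate()
    )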

Re: Insufficient memory for Java Runtime

2018-03-14 Thread Femi Anthony
Try specifying executor memory. On Tue, Mar 13, 2018 at 5:15 PM, Shiyuan wrote: > Hi Spark-Users, > I encountered the problem of "insufficient memory". The error is logged > in the file with a name " hs_err_pid86252.log"(attached in the end of this > email). > > I launched

Re: How to run spark shell using YARN

2018-03-14 Thread Femi Anthony
for > application_1521014458020_0003 (state: ACCEPTED) > > 18/03/14 09:30:15 INFO Client: Application report for > application_1521014458020_0003 (state: ACCEPTED) > > On Wed, Mar 14, 2018 at 2:03 AM, Femi Anthony <femib...@gmail.com> wrote: > >> Make sure you have e

Re: How to run spark shell using YARN

2018-03-14 Thread Femi Anthony
Make sure you have enough memory allocated for the Spark workers; try specifying executor memory by passing --executor-memory to spark-submit. On Wed, Mar 14, 2018 at 3:25 AM, kant kodali wrote: > I am using spark 2.3.0 and hadoop 2.7.3. > > Also I have done the following
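The equivalent setting expressed in code rather than on the spark-submit command line, as a sketch; the size is an assumption:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("yarn-memory-demo")
        .config("spark.executor.memory", "4g")   # same knob as --executor-memory 4g
        .getOrCreate()
    )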

Re: Spark Application stuck

2018-03-14 Thread Femi Anthony
Have you taken a look at the EMR UI? What does your Spark setup look like? I assume you're on EMR on AWS. The various UI URLs and ports are listed here: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-web-interfaces.html On Wed, Mar 14, 2018 at 4:23 AM, Mukund Big Data

Re: Job never finishing

2018-02-20 Thread Femi Anthony
You can use Spark speculation as a way to get around the problem. Here is a useful link: http://asyncified.io/2016/08/13/leveraging-spark-speculation-to-identify-and-re-schedule-slow-running-tasks/ Sent from my iPhone > On Feb 20, 2018, at 5:52 PM, Nikhil Goyal wrote: >
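A sketch of the speculation knobs the linked post discusses; the threshold values are assumptions:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("speculation-demo")
        .config("spark.speculation", "true")
        .config("spark.speculation.multiplier", "1.5")  # how much slower than the median before re-launching
        .config("spark.speculation.quantile", "0.9")    # fraction of tasks that must finish before speculating
        .getOrCreate()
    )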

Re: Multiple filters vs multiple conditions

2017-10-03 Thread Femi Anthony
I would assume that the optimizer would end up transforming both to the same expression. Femi Sent from my iPhone > On Oct 3, 2017, at 8:14 AM, Ahmed Mahmoud wrote: > > Hi All, > > Just a quick question from an optimisation point of view: > > Approach 1: > .filter (t->
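One way to check that assumption is to compare the physical plans of the two forms; a small sketch with made-up columns:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("filter-plan-compare").getOrCreate()
    df = spark.createDataFrame([(1, 2), (3, 4), (5, 6)], ["a", "b"])

    df.filter(F.col("a") > 1).filter(F.col("b") < 6).explain()    # two chained filters
    df.filter((F.col("a") > 1) & (F.col("b") < 6)).explain()      # one combined condition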

Re: Configuration for unit testing and sql.shuffle.partitions

2017-09-16 Thread Femi Anthony
How are you specifying it, as an option to spark-submit? On Sat, Sep 16, 2017 at 12:26 PM, Akhil Das wrote: > spark.sql.shuffle.partitions is still used I believe. I can see it in the > code >
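For reference, a sketch of how the setting is often applied to a test session; the local master and partition count are assumptions:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[2]")
        .appName("unit-tests")
        .config("spark.sql.shuffle.partitions", "4")  # keep test shuffles small and fast
        .getOrCreate()
    )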

Re: spark.write.csv is not able write files to specified path, but is writing to unintended subfolder _temporary/0/task_xxx folder on worker nodes

2017-08-10 Thread Femi Anthony
I know spark.write.csv works best with HDFS, but with the current setup I > have in my environment, I have to deal with spark write to node’s local file > system and not to HDFS. > > Regards, > Hemanth > > From: Femi Anthony <femib...@gmail.com> > Date:

Re: spark.write.csv is not able write files to specified path, but is writing to unintended subfolder _temporary/0/task_xxx folder on worker nodes

2017-08-10 Thread Femi Anthony
is created on master node, but the problem of _temporary is > noticed only on worker nodes. > > I know spark.write.csv works best with HDFS, but with the current setup I > have in my environment, I have to deal with spark write to node’s local file > system and not to HDFS. >

Re: spark.write.csv is not able write files to specified path, but is writing to unintended subfolder _temporary/0/task_xxx folder on worker nodes

2017-08-10 Thread Femi Anthony
Normally the _temporary directory gets deleted as part of the cleanup when the write is complete and a _SUCCESS file is created. I suspect that the writes are not properly completed. How are you specifying the write? Any error messages in the logs? On Thu, Aug 10, 2017 at 3:17 AM, Hemanth

Re: using spark to load a data warehouse in real time

2017-02-28 Thread Femi Anthony
work. Another approach may be for Spark Streaming to write to Kafka, and then have another process read from Kafka and write to Greenplum. Kafka Connect may be useful in this case - https://www.confluent.io/blog/announcing-kafka-connect-building-large-scale-low-latency-data-pipelines/ Femi Anthony
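A hedged Structured Streaming sketch of the "write the stream to Kafka first" idea; the source, broker, topic, and checkpoint path are all assumptions, and the Kafka sink package must be on the classpath:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("stream-to-kafka").getOrCreate()

    events = spark.readStream.format("rate").load()  # stand-in streaming source
    query = (
        events.selectExpr("CAST(value AS STRING) AS key", "CAST(timestamp AS STRING) AS value")
        .writeStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")
        .option("topic", "warehouse_feed")
        .option("checkpointLocation", "/tmp/checkpoints/warehouse_feed")
        .start()
    )
    query.awaitTermination()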

Re: Run spark machine learning example on Yarn failed

2017-02-28 Thread Femi Anthony
Have you tried specifying an absolute instead of a relative path? Femi > On Feb 27, 2017, at 8:18 PM, Yunjie Ji wrote: > > After start the dfs, yarn and spark, I run these code under the root > directory of spark on my master host: > `MASTER=yarn ./bin/run-example

Re: Get S3 Parquet File

2017-02-27 Thread Femi Anthony
Ok, thanks a lot for the heads up. Sent from my iPhone > On Feb 25, 2017, at 10:58 AM, Steve Loughran <ste...@hortonworks.com> wrote: > > >> On 24 Feb 2017, at 07:47, Femi Anthony <femib...@gmail.com> wrote: >> >> Have you tried reading using s3n w

Re: Get S3 Parquet File

2017-02-23 Thread Femi Anthony
Have you tried reading using s3n, which is a slightly older protocol? I'm not sure how compatible s3a is with older versions of Spark. Femi On Fri, Feb 24, 2017 at 2:18 AM, Benjamin Kim wrote: > Hi Gourav, > > My answers are below. > > Cheers, > Ben > > > On Feb 23, 2017,

Re: Reading csv files with quoted fields containing embedded commas

2016-11-06 Thread Femi Anthony
wrote: > Hi Femi, > > Have you maybe tried the quote-related options specified in the > documentation? > > http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.csv > > Thanks. > > 2016-11-06 6:58 GMT+09:00 Femi Anthony <

Reading csv files with quoted fields containing embedded commas

2016-11-05 Thread Femi Anthony
Hi, I am trying to process a very large comma-delimited csv file and I am running into problems. The main problem is that some fields contain quoted strings with embedded commas. It seems as if PySpark is unable to properly parse lines containing such fields the way, say, Pandas does. Here is the code
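For reference, a sketch of the read options that usually handle quoted fields with embedded commas; the file path and header setting are assumptions:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("csv-embedded-commas").getOrCreate()
    df = (
        spark.read
        .option("header", "true")
        .option("quote", '"')   # fields are wrapped in double quotes
        .option("escape", '"')  # a doubled quote inside a field is treated as an escaped quote
        .csv("/path/to/large_file.csv")
    )
    df.show(5, truncate=False)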

Re: My notes on Spark Performance & Tuning Guide

2016-05-17 Thread Femi Anthony
Please send it to me as well. Thanks Sent from my iPhone > On May 17, 2016, at 12:09 PM, Raghavendra Pandey > wrote: > > Can you please send me as well. > > Thanks > Raghav > >> On 12 May 2016 20:02, "Tom Ellis" wrote: >> I would like to

Re: Timeout when submitting an application to a remote Spark Standalone master

2016-04-29 Thread Femi Anthony
Have you tried connecting to port 7077 on the cluster from your local machine to see if it works OK? Sent from my iPhone > On Apr 29, 2016, at 5:58 PM, Richard Han wrote: > > I have an EC2 installation of Spark Standalone Master/Worker set up. The two > can talk to

Re: transformation - spark vs cassandra

2016-03-31 Thread Femi Anthony
Try it out on a smaller subset of data and see which gives the better performance. On Thu, Mar 31, 2016 at 12:11 PM, Arun Sethia wrote: > Thanks Imre. > > But I thought spark-cassandra driver is going to do same internally. > > On Thu, Mar 31, 2016 at 10:32 AM, Imre Nagi

Re: confusing about Spark SQL json format

2016-03-31 Thread Femi Anthony
I encountered a similar problem reading multi-line JSON files into Spark a while back, and here's an article I wrote about how to solve it: http://searchdatascience.com/spark-adventures-1-processing-multi-line-json-files/ You may find it useful. Femi On Thu, Mar 31, 2016 at 12:32 PM,
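The article predates it, but newer Spark versions (2.2+) also expose a multiLine option for exactly this case; a minimal sketch with an assumed path:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("multiline-json").getOrCreate()
    df = spark.read.option("multiLine", "true").json("/path/to/records.json")
    df.printSchema()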

Re: How to design the input source of spark stream

2016-03-31 Thread Femi Anthony
Also, ssc.textFileStream(dataDir) will read all the files from a directory, so as far as I can see there's no need to merge the files. Just write them to the same HDFS directory. On Thu, Mar 31, 2016 at 8:04 AM, Femi Anthony <femib...@gmail.com> wrote: > I don't think you need to do it
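A minimal sketch of that point: textFileStream watches one directory and picks up each new file as it arrives, so no merging is needed; the path and batch interval are assumptions:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="dir-stream")
    ssc = StreamingContext(sc, 10)

    lines = ssc.textFileStream("hdfs:///incoming/logs")  # any new file dropped here joins the stream
    lines.count().pprint()

    ssc.start()
    ssc.awaitTermination()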

Re: How to design the input source of spark stream

2016-03-31 Thread Femi Anthony
I don't think you need to do it this way. Take a look here: http://spark.apache.org/docs/latest/streaming-programming-guide.html in the section "Level of Parallelism in Data Receiving": Receiving multiple data streams can therefore be achieved by creating multiple input DStreams and configuring
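A sketch of the "multiple input DStreams" idea from that section of the guide, several receivers unioned into one stream; the hosts and ports are assumptions:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="parallel-receive")
    ssc = StreamingContext(sc, 10)

    # One receiver per source, then union them into a single DStream.
    streams = [ssc.socketTextStream("localhost", port) for port in (9999, 10000)]
    unioned = ssc.union(*streams)
    unioned.count().pprint()

    ssc.start()
    ssc.awaitTermination()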

Spark Streaming - graceful shutdown when stream has no more data

2016-02-23 Thread Femi Anthony
I am working on Spark Streaming API and I wish to stream a set of pre-downloaded web log files continuously to simulate a real-time stream. I wrote a script that gunzips the compressed logs and pipes the output to nc on port . The script looks like this:
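A hedged sketch of stopping such a stream gracefully once the replayed files run out; the socket source, the port, and the stop trigger are assumptions rather than the script described above:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="graceful-shutdown-demo")
    ssc = StreamingContext(sc, 5)

    lines = ssc.socketTextStream("localhost", 9999)  # assumed port fed by the gunzip/nc script
    lines.pprint()

    ssc.start()
    # ... some external signal decides the replayed data is exhausted ...
    ssc.stop(True, True)  # also stop the SparkContext, and let in-flight batches finish first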

Re: Appending filename information to RDD initialized by sc.textFile

2016-01-20 Thread Femi Anthony
e/src/main/scala/org/apache/spark/input/WholeTextFileInputFormat.scala> > - WholeTextFileRecordReader > <https://github.com/apache/spark/blob/7a375bb87a8df56d9dde0c484e725e5c497a9876/core/src/main/scala/org/apache/spark/input/WholeTextFileRecordReader.scala> > > > > > > Thanks

Appending filename information to RDD initialized by sc.textFile

2016-01-19 Thread Femi Anthony
I have a set of log files I would like to read into an RDD. These files are all compressed .gz and the filenames are date-stamped. The source of these files is the page view statistics data for wikipedia http://dumps.wikimedia.org/other/pagecounts-raw/ The file names look like this:
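Two hedged ways to keep the source filename alongside the data; the glob path mirrors the pagecounts naming but is still an assumption:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("filename-tagging").getOrCreate()

    # RDD route: wholeTextFiles yields (filename, file_contents) pairs.
    pairs = spark.sparkContext.wholeTextFiles("hdfs:///pagecounts/pagecounts-*.gz")

    # DataFrame route: input_file_name() tags every row with the file it came from.
    df = (
        spark.read.text("hdfs:///pagecounts/pagecounts-*.gz")
        .withColumn("source_file", F.input_file_name())
    )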

Re: Spark Cassandra Java Connector: records missing despite consistency=ALL

2016-01-19 Thread Femi Anthony
So is the logging to Cassandra being done via Spark? On Wed, Jan 13, 2016 at 7:17 AM, Dennis Birkholz wrote: > Hi together, > > we use Cassandra to log event data and process it every 15 minutes with Spark. > We are using the Cassandra Java Connector for Spark. > > Randomly

Re: pyspark: calculating row deltas

2016-01-10 Thread Femi Anthony
Can you clarify what you mean with an actual example? For example, if your data frame looks like this: ID Year Value 1 2012 100 2 2013 101 3 2014 102 What's your desired output? Femi On Sat, Jan 9, 2016 at 4:55 PM, Franc Carter wrote: > > Hi, > >
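If the goal is the difference between consecutive rows, a lag window is one hedged way to do it, reusing the toy frame from the reply; ordering by Year is an assumption about the intent:

    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("row-deltas").getOrCreate()
    df = spark.createDataFrame(
        [(1, 2012, 100), (2, 2013, 101), (3, 2014, 102)], ["ID", "Year", "Value"]
    )
    # Each row's delta is its Value minus the previous row's Value.
    w = Window.orderBy("Year")
    df.withColumn("delta", F.col("Value") - F.lag("Value").over(w)).show()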