Re: Maximum executors in EC2 Machine

2023-10-24 Thread Riccardo Ferrari
Hi, I would refer to their documentation to better understand the concepts behind cluster overview and submitting applications: - https://spark.apache.org/docs/latest/cluster-overview.html#cluster-manager-types - https://spark.apache.org/docs/latest/submitting-applications.html When

Re: Pyspark How to groupBy -> fit

2021-01-21 Thread Riccardo Ferrari
d the subset of data > that it's meant to model separately. > > A pandas UDF is a fine solution here, because I assume that implies your > groups aren't that big, so, maybe no need for a Spark pipeline. > > > On Thu, Jan 21, 2021 at 9:20 AM Riccardo Ferrari > wro

Pyspark How to groupBy -> fit

2021-01-21 Thread Riccardo Ferrari
Hi list, I am looking for an efficient solution to apply a training pipeline to each group of a DataFrame.groupBy. This is very easy if you're using a pandas udf (i.e. groupBy().apply()), but I am not able to find the equivalent for a Spark pipeline. The ultimate goal is to fit multiple models, one
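A minimal sketch of the pandas-UDF route discussed in the thread, assuming Spark 3.x (groupBy().applyInPandas) with pandas and scikit-learn installed on every worker; the column names and the model are illustrative only:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("a", 1.0, 2.0), ("a", 2.0, 4.1), ("b", 1.0, 3.0), ("b", 2.0, 5.9)],
    ["group", "x", "y"])

def fit_per_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # Each group arrives as a plain pandas DataFrame, so any local library can fit it.
    model = LinearRegression().fit(pdf[["x"]], pdf["y"])
    return pd.DataFrame({"group": [pdf["group"].iloc[0]],
                         "coef": [float(model.coef_[0])]})

df.groupBy("group").applyInPandas(
    fit_per_group, schema="group string, coef double").show()
```

As the reply in this thread notes, this works when each group fits comfortably in the memory of a single worker.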

Re: PyCharm, Running spark-submit calling jars and a package at run time

2021-01-08 Thread Riccardo Ferrari
I think Spark checks the Python path env variable; you need to provide that. Of course that works in local mode only. On Fri, Jan 8, 2021, 5:28 PM Sean Owen wrote: > I don't see anywhere that you provide 'sparkstuff'? how would the Spark > app have this code otherwise? > > On Fri, Jan 8, 2021 at

Re: PyCharm, Running spark-submit calling jars and a package at run time

2021-01-08 Thread Riccardo Ferrari
You need to provide your Python dependencies as well. See http://spark.apache.org/docs/latest/submitting-applications.html and look for --py-files. HTH. On Fri, Jan 8, 2021 at 3:13 PM Mich Talebzadeh wrote: > Hi, > > I have a module in Pycharm which reads data stored in a Bigquery table and > does
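A hedged illustration of the --py-files route the reply points at; the archive and jar names are made up, while 'sparkstuff' is the package mentioned later in the thread:

```bash
# Package the local Python modules and ship them together with the job
zip -r deps.zip sparkstuff/

spark-submit \
  --master spark://master-host:7077 \
  --py-files deps.zip \
  --jars extra-connector.jar \
  main.py
```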

Re: Compute the Hash of each row in new column

2020-02-28 Thread Riccardo Ferrari
Hi Chetan, Would the sql function `hash` do the trick for your use-case? Best, On Fri, Feb 28, 2020 at 1:56 PM Chetan Khatri wrote: > Hi Spark Users, > How can I compute Hash of each row and store in new column at Dataframe, > could someone help me. > > Thanks >
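For reference, a minimal sketch of the built-in `hash` function the reply refers to; the column names are illustrative, and `sha2` is shown as a common alternative when a longer digest is wanted:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
cols = df.columns

# F.hash() computes a 32-bit Murmur3 hash over the given columns
hashed = df.withColumn("row_hash", F.hash(*cols))
# Alternative: SHA-256 over the concatenated (stringified) columns
hashed = hashed.withColumn(
    "row_sha",
    F.sha2(F.concat_ws("|", *[F.col(c).cast("string") for c in cols]), 256))
hashed.show()
```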

Cluster sizing

2019-09-13 Thread Riccardo Ferrari
Hi list, Is there any documentation about how to approach cluster sizing? How do you approach a new deployment? Thanks,

Re: spark standalone mode problem about executor add and removed again and again!

2019-07-18 Thread Riccardo Ferrari
I would also check firewall rules. Is communication allowed on all the required port ranges and hosts? On Thu, Jul 18, 2019 at 3:56 AM Amit Sharma wrote: > Do you have dynamic resource allocation enabled? > > > On Wednesday, July 17, 2019, zenglong chen > wrote: > >> Hi,all, >> My

Re: best docker image to use

2019-06-11 Thread Riccardo Ferrari
Hi Marcelo, I'm used to working with https://github.com/jupyter/docker-stacks. There's a Scala+Jupyter option too, though there might be better options with Zeppelin as well. Hth. On Tue, 11 Jun 2019, 11:52 Marcelo Valle, wrote: > Hi, > > I would like to run spark shell + scala on a docker

Re: Spark standalone and Pandas UDF from custom archive

2019-05-25 Thread Riccardo Ferrari
our job modules and associated library code into a > single zip file (jobs.py). You can pass this zip file to spark submit via > pyfiles and use the generic launcher as the driver: > > Spark-submit —py-files jobs.py driver.py job-module > > > > > > On 25 May 2019, at 10:

Re: Spark standalone and Pandas UDF from custom archive

2019-05-25 Thread Riccardo Ferrari
Riccardo Ferrari wrote: > Hi list, > > I am having an issue distributing a pandas_udf to my workers. > I'm using Spark 2.4.1 in standalone mode. > *Submit*: > >- via SparkLauncher as separate process. I do add the py-files with >the self-executable zip (with .py ext

Spark standalone and Pandas UDF from custom archive

2019-05-24 Thread Riccardo Ferrari
Hi list, I am having an issue distributing a pandas_udf to my workers. I'm using Spark 2.4.1 in standalone mode. *Submit*: - via SparkLauncher as a separate process. I do add the py-files with the self-executable zip (with .py extension) before launching the application. - The whole

Re: Deep Learning with Spark, what is your experience?

2019-05-05 Thread Riccardo Ferrari
t of the pipeline (afaik) so it is >>>> limited to data ingestion and transforms (ETL). It therefore is optional >>>> and other ETL options might be better for you. >>>> >>>> Most of the technologies @Gourav mentions have their own scaling based >>

Re: Deep Learning with Spark, what is your experience?

2019-05-04 Thread Riccardo Ferrari
Sengupta > > Date: May 4, 2019 at 10:24:29 AM > To: Riccardo Ferrari > Cc: User > Subject: Re: Deep Learning with Spark, what is your experience? > > Try using MxNet and Horovod directly as well (I think that MXNet is worth > a try as well): > 1. > https://medi

Deep Learning with Spark, what is your experience?

2019-05-04 Thread Riccardo Ferrari
Hi list, I am trying to understand if it makes sense to leverage Spark as an enabling platform for Deep Learning. My open questions to you are: - Do you use Apache Spark in your DL pipelines? - How do you use Spark for DL? Is it just a stand-alone stage in the workflow (i.e. data preparation

PySpark OOM when running PCA

2019-02-07 Thread Riccardo Ferrari
Hi list, I am having trouble running a PCA with pyspark. I am trying to reduce a matrix size since my features after OHE get 40k wide. Spark 2.2.0 stand-alone (Oracle JVM), pyspark 2.2.0 from a docker (OpenJDK). I'm starting the spark session from the notebook, however I make sure to: -
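A hedged sketch of the step in question; k, the memory value and column names are illustrative, and df is assumed to hold an ml Vector column. The OOM with very wide features typically comes from the covariance/Gramian matrix (feature-width squared) being materialised on the driver, so driver memory is the first thing to raise:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import PCA

# Note: spark.driver.memory only takes effect if set before the driver JVM starts;
# from an already-running notebook kernel it belongs in spark-defaults.conf or the
# submit arguments instead.
spark = (SparkSession.builder
         .config("spark.driver.memory", "12g")
         .getOrCreate())

pca = PCA(k=100, inputCol="features_ohe", outputCol="features_pca")
model = pca.fit(df)
reduced = model.transform(df)
```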

Re: I have trained a ML model, now what?

2019-01-23 Thread Riccardo Ferrari
like Uber trying to make Open Source better, thanks! On Tue, Jan 22, 2019 at 5:24 PM Felix Cheung wrote: > About deployment/serving > > SPIP > https://issues.apache.org/jira/browse/SPARK-26247 > > > -- > *From:* Riccardo Ferrari > *Sent:*

Re: How to query on Cassandra and load results in Spark dataframe

2019-01-23 Thread Riccardo Ferrari
Hi Soheil, You should be able to apply some filter transformation. Spark is lazily evaluated and the actual loading from Cassandra happens only when an action triggers it. Find more here: https://spark.apache.org/docs/2.3.2/rdd-programming-guide.html#rdd-operations The Spark Cassandra supports
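A minimal sketch of the lazy-filter pattern described above, assuming the spark-cassandra-connector is on the classpath and spark is the active SparkSession; keyspace, table and column names are made up:

```python
df = (spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(keyspace="my_keyspace", table="events")
      .load())

# Nothing is read from Cassandra yet; the filter is applied lazily and, where the
# connector supports it, pushed down to Cassandra when an action finally runs.
subset = df.filter("user_id = 42 AND event_date >= '2019-01-01'")
subset.show()
```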

I have trained a ML model, now what?

2019-01-22 Thread Riccardo Ferrari
Hi list! I am writing here to hear about your experience on putting Spark ML models into production at scale. I know it is a very broad topic with many different faces depending on the use-case, requirements, user base and whatever is involved in the task. Still, I'd like to open a thread about

Re: Connection issue with AWS S3 from PySpark 2.3.1

2018-12-21 Thread Riccardo Ferrari
Hi Aakash, Can you share how you are adding those jars? Are you using the packages method? I assume you're running in a cluster, and those dependencies might not have been properly distributed. How are you submitting your app? What kind of resource manager are you using: standalone, yarn, ...? Best,
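As an illustration of the packages route the reply asks about, a hedged sketch; the hadoop-aws version below is only an example and must match the cluster's Hadoop build, and the bucket name is made up:

```python
from pyspark.sql import SparkSession

# spark.jars.packages resolves the dependency from Maven and ships it to every
# executor, which avoids the jars being present on the driver only.
spark = (SparkSession.builder
         .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.3")
         .getOrCreate())

df = spark.read.text("s3a://my-bucket/some/prefix/")
```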

Re: Pyspark Partitioning

2018-09-30 Thread Riccardo Ferrari
Hi Dimitris, I believe the methods partitionBy and mapPartitions are specific to RDDs while you're talking about DataFrames
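A small sketch of the distinction being made: partitionBy and mapPartitions belong to the RDD API, while a DataFrame uses repartition; the data and column names are illustrative and spark is the active SparkSession:

```python
sc = spark.sparkContext

# RDD API: partitionBy works on key/value pairs, mapPartitions on whole partitions
rdd = sc.parallelize([(1, "a"), (2, "b"), (3, "c"), (4, "d")])
partition_sizes = (rdd.partitionBy(2)
                      .mapPartitions(lambda it: [sum(1 for _ in it)])
                      .collect())

# DataFrame API: the closest equivalent is repartition on a column
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df2 = df.repartition(2, "id")
```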

Re: jar file problem

2017-10-19 Thread Riccardo Ferrari
This is a good place to start from: https://spark.apache.org/docs/latest/submitting-applications.html Best, On Thu, Oct 19, 2017 at 5:24 PM, Uğur Sopaoğlu wrote: > Hello, > > I have a very easy problem. How I run a spark job, I must copy jar file to > all worker nodes. Is

Re: [Timer-0:WARN] Logging$class: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

2017-09-18 Thread Riccardo Ferrari
Hi Jean, What does the master UI say? http://10.0.100.81:8080 Do you have enough resources available, or is there any running context that is depleting all your resources? Are your workers registered and alive? How much memory each? How many cores each? Best On Mon, Sep 18, 2017 at 11:24

Re: How to convert Row to JSON in Java?

2017-09-11 Thread Riccardo Ferrari
ow.schema().fieldNames(); > Seq fieldNamesSeq = > JavaConverters.asScalaIteratorConverter(Arrays.asList(fieldNames).iterator()).asScala().toSeq(); > String json = row.getValuesMap(fieldNamesSeq).toString(); > > > On Mon, Sep 11, 2017 at 12:39 AM, Riccardo Ferrari <ferra...@gmail.com> > wrote: &g

Re: How to convert Row to JSON in Java?

2017-09-11 Thread Riccardo Ferrari
* > >>> df1.take(10) > ['{"id": 1}', '{"id": 2}', '{"id": 3}', '{"id": 4}', '{"id": 5}'] > > > > > On Mon, Sep 11, 2017 at 4:22 AM, Riccardo Ferrari <ferra...@gmail.com> > wrote: > >> Hi Kant, >> >

Re: How to convert Row to JSON in Java?

2017-09-10 Thread Riccardo Ferrari
Hi Kant, You can check the getValuesMap. I found this post useful; it is in Scala but should be a good starting point. An alternative

Re: Problem with CSV line break data in PySpark 2.1.0

2017-09-03 Thread Riccardo Ferrari
Hi Aakash, What I see in the picture seems correct. Spark (pyspark) is reading your F2 cell as multi-line text. Where are the nulls you're referring to? You might find the pyspark.sql.functions.regexp_replace
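A hedged sketch of the regexp_replace approach hinted at above (plus, on Spark 2.2+, the multiLine reader option); the column name F2 follows the thread, the file path is made up:

```python
from pyspark.sql import functions as F

# Spark 2.2+ can parse quoted multi-line cells directly:
df = spark.read.csv("data.csv", header=True, multiLine=True, quote='"', escape='"')

# If the embedded line breaks are unwanted, strip them afterwards:
cleaned = df.withColumn("F2", F.regexp_replace("F2", "[\\r\\n]+", " "))
```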

Re: PySpark, Structured Streaming and Kafka

2017-08-23 Thread Riccardo Ferrari
Hi Brian, Very nice work you have done! WRT your issue: can you clarify how you are adding the Kafka dependency when using Jupyter? The ClassNotFoundException really tells you about the missing dependency. A bit different is the IllegalArgumentException error, that is simply because you are not
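One common way to add that dependency from a notebook, as a hedged sketch; the package coordinates are only an example and must match the Spark/Scala versions in use, and the environment variable must be set before the SparkSession is created:

```python
import os

# Example coordinates only; adjust to your Spark and Scala versions.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0 pyspark-shell"
)

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()   # picks up the --packages argument above
```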

Re: Is there an operation to create multi record for every element in a RDD?

2017-08-09 Thread Riccardo Ferrari
Depends on your Spark version, have you considered the Dataset API? You can do something like:
val df1 = rdd1.toDF("userid")
val listRDD = sc.parallelize(listForRule77)
val listDF = listRDD.toDF("data")
df1.crossJoin(listDF).orderBy("userid").show(60, truncate=false)

Re: Spark Streaming job statistics

2017-08-08 Thread Riccardo Ferrari
Hi, Have you tried to check the "Streaming" tab menu? Best, On Tue, Aug 8, 2017 at 4:15 PM, KhajaAsmath Mohammed < mdkhajaasm...@gmail.com> wrote: > Hi, > > I am running spark streaming job which receives data from azure iot hub. I > am not sure if the connection was successful and receving

PySpark Streaming keeps dying

2017-08-05 Thread Riccardo Ferrari
Hi list, I have Spark 2.2.0 in standalone mode and Python 3.6. It is a very small testing cluster with two nodes. I am running (trying to run) a streaming job that simply reads from Kafka, applies an ML model and stores the result back into Kafka. The job is run with the following parameters: "--conf spark.cores.max=2

Re: PySpark Streaming S3 checkpointing

2017-08-03 Thread Riccardo Ferrari
context is ignored (or at least this happens in python). What do you (or anyone on the list) think? Thanks, On Wed, Aug 2, 2017 at 6:04 PM, Steve Loughran <ste...@hortonworks.com> wrote: > > On 2 Aug 2017, at 10:34, Riccardo Ferrari <ferra...@gmail.com> wrote: > > Hi list! &

PySpark Streaming S3 checkpointing

2017-08-02 Thread Riccardo Ferrari
Hi list! I am working on a pyspark streaming job (ver 2.2.0) and I need to enable checkpointing. At a high level my Python script goes like this:
class StreamingJob():
    def __init__(..):
        ...
        sparkContext._jsc.hadoopConfiguration().set('fs.s3a.access.key',)
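A hedged sketch of the checkpoint-aware setup this thread revolves around; the bucket, credentials and batch interval are placeholders. Note that when a valid checkpoint already exists, StreamingContext.getOrCreate rebuilds the context from it and skips the setup function, which is the behaviour discussed in the follow-up message of this thread:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

CHECKPOINT = "s3a://my-bucket/checkpoints/my-job"   # placeholder path

def create_context():
    sc = SparkContext(appName="streaming-job")
    sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", "...")
    sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "...")
    ssc = StreamingContext(sc, 60)
    ssc.checkpoint(CHECKPOINT)
    # ... build the DStream graph here ...
    return ssc

# Only calls create_context() when no usable checkpoint is found at CHECKPOINT.
ssc = StreamingContext.getOrCreate(CHECKPOINT, create_context)
ssc.start()
ssc.awaitTermination()
```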

Re: SPARK Issue in Standalone cluster

2017-07-31 Thread Riccardo Ferrari
Hi Gourav, The issue here is the location you're trying to write to/read from: /Users/gouravsengupta/Development/spark/sparkdata/test1/p... When dealing with clusters, all the paths and resources should be available to all executors (and the driver), and that is the reason why you generally use HDFS,

Re: Logging in RDD mapToPair of Java Spark application

2017-07-29 Thread Riccardo Ferrari
Hi John, The reason you don't see the second sysout line is that it is executed on a different JVM (i.e. Driver vs Executor). The second sysout line should be available through the executor logs. Check the Executors tab. There are alternative approaches to manage log centralization, however it

Re: Flatten JSON to multiple columns in Spark

2017-07-18 Thread Riccardo Ferrari
What's against df.rdd.map(...) or dataset.foreach()? https://spark.apache.org/docs/2.0.1/api/scala/index.html#org.apache.spark.sql.Dataset@foreach(f:T=>Unit):Unit Best, On Tue, Jul 18, 2017 at 6:46 PM, lucas.g...@gmail.com wrote: > I've been wondering about this for

Re: Spark UI crashes on Large Workloads

2017-07-18 Thread Riccardo Ferrari
at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) > > > > I noticed this issue talks about something similar and I guess is related: > https://issues.apache.org/jira/browse/SPARK-18838. > > On Tue, Jul 18, 2017 at 2:49 AM, Riccardo Ferrari <ferra...@gmai

Re: Spark UI crashes on Large Workloads

2017-07-18 Thread Riccardo Ferrari
Hi, can you share more details? Do you have any exceptions from the driver? Or the executors? Best, On Jul 18, 2017 02:49, "saatvikshah1994" wrote: > Hi, > > I have a pyspark App which when provided a huge amount of data as input > throws the error explained here

Re: Testing another Dataset after ML training

2017-07-12 Thread Riccardo Ferrari
or Hadron Physics > Experimental Hadron Structure (IKP-1) > www.fz-juelich.de/ikp > > On Jul 12, 2017, at 09:56, Riccardo Ferrari <ferra...@gmail.com> wrote: > > Hi Michael, > > I don't see any attachment, not sure you can attach files though > > On Tue, Jul 11, 2017 at 10

Re: Testing another Dataset after ML training

2017-07-12 Thread Riccardo Ferrari
- > Michael C. Kunkel, USMC, PhD > Forschungszentrum Jülich > Nuclear Physics Institute and Juelich Center for Hadron Physics > Experimental Hadron Structure (IKP-1)www.fz-juelich.de/ikp > > On 11/07/2017 17:21, Riccardo Ferrari wrote: > > Mh, to me feels

Re: Testing another Dataset after ML training

2017-07-11 Thread Riccardo Ferrari
Mh, to me it feels like there is some data mismatch. Are you sure you're using the expected Vector (ml vs mllib)? I am not sure you attached the whole Exception, but you might find some more useful details there. Best, On Tue, Jul 11, 2017 at 3:07 PM, mckunkel wrote: > Im not
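For reference, a tiny illustration of the ml-vs-mllib distinction mentioned here (shown in Python, though the thread itself is Java/Scala): DataFrame-based estimators expect the newer ml Vector type, not the older mllib one:

```python
from pyspark.ml.linalg import Vectors as ml_vectors       # for pyspark.ml / DataFrame-based API
from pyspark.mllib.linalg import Vectors as mllib_vectors # for the older RDD-based API

ml_vec = ml_vectors.dense([1.0, 2.0])
mllib_vec = mllib_vectors.dense([1.0, 2.0])
# Feeding mllib vectors to a DataFrame-based Estimator typically fails with a
# schema/type mismatch similar to the one discussed in this thread.
```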

Re: Testing another Dataset after ML training

2017-07-11 Thread Riccardo Ferrari
Hi, Are you sure you're feeding the correct data format? I found this conversation that might be useful: http://apache-spark-user-list.1001560.n3.nabble.com/Description-of-data-file-sample-libsvm-data-txt-td25832.html Best, On Tue, Jul 11, 2017 at 1:42 PM, mckunkel

PySpark saving custom pipelines

2017-07-09 Thread Riccardo Ferrari
Hi list, I have developed some custom Transformer/Estimators in python around some libraries (scipy), now I would love to persist them for reuse in a streaming app. I am currently aware of this: https://issues.apache.org/jira/browse/SPARK-17025 However I would like to hear from experienced
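The JIRA above tracks native persistence for Python-only pipeline stages. As a hedged sketch of what later Spark versions (2.3+) offer via DefaultParamsReadable/DefaultParamsWritable; the transformer below is a trivial placeholder, not the scipy-based code from the original message:

```python
from pyspark.ml import Transformer
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
from pyspark.sql import functions as F

class AddOne(Transformer, HasInputCol, HasOutputCol,
             DefaultParamsReadable, DefaultParamsWritable):
    """Placeholder custom transformer: adds 1 to the input column."""

    def __init__(self, inputCol=None, outputCol=None):
        super().__init__()
        self._set(inputCol=inputCol, outputCol=outputCol)

    def _transform(self, df):
        return df.withColumn(self.getOutputCol(), F.col(self.getInputCol()) + 1)

t = AddOne(inputCol="x", outputCol="x_plus_one")
t.write().overwrite().save("/tmp/add_one")   # params are persisted as JSON metadata
t2 = AddOne.load("/tmp/add_one")
```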

Re: Problem in avg function Spark 1.6.3 using spark-shell

2017-06-25 Thread Riccardo Ferrari
Hi, Looks like you performed an aggregation on the ImageWidth column already. The error itself is quite self-explanatory: Cannot resolve column name "ImageWidth" among (MainDomainCode, *avg(length(ImageWidth))*). The columns available in that DF are MainDomainCode and avg(length(ImageWidth)), so
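A minimal sketch of what the reply implies: either reference the generated column name verbatim or alias the aggregation up front (df and the column names follow the thread):

```python
from pyspark.sql import functions as F

agg = (df.groupBy("MainDomainCode")
         .agg(F.avg(F.length("ImageWidth")).alias("avg_image_width")))

# The aliased name can now be referenced safely in later expressions:
agg.orderBy(F.desc("avg_image_width")).show()
```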

Re: Cassandra querying time stamps

2017-06-20 Thread Riccardo Ferrari
Hi, Personally I would inspect how dates are managed. What does your spark code look like? What does the explain plan say? Does the TimeStamp get parsed the same way? Best, On Tue, Jun 20, 2017 at 12:52 PM, sujeet jog wrote: > Hello, > > I have a table as below > > CREATE TABLE

Re: Create dataset from dataframe with missing columns

2017-06-15 Thread Riccardo Ferrari
Hi Jason, Is there a reason why you are not adding the desired column before mapping it to a Dataset[CC]? You could just do something like: df = df.withColumn('f2', ) and then: df.as(CC). Of course your default value can be null: lit(None).cast(to-some-type). Best, On Thu, Jun 15, 2017 at
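In Python terms the suggestion boils down to something like the sketch below (the .as[CC] conversion itself is Scala-only; the column name and type are illustrative):

```python
from pyspark.sql import functions as F

# Add the missing field as a typed null before converting to the Dataset
df = df.withColumn("f2", F.lit(None).cast("string"))
```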

Re: UDF percentile_approx

2017-06-14 Thread Riccardo Ferrari
Hi Andres, I can't find the reference; last time I searched for that I found that 'percentile_approx' is only available via the hive context. You should register a temp table and use it from there. Best, On Tue, Jun 13, 2017 at 8:52 PM, Andrés Ivaldi wrote: > Hello, I`m trying
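A minimal sketch of the temp-table route suggested above; depending on the Spark version a Hive-enabled session may be required, and the table/column names are made up:

```python
df.createOrReplaceTempView("events")

p95 = spark.sql(
    "SELECT percentile_approx(response_time, 0.95) AS p95 FROM events")
p95.show()
```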

Re: Read Data From NFS

2017-06-13 Thread Riccardo Ferrari
Hi Ayan, You might be interested in the official Spark docs: https://spark.apache.org/docs/latest/tuning.html#level-of-parallelism and its spark.default.parallelism setting Best, On Mon, Jun 12, 2017 at 6:18 AM, ayan guha wrote: > I understand how it works with hdfs. My
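A small sketch of the setting being pointed at; the value is only an example (the tuning guide suggests roughly 2-3 tasks per CPU core in the cluster as a starting point):

```python
from pyspark import SparkConf, SparkContext

conf = SparkConf().set("spark.default.parallelism", "16")
sc = SparkContext(conf=conf)

# Operations that take no explicit partition count (parallelize, reduceByKey, ...)
# fall back to this value:
rdd = sc.parallelize(range(1000))
print(rdd.getNumPartitions())   # 16
```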