Profiling options for PandasUDF (2.4.7 on yarn)

2021-05-28 Thread Patrick McCarthy
…columns of (count, row_id, column_id). It works at small scale but gets unstable as I scale up. Is there a way to profile this function in a Spark session, or am I limited to profiling on pandas DataFrames without Spark?
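
[Not from the thread itself — a minimal sketch of one way to profile inside a Spark 2.4 SCALAR pandas UDF, dumping cProfile stats to a per-host file since executor stdout is awkward to reach on YARN. The computation and output path are hypothetical.]

import cProfile
import socket

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf('double', PandasUDFType.SCALAR)
def scored(v: pd.Series) -> pd.Series:
    prof = cProfile.Profile()
    prof.enable()
    result = v * 2.0  # stand-in for the real per-batch computation
    prof.disable()
    # One stats file per executor host, to collect and inspect afterwards.
    prof.dump_stats('/tmp/udf_profile_%s.prof' % socket.gethostname())
    return result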

Re: Issue while installing dependencies Python Spark

2020-12-18 Thread Patrick McCarthy
…documentation linked above for further help. > Original error was: No module named 'numpy.core._multiarray_umath' > Kind Regards, Sachit Murarka > On Thu, Dec 17, 2020 at 9:24 PM Patrick McCarthy wrote: >> I'm not very fami…

Re: Getting error message

2020-12-17 Thread Patrick McCarthy
…running code on a local machine that is a single-node machine. > Getting into the logs, it looked like the host is killed. This is happening very frequently and I am unable to find the reason for this. > Could low memory be the reason? > On Fri, 18 Dec 2020, 00:11 Patrick…

Re: Getting error message

2020-12-17 Thread Patrick McCarthy
…the system, the program starts running fine. This error goes away on… > On Thu, 17 Dec 2020, 23:50 Patrick McCarthy wrote: >> my-domain.com/192.168.166.8:63534 probably isn't a valid address on your network, is it? >> On Thu, Dec 17, 2020…

Re: Issue while installing dependencies Python Spark

2020-12-17 Thread Patrick McCarthy
…/path/to/venv/bin/python3 --conf spark.pyspark.driver.python=/path/to/venv/bin/python3 > This did not help either. > Kind Regards, Sachit Murarka
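
[For comparison, a minimal sketch of the usual YARN recipe: ship a packed virtualenv as an archive and point the Python workers at the unpacked alias. Paths are hypothetical, not from the thread.]

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # The '#venv' alias is the directory the archive is unpacked to on each node.
    .config('spark.yarn.dist.archives', 'hdfs:///user/me/venv.tar.gz#venv')
    .config('spark.pyspark.python', './venv/bin/python3')
    .getOrCreate()
)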

Re: [Spark Core] Vectorizing very high-dimensional data sourced in long format

2020-10-30 Thread Patrick McCarthy
…Are there other Spark patterns that I should attempt in order to achieve my end goal of a vector of attributes for every entity? > Thanks, Daniel

Re: Hive using Spark engine vs native spark with hive integration.

2020-10-07 Thread Patrick McCarthy
…control some of the performance features, for example things like caching/evicting etc. > Any advice on this is much appreciated. > Thanks, -Manu

Re: regexp_extract regex for extracting the columns from string

2020-08-10 Thread Patrick McCarthy
…udf function. Apart from a UDF, is there any way to achieve it? > Thanks > Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
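
[A minimal sketch of pulling columns out of a string with regexp_extract and no UDF; the input layout and patterns are hypothetical stand-ins, and a SparkSession named 'spark' is assumed.]

from pyspark.sql import functions as F

df = spark.createDataFrame([('id=42;name=foo',)], ['raw'])
out = (
    df
    .withColumn('id', F.regexp_extract('raw', r'id=(\d+)', 1))    # first capture group
    .withColumn('name', F.regexp_extract('raw', r'name=(\w+)', 1))
)
out.show()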

Re: [Spark SQL]: Can't write DataFrame after using explode function on multiple columns.

2020-08-03 Thread Patrick McCarthy
…al client. > P.S. I was avoiding UDFs for now because I'm still on Spark 2.4 and the Python UDFs I tried had very bad performance, but I will give it a try in this case. It can't be worse. Thanks again! > On Mon, Aug 3, 2020 at 10:53, Patrick McCarthy…
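
[Since the thread is about exploding several columns on Spark 2.4 without a Python UDF, a minimal sketch using arrays_zip (2.4+); column names are hypothetical and a SparkSession 'spark' is assumed.]

from pyspark.sql import functions as F

df = spark.createDataFrame([(1, ['a', 'b'], [10, 20])], ['id', 'letters', 'numbers'])
flat = (
    # One explode over zipped arrays instead of one explode per column.
    df.withColumn('z', F.explode(F.arrays_zip('letters', 'numbers')))
    .select('id', F.col('z.letters').alias('letter'), F.col('z.numbers').alias('number'))
)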

Re: [Spark SQL]: Can't write DataFrame after using explode function on multiple columns.

2020-08-03 Thread Patrick McCarthy

Re: Issue in parallelization of CNN model using spark

2020-07-14 Thread Patrick McCarthy
…> Best regards, Mukhtaj

Building Spark 3.0.0 for Hive 1.2

2020-07-10 Thread Patrick McCarthy
…on/pyspark/sql/session.py", line 191, in getOrCreate
    session._jsparkSession.sessionState().conf().setConfString(key, value)
  File "/home/pmccarthy/custom-spark-3/python/lib/py4j-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/home/pmccarthy/custom-spark-3/p…

Re: Reading TB of JSON file

2020-06-18 Thread Patrick McCarthy
…afford having 50 GB of driver memory. In general, what is the best practice for reading a large JSON file, like 50 GB? > Thanks
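
[Not from the thread: a minimal sketch relying on the fact that line-delimited JSON splits across executors, so the read is distributed rather than driver-bound, and an explicit schema avoids the costly inference pass. Path and fields are hypothetical; a SparkSession 'spark' is assumed.]

from pyspark.sql import types as T

schema = T.StructType([
    T.StructField('id', T.StringType()),
    T.StructField('score', T.DoubleType()),
])

# Each executor reads its own split; nothing needs 50 GB of driver memory.
# A single multiLine JSON document, by contrast, is not splittable.
df = spark.read.schema(schema).json('hdfs:///data/big.json')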

Re: Add python library

2020-06-08 Thread Patrick McCarthy
…has binary code? Below is an example: >>> def do_something(p): ... >>> rdd = sc.parallelize([{"x": 1, "y": 2}, {"x": 2, "y": 3},…

Re: Save Spark dataframe as dynamic partitioned table in Hive

2020-04-16 Thread Patrick McCarthy
…adcastStaging PARTITION (broadcastId = broadcastValue, brand = dummy)
-^^^
SELECT ocis_party_id AS partyId, target_mobile_no AS phoneNumber
FROM tmp

I…
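
[A minimal sketch, not the thread's resolution, of a dynamic-partition insert, where the partition columns come last in the SELECT instead of being assigned literal values; names echo the thread but the schema is hypothetical.]

spark.conf.set('hive.exec.dynamic.partition', 'true')
spark.conf.set('hive.exec.dynamic.partition.mode', 'nonstrict')

spark.sql("""
    INSERT OVERWRITE TABLE broadcastStaging PARTITION (broadcastId, brand)
    SELECT ocis_party_id AS partyId,
           target_mobile_no AS phoneNumber,
           broadcastId,
           brand
    FROM tmp
""")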

Design pattern to invert a large map

2020-03-31 Thread Patrick McCarthy
I'm not a software engineer by training and I hope that there's an existing best practice for the problem I'm trying to solve. I'm using Spark 2.4.5, Hadoop 2.7, Hive 1.2. I have a large table (terabytes) from an external source (which is beyond my control) where the data is stored in a key-value

Re: Using Percentile in Spark SQL

2019-11-11 Thread Patrick McCarthy
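
[The thread asks about percentiles in Spark SQL; a minimal sketch using the built-in percentile_approx aggregate so no UDF is needed. Table and column names are hypothetical.]

df.createOrReplaceTempView('salaries')
spark.sql("""
    SELECT dept, percentile_approx(salary, 0.5) AS median_salary
    FROM salaries
    GROUP BY dept
""").show()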

Best practices for data like file storage

2019-11-01 Thread Patrick McCarthy
Hi List, I'm looking for resources to learn about how to store data on disk for later access. For a while my team has been using Spark on top of our existing hdfs/Hive cluster without much agency as far as what format is used to store the data. I'd like to learn more about how to re-stage my data

Re: [Spark SQL]: Does Union operation followed by drop duplicate follows "keep first"

2019-09-13 Thread Patrick McCarthy
…at 11:44 AM Abhinesh Hada wrote: > Hi, I am trying to take the union of 2 dataframes and then drop duplicates based on the value of a specific column. But I want to make sure that while dropping duplicates, the rows from the first data frame are kept. > Example: df1…
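
[dropDuplicates makes no keep-first promise after a union, so a minimal sketch (column names hypothetical) tags each frame with a priority and keeps the lower tag per key.]

from pyspark.sql import Window, functions as F

tagged = (
    df1.withColumn('prio', F.lit(0))
    .unionByName(df2.withColumn('prio', F.lit(1)))
)
w = Window.partitionBy('key').orderBy('prio')
kept = (
    tagged
    .withColumn('rn', F.row_number().over(w))  # rn == 1 is the df1 row when present
    .filter(F.col('rn') == 1)
    .drop('rn', 'prio')
)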

Re: script running in jupyter 6-7x faster than spark submit

2019-09-11 Thread Patrick McCarthy
…Dhrubajyoti Hati <dhruba.w...@gmail.com> wrote: >>>>> No, I checked for that, hence the "brand new" Jupyter notebook. Also, the times taken by the two are 30 mins and ~3 hrs, as I am reading a 500…

Re: script running in jupyter 6-7x faster than spark submit

2019-09-10 Thread Patrick McCarthy
…from which both are executed are also the same, and from the same user. > What I found is that the quantile value for the median for the one run with Jupyter was 1.3 mins, and the one run with spark-submit was ~8.5 mins. I am not able to figure out why this is happening. > Anyone faced…

Re: Hive external table not working in sparkSQL when subdirectories are present

2019-08-07 Thread Patrick McCarthy
…*select count(*) from ExtTable* via the Hive CLI, it successfully gives me the expected count of records in the table. However, when I fire the same query via sparkSQL, I get count = 0. > I think sparkSQL isn't able to descend into the subdirectories to get the data, while Hive is able to do so. Are there any configurations that need to be set on the Spark side so that this works as it does via the Hive CLI? I am using Spark on YARN. > Thanks, Rishikesh > Tags: subdirectories, subdirectory, recursive, recursion, hive external table, orc, sparksql, yarn
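
[A minimal sketch, not a confirmed fix: these are the settings usually involved when Hive table data sits in subdirectories, though whether they take effect depends on the Spark and Hive versions in play. The table name is hypothetical.]

spark.conf.set('mapreduce.input.fileinputformat.input.dir.recursive', 'true')
spark.conf.set('hive.mapred.supports.subdirectories', 'true')
spark.sql('SELECT COUNT(*) FROM ext_table').show()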

Re: Spark Image resizing

2019-07-31 Thread Patrick McCarthy
= spark.read.format("image").load(imageDir) >> Can you please help me with this? >> Nick

Re: [Beginner] Run compute on large matrices and return the result in seconds?

2019-07-17 Thread Patrick McCarthy
…, so precomputing our coefficients is not an option. This needs to be done on request. > I've been looking at many compute solutions, but I'd consider Spark first due to its widespread use and community. I currently have my data loaded into Apache HBase for a different scenario (random access of rows/columns). I've naively tried loading a dataframe from the CSV using a Spark instance hosted on AWS EMR, but getting the results for even a single correlation takes over 20 seconds. > Thank you! > --gautham

Re: spark python script importError problem

2019-07-16 Thread Patrick McCarthy
…s(obj) > ImportError: No module named feature.user.user_feature > The script also runs well with "sbin/start-master.sh; sbin/start-slave.sh", but it has the same ImportError problem with "sbin/start-master.sh; sbin/start-slaves.sh". The conf/slaves contents is…

Re: Anaconda installation with Pyspark/Pyarrow (2.3.0+) on cloudera managed server

2019-05-06 Thread Patrick McCarthy
…and getting back an array. > Regards, Gourav Sengupta > On Mon, May 6, 2019 at 2:00 PM Patrick McCarthy wrote: >> Human time is considerably more expensive than computer time, so in that regard, yes…

Re: Anaconda installation with Pyspark/Pyarrow (2.3.0+) on cloudera managed server

2019-05-06 Thread Patrick McCarthy
…your code and prove it? > On Sun, May 5, 2019 at 9:24 PM Patrick McCarthy wrote: >> I disagree that it's hype. Perhaps not 1:1 with pure Scala performance-wise, but for Python-based data scientists or others with a lot of Python expertise it allows one…

Re: Anaconda installation with Pyspark/Pyarrow (2.3.0+) on cloudera managed server

2019-05-05 Thread Patrick McCarthy
…tried to package it according to these instructions; it got distributed on the cluster, however the same Spark program that takes 5 mins without a pandas UDF has started to take 25 mins… >> Have you experienced anything like this? Also, is PyArrow 0.12 supported w…

Re: Anaconda installation with Pyspark/Pyarrow (2.3.0+) on cloudera managed server

2019-04-30 Thread Patrick McCarthy
…but this directory doesn't include all the packages to form a proper parcel for distribution. > Any help is much appreciated! > Regards, Rishi Shah

Re: Spark ML with null labels

2019-01-10 Thread Patrick McCarthy
…doesn't always need to do a 1:1 mapping. > On Thu, Jan 10, 2019, 7:53 AM Patrick McCarthy wrote: >> I'm trying to implement an algorithm on the MNIST digits that runs like so: for every pair of digits (0,1), (0,2), (0,3)... assign a 0/1 label…

Spark ML with null labels

2019-01-10 Thread Patrick McCarthy
I'm trying to implement an algorithm on the MNIST digits that runs like so:
- for every pair of digits (0,1), (0,2), (0,3)... assign a 0/1 label to the digits and build a LogisticRegression Classifier -- 45 in total
- Fit every classifier on the test set separately
- Aggregate the res…
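
[A minimal sketch of the pairwise loop described above, assuming a hypothetical 'train' frame with 'digit' and 'features' columns; not the thread author's code.]

from itertools import combinations

from pyspark.ml.classification import LogisticRegression
from pyspark.sql import functions as F

models = {}
for a, b in combinations(range(10), 2):  # the 45 digit pairs
    pair_df = (
        train.filter(F.col('digit').isin(a, b))
        .withColumn('label', (F.col('digit') == b).cast('double'))  # 0/1 label per pair
    )
    models[(a, b)] = LogisticRegression(featuresCol='features').fit(pair_df)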

Re: Need help with SparkSQL Query

2018-12-17 Thread Patrick McCarthy
Untested, but something like the below should work:

from pyspark.sql import functions as F
from pyspark.sql import window as W

(record
 .withColumn('ts_rank',
     F.dense_rank().over(W.Window.partitionBy('id').orderBy('timestamp')))
 .filter(F.col('ts_rank') == 1)
 .drop('ts_rank'))

On Mon, Dec 17, 2018…

Re: Questions on Python support with Spark

2018-11-12 Thread Patrick McCarthy
I've never tried to run a stand-alone cluster alongside Hadoop, but why not run Spark as a YARN application? That way it can absolutely (in fact preferably) use the distributed file system. On Fri, Nov 9, 2018 at 5:04 PM, Arijit Tarafdar wrote: > Hello All, > We have a requirement to run P…

Re: Python Dependencies Issue on EMR

2018-09-14 Thread Patrick McCarthy
You didn't say how you're zipping the dependencies, but I'm guessing you either include .egg files or zipped up a virtualenv. In either case, the extra C stuff that scipy and pandas rely upon doesn't get included. An approach like this solved the last problem I had that seemed like this - https://

Re: How to make pyspark use custom python?

2018-09-06 Thread Patrick McCarthy
It looks like for whatever reason your cluster isn't using the python you distributed, or said distribution doesn't contain what you think. I've used the following with success to deploy a conda environment to my cluster at runtime: https://henning.kropponline.de/2016/09/24/running-pyspark-with-co

Re: Pitfalls of partitioning by host?

2018-08-28 Thread Patrick McCarthy
I'm not 100% sure, but a naive repartition() seems to cause a shuffle. If this is actually happening, it's just wasteful overhead. The ambition is to say "divide the data into partitions, but make sure you don't move it in doing so". On Tue, Aug 28, 2018 at 2:06 PM

Re: Pitfalls of partitioning by host?

2018-08-28 Thread Patrick McCarthy
…On Tue, Aug 28, 2018 at 8:29 PM, Patrick McCarthy wrote: >> Mostly I'm guessing that it adds efficiency to a job where partitioning is required but shuffling is not. >> For exa…

Re: Pitfalls of partitioning by host?

2018-08-28 Thread Patrick McCarthy
> by hostname prior to that or not. My question is, is there anything else > that you would expect to gain, except for enforcing maybe a dataset that is > already bucketed? Like you could enforce that data is where it is supposed > to be, but what else would you avoid? > > Sent f

Re: Pitfalls of partitioning by host?

2018-08-28 Thread Patrick McCarthy
…that it's totally balanced, then I'd hope that I could save a lot of overhead with foo = df.withColumn('randkey', F.floor(1000 * F.rand())).repartition(5000, 'randkey', 'host').apply(udf) On Tue, Aug 28, 2018 at 10:28 AM, Patrick McCarthy wrote: > Mostl…

Pitfalls of partitioning by host?

2018-08-27 Thread Patrick McCarthy
When debugging some behavior on my YARN cluster I wrote the following PySpark UDF to figure out what host was operating on what row of data:

@F.udf(T.StringType())
def add_hostname(x):
    import socket
    return str(socket.gethostname())

It occurred to me that I could use this to enforce node…

Re: How to merge multiple rows

2018-08-22 Thread Patrick McCarthy
You didn't specify which API, but in pyspark you could do

import pyspark.sql.functions as F
df.groupBy('ID').agg(F.sort_array(F.collect_set('DETAILS')).alias('DETAILS')).show()

+---+------------+
| ID|     DETAILS|
+---+------------+
|  1|[A1, A2, A3]|
|  3|        [B2]|
|  2|        [B1]|
+---+------------+

Re: Arrow type issue with Pandas UDF

2018-07-24 Thread Patrick McCarthy
…sampling a fraction of the data? If the problem persists could you make a JIRA? At the very least a better exception would be nice. > Bryan > On Thu, Jul 19, 2018, 7:07 AM Patrick McCarthy wrote: >> PySpark 2.3.1 on YARN, Python 3.6, PyArrow 0.8.…

Arrow type issue with Pandas UDF

2018-07-19 Thread Patrick McCarthy
PySpark 2.3.1 on YARN, Python 3.6, PyArrow 0.8. I'm trying to run a pandas UDF, but I seem to get nonsensical exceptions in the last stage of the job regardless of my output type. The problem I'm trying to solve: I have a column of scalar values, and each value on the same row has a sorted vecto

Re: DataTypes of an ArrayType

2018-07-11 Thread Patrick McCarthy
Arrays need to be a single type; I think you're looking for a struct column. See: https://medium.com/@mrpowers/adding-structtype-columns-to-spark-dataframes-b44125409803 On Wed, Jul 11, 2018 at 6:37 AM, dimitris plakas wrote: > Hello everyone, > I am new to Pyspark and I would like to ask if t…
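
[A minimal sketch of mixing types in one column via a struct; hypothetical names, with a SparkSession 'spark' assumed.]

from pyspark.sql import functions as F

df = spark.createDataFrame([(1, 'a', 2.5)], ['id', 'tag', 'score'])
bundled = df.withColumn('bundle', F.struct('tag', 'score'))  # string + double in one column
bundled.select('id', 'bundle.tag', 'bundle.score').show()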

Re: Building SparkML vectors from long data

2018-07-03 Thread Patrick McCarthy
…)))
    .groupBy("ID")
    .agg(collect_set(col("val")).name("collected_val"))
    .withColumn("collected_val",
        toVector(col("collected_val")).as[Row](Encoders.javaSerialization(classOf[Row])))

at least works. The indices sti…

Building SparkML vectors from long data

2018-06-12 Thread Patrick McCarthy
I work with a lot of data in a long format, cases in which an ID column is repeated, followed by a variable and a value column like so:

+----+-----+-------+
| ID | var | value |
+----+-----+-------+
| A  | v1  |   1.0 |
| A  | v2  |   2.0 |
| B  | v1  |   1.5 |
| B  | v3  |  -1.0 |
+----+-----+-------+

I…
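
[A minimal sketch of one way to finish the job, assuming the variable names were already indexed to integer positions in a hypothetical long_df(ID, var_idx, value); not the thread's accepted answer.]

from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql import functions as F

n_features = 3  # hypothetical size of the variable vocabulary

@F.udf(VectorUDT())
def to_vector(idx, vals):
    # SparseVector requires indices in ascending order.
    pairs = sorted(zip(idx, vals))
    return Vectors.sparse(n_features, [i for i, _ in pairs], [v for _, v in pairs])

vectors = (
    long_df
    .groupBy('ID')
    .agg(F.collect_list('var_idx').alias('idx'),
         F.collect_list('value').alias('vals'))
    .withColumn('features', to_vector('idx', 'vals'))
)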

Poor performance reading Hive table made of sequence files

2018-05-01 Thread Patrick McCarthy
I recently ran a query with the following form:

select a.*, b.*
from some_small_table a
inner join (
    select things from someother table
    lateral view explode(s) ss as sss
    where a_key in (x, y, z)
) b on a.key = b.key
where someothercriterion

On Hive, this query took about five minutes. In S…

Reservoir sampling in parallel

2018-02-23 Thread Patrick McCarthy
I have a large dataset composed of scores for several thousand segments, and the timestamps at which time those scores occurred. I'd like to apply some techniques like reservoir sampling[1], where for every segment I process records in order of their timestamps, generate a sample, and then at inter
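
[For reference, a minimal single-segment sketch of the classic Algorithm R step [1]; in Spark it could run per segment, e.g. inside mapPartitions over timestamp-sorted records. Names are hypothetical.]

import random

def reservoir(records, k=100):
    # Algorithm R: each record ends up in the sample with probability k/n.
    sample = []
    for i, rec in enumerate(records):
        if i < k:
            sample.append(rec)
        else:
            j = random.randint(0, i)
            if j < k:
                sample[j] = rec
    return sample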

Re: Type Casting Error in Spark Data Frame

2018-01-29 Thread Patrick McCarthy
You can't select from an array like that; try instead using 'lateral view explode' in the query for that element, or before the SQL stage (py)spark.sql.functions.explode. On Mon, Jan 29, 2018 at 4:26 PM, Arnav kumar wrote: > Hello Experts, > I would need your advice in resolving the below issu…
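
[A minimal sketch of the DataFrame-side version, with hypothetical column names and a SparkSession 'spark' assumed.]

from pyspark.sql import functions as F

df = spark.createDataFrame([(1, ['a', 'b', 'c'])], ['id', 'items'])
flat = df.select('id', F.explode('items').alias('item'))  # one row per array element
flat.filter(F.col('item') == 'b').show()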

Re: Spark vs Snowflake

2018-01-22 Thread Patrick McCarthy
Last I heard of them a year or two ago, they basically repackage AWS services behind their own API/service layer for convenience. There's probably a value-add if you're not familiar with optimizing AWS, but if you already have that expertise I don't expect they would add much extra performance if a

Re: StreamingLogisticRegressionWithSGD : Multiclass Classification : Options

2018-01-19 Thread Patrick McCarthy
…using OneVsRest with StreamingLogisticRegressionWithSGD? > Regards, Sundeep > On Thu, Jan 18, 2018 at 8:18 PM, Patrick McCarthy wrote: >> As a hack, you could fit a number of 1-vs-all classifiers and then post hoc select among the highest prediction probability to assign…

Re: StreamingLogisticRegressionWithSGD : Multiclass Classification : Options

2018-01-18 Thread Patrick McCarthy
As a hack, you could fit a number of 1-vs-all classifiers and then post hoc select the class with the highest prediction probability. On Thu, Jan 18, 2018 at 12:17 AM, Sundeep Kumar Mehta wrote: > Hi, > I was looking for Logistic Regression with a multi-class classifier on Streamin…
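
[For the batch (non-streaming) case, pyspark.ml ships a OneVsRest wrapper that automates the same hack; a minimal sketch with hypothetical 'train' and 'test' frames carrying label/features columns.]

from pyspark.ml.classification import LogisticRegression, OneVsRest

ovr = OneVsRest(classifier=LogisticRegression(maxIter=10))
model = ovr.fit(train)               # fits one binary model per class
predictions = model.transform(test)  # picks the class with the highest score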

Re: PySpark - Expand rows into dataframes via function

2017-10-03 Thread Patrick McCarthy
…object) a > If you want dynamic IP ranges, you need to dynamically construct structs based on the range values. Hope this helps. > Thanks, Sathish > On Mon, Oct 2, 2017 at 9:01 AM Patrick McCarthy wrote: >> Hello, >> I'm try…

PySpark - Expand rows into dataframes via function

2017-10-02 Thread Patrick McCarthy
Hello, I'm trying to map ARIN registry files into more explicit IP ranges. They provide a number of IPs in the range (here it's 8192) and a starting IP, and I'm trying to map it into all the included /24 subnets. For example, Input: array(['arin', 'US', 'ipv4', '23.239.160.0', 8192, 20131104.0,
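
[Not from the thread: a minimal sketch of expanding a starting IP plus an address count into its /24 blocks with the stdlib ipaddress module; it assumes the count is a power of two, as in the example above.]

import ipaddress

def to_24s(start_ip, n_addresses):
    # 8192 addresses = 2**13 host bits = a /19 network (assumes a power of two).
    prefix = 32 - (n_addresses.bit_length() - 1)
    net = ipaddress.ip_network('%s/%d' % (start_ip, prefix))
    return [str(s) for s in net.subnets(new_prefix=24)]

to_24s('23.239.160.0', 8192)  # 32 blocks: 23.239.160.0/24 ... 23.239.191.0/24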

Re: Apache Spark: Parallelization of Multiple Machine Learning ALgorithm

2017-09-05 Thread Patrick McCarthy
You might benefit from watching this JIRA issue - https://issues.apache.org/jira/browse/SPARK-19071 On Sun, Sep 3, 2017 at 5:50 PM, Timsina, Prem wrote: > Is there a way to parallelize multiple ML algorithms in Spark. My use case > is something like this: > > A) Run multiple machine learning alg

Re: Memory problems with simple ETL in Pyspark

2017-04-16 Thread Patrick McCarthy
…On Sun, 16 Apr 2017 at 11:06 am, ayan guha wrote: >> It does not look like a Scala-vs-Python thing. How big is your audience data store? Can it be broadcast? >> What is the memory footprint you are seeing? At what point is YARN killing it? Dependi…

Memory problems with simple ETL in Pyspark

2017-04-14 Thread Patrick McCarthy
Hello, I'm trying to build an ETL job which takes in 30-100 GB of text data and prepares it for SparkML. I don't speak Scala, so I've been trying to implement it in PySpark on YARN, Spark 2.1. Despite the transformations being fairly simple, the job always fails by running out of executor memory. The…

Re: Twitter4J streaming question

2015-07-23 Thread Patrick McCarthy
Ahh, makes sense - thanks for the help. Sent from my iPhone. On Jul 23, 2015, at 4:29 PM, Enno Shioji wrote: You need to pay a lot of money to get the full stream, so unless you are doing that, it's the sample stream! On Thu, Jul 23, 2015 at 9:26 PM, Patri…

Re: Twitter4J streaming question

2015-07-23 Thread Patrick McCarthy
How can I tell if it's the sample stream or the full stream? Thanks. Sent from my iPhone. On Jul 23, 2015, at 4:17 PM, Enno Shioji wrote: You are probably listening to the sample stream, and THEN filtering. This means you listen to 1% of the Twitter stream, and then looki…