Profiling options for PandasUDF (2.4.7 on yarn)

2021-05-28 Thread Patrick McCarthy
of (count, row_id, column_id). It works at small scale but gets unstable as I scale up. Is there a way to profile this function in a Spark session, or am I limited to profiling on pandas DataFrames without Spark?
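
One way to get visibility inside the UDF itself, offered as a hedged sketch rather than the thread's answer: run cProfile around the per-batch work and append the stats to a file on the executor's local disk. The UDF body and paths below are stand-ins, not the thread's actual function:

    import cProfile, io, pstats
    from pyspark.sql import types as T
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    @pandas_udf(T.DoubleType(), PandasUDFType.SCALAR)  # Spark 2.4-style scalar pandas UDF
    def scored(v):
        prof = cProfile.Profile()
        prof.enable()
        result = v * 2.0  # stand-in for the real per-batch work
        prof.disable()
        buf = io.StringIO()
        pstats.Stats(prof, stream=buf).sort_stats("cumulative").print_stats(10)
        # Collect these files from the executor hosts afterwards
        with open("/tmp/pandas_udf_profile.txt", "a") as f:
            f.write(buf.getvalue())
        return result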

Re: Issue while installing dependencies Python Spark

2020-12-18 Thread Patrick McCarthy
ocumentation linked above for further help. > Original error was: No module named 'numpy.core._multiarray_umath' > Kind Regards, Sachit Murarka > On Thu, Dec 17, 2020 at 9:24 PM Patrick McCarthy wrote: >> I'm not very familiar with the e

Re: Getting error message

2020-12-17 Thread Patrick McCarthy
ing code in a local machine that is a single node machine. > Getting into logs, it looked like the host is killed. This is happening very frequently and I am unable to find the reason for this. > Could low memory be the reason? > On Fri, 18 Dec 2020, 00:11 Patrick McCar

Re: Getting error message

2020-12-17 Thread Patrick McCarthy
gram starts running fine. > This error goes away on > On Thu, 17 Dec 2020, 23:50 Patrick McCarthy wrote: >> my-domain.com/192.168.166.8:63534 probably isn't a valid address on your network, is it? >> On Thu, Dec 17, 2020 at 3:03 AM Vikas Garg wr

Re: Issue while installing dependencies Python Spark

2020-12-17 Thread Patrick McCarthy
path/to/venv/bin/python3 > This did not help either. > Kind Regards, Sachit Murarka

Re: [Spark Core] Vectorizing very high-dimensional data sourced in long format

2020-10-30 Thread Patrick McCarthy
there other Spark patterns that I should attempt in order to achieve my end goal of a vector of attributes for every entity? > Thanks, Daniel

Re: Hive using Spark engine vs native spark with hive integration.

2020-10-07 Thread Patrick McCarthy
ome of the performance features, for example things like caching/evicting etc. > Any advice on this is much appreciated. > Thanks, -Manu

Re: regexp_extract regex for extracting the columns from string

2020-08-10 Thread Patrick McCarthy
> apart from udf, is there any way to achieve it? > Thanks
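
Assuming the question is about pulling fields out of a patterned string without a Python UDF, regexp_extract does this natively; a sketch with made-up data and patterns:

    from pyspark.sql import functions as F

    df = spark.createDataFrame([("name=alice;age=30",)], ["raw"])
    df.select(
        F.regexp_extract("raw", r"name=([^;]+)", 1).alias("name"),
        F.regexp_extract("raw", r"age=(\d+)", 1).alias("age"),
    ).show()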

Re: [Spark SQL]: Can't write DataFrame after using explode function on multiple columns.

2020-08-03 Thread Patrick McCarthy
was avoiding UDFs for now because I'm still on Spark 2.4 and the Python UDFs I tried had very bad performance, but I will give it a try in this case. It can't be worse. Thanks again! > On Mon, Aug 3, 2020 at 10:53, Patrick McCarthy <pmccar...@dstillery.com>

Re: [Spark SQL]: Can't write DataFrame after using explode function on multiple columns.

2020-08-03 Thread Patrick McCarthy

Re: Issue in parallelization of CNN model using spark

2020-07-14 Thread Patrick McCarthy

Building Spark 3.0.0 for Hive 1.2

2020-07-10 Thread Patrick McCarthy
ConfString(key, value)
      File "/home/pmccarthy/custom-spark-3/python/lib/py4j-src.zip/py4j/java_gateway.py", line 1305, in __call__
      File "/home/pmccarthy/custom-spark-3/python/pyspark/sql/utils.py", line 137, in deco
        raise_from(converted)
      File "", line 3, in

Re: Reading TB of JSON file

2020-06-18 Thread Patrick McCarthy
fford having 50 GB on driver memory. In general, what is the best practice to read a large JSON file like 50 GB? > Thanks
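
One common practice, offered as a hedged sketch rather than the thread's conclusion: provide the schema explicitly so Spark doesn't scan the data to infer it, and keep the JSON line-delimited so the read parallelizes. Field names and the path are assumed:

    from pyspark.sql import types as T

    schema = T.StructType([
        T.StructField("id", T.StringType()),
        T.StructField("payload", T.StringType()),
    ])
    # Line-delimited JSON splits across many tasks; one giant multiline
    # document cannot be split and lands on a single task.
    df = spark.read.schema(schema).json("hdfs:///data/big.json")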

Re: Add python library

2020-06-08 Thread Patrick McCarthy
low is an example:

    def do_something(p):
        ...

    rdd = sc.parallelize([
        {"x": 1, "y": 2},
        {"x": 2, "y": 3},
        {"x": 3,

Re: Save Spark dataframe as dynamic partitioned table in Hive

2020-04-16 Thread Patrick McCarthy
ION (broadcastId = broadcastValue, brand = dummy)
                  -^^^
    SELECT
      ocis_party_id AS partyId
    , target_mobile_no AS phoneNumber
    FROM tmp

It fails passing part

Design pattern to invert a large map

2020-03-31 Thread Patrick McCarthy
I'm not a software engineer by training and I hope that there's an existing best practice for the problem I'm trying to solve. I'm using Spark 2.4.5, Hadoop 2.7, Hive 1.2. I have a large table (terabytes) from an external source (which is beyond my control) where the data is stored in a key-value

Re: Using Percentile in Spark SQL

2019-11-11 Thread Patrick McCarthy
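
A minimal sketch of computing percentiles in Spark SQL with the built-in approximate aggregate (column and table names assumed):

    df.createOrReplaceTempView("t")
    spark.sql("""
        select percentile_approx(value, array(0.25, 0.5, 0.75)) as quartiles
        from t
    """).show()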

Best practices for data like file storage

2019-11-01 Thread Patrick McCarthy
Hi List, I'm looking for resources to learn about how to store data on disk for later access. For a while my team has been using Spark on top of our existing hdfs/Hive cluster without much agency as far as what format is used to store the data. I'd like to learn more about how to re-stage my

Re: [Spark SQL]: Does Union operation followed by drop duplicate follows "keep first"

2019-09-13 Thread Patrick McCarthy
licate based on the value of a specific column. But I want to make sure that while dropping duplicates, the rows from the first data frame are kept. > Example: df1 = df1.union(df2).dropDuplicates(['id'])
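
dropDuplicates makes no guarantee about which row survives. A hedged sketch of one way to force "keep first": tag each frame with a priority before the union, then keep the best-ranked row per id (schemas assumed identical):

    from pyspark.sql import functions as F, Window as W

    tagged = df1.withColumn("src", F.lit(0)).union(df2.withColumn("src", F.lit(1)))
    w = W.partitionBy("id").orderBy("src")
    kept = (tagged
            .withColumn("rn", F.row_number().over(w))
            .filter(F.col("rn") == 1)
            .drop("rn", "src"))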

Re: script running in jupyter 6-7x faster than spark submit

2019-09-11 Thread Patrick McCarthy
>>> dhruba.w...@gmail.com>: >> No, I checked for that, hence wrote a "brand new" jupyter notebook. Also the time taken by the two is 30 mins and ~3 hrs, as I am reading a 500 gigs compressed base64 encoded tex

Re: script running in jupyter 6-7x faster than spark submit

2019-09-10 Thread Patrick McCarthy
executed are also the same and from the same user. > What I found is that the quantile value for the median for the one run with jupyter was 1.3 mins and the one run with spark-submit was ~8.5 mins. I am not able to figure out why this is happening. > Anyone faced this kind of issue b

Re: Hive external table not working in sparkSQL when subdirectories are present

2019-08-07 Thread Patrick McCarthy
count(*) from ExtTable* via the Hive CLI, it successfully gives me the expected count of records in the table. However, when I fire the same query via SparkSQL, I get count = 0. > I think SparkSQL isn't able to descend into the subdirectories to get the data, while Hive is able to do so. Are there any configurations to be set on the Spark side so that this works as it does via the Hive CLI? I am using Spark on YARN. > Thanks, Rishikesh > Tags: subdirectories, subdirectory, recursive, recursion, hive external table, orc, sparksql, yarn
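
Settings that are sometimes suggested for this situation, as a hedged sketch only; whether they apply depends on the table's input format and the Spark version, and the thread does not confirm them:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .enableHiveSupport()
             # Hadoop/Hive flags asking input formats to descend into subdirectories
             .config("spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive", "true")
             .config("spark.hadoop.hive.mapred.supports.subdirectories", "true")
             .getOrCreate())
    spark.sql("select count(*) from ExtTable").show()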

Re: Spark Image resizing

2019-07-31 Thread Patrick McCarthy
("image").load(imageDir) >> >> Can you please help me with this? >> >> Nick >> > -- *Patrick McCarthy * Senior Data Scientist, Machine Learning Engineering Dstillery 470 Park Ave South, 17th Floor, NYC 10016

Re: [Beginner] Run compute on large matrices and return the result in seconds?

2019-07-17 Thread Patrick McCarthy
and their corresponding row keys need to be returned in under 5 seconds. > 4. Users will eventually request random row/column subsets to run calculations on, so precomputing our coefficients is not an option. This needs to be done on request. > I've been l

Re: spark python script importError problem

2019-07-16 Thread Patrick McCarthy
ror: No module named feature.user.user_feature > The script also runs well with "sbin/start-master.sh sbin/start-slave.sh", but it has the same ImportError problem with "sbin/start-master.sh sbin/start-slaves.sh". The conf/slaves contents is 'localhost'. > W

Re: Anaconda installation with Pyspark/Pyarrow (2.3.0+) on cloudera managed server

2019-05-06 Thread Patrick McCarthy
>> an array. >> Regards, Gourav Sengupta >> On Mon, May 6, 2019 at 2:00 PM Patrick McCarthy wrote: >>> Human time is considerably more expensive than computer time, so in that regard, yes :)

Re: Anaconda installation with Pyspark/Pyarrow (2.3.0+) on cloudera managed server

2019-05-06 Thread Patrick McCarthy
prove it? > On Sun, May 5, 2019 at 9:24 PM Patrick McCarthy wrote: >> I disagree that it's hype. Perhaps not 1:1 with pure scala performance-wise, but for python-based data scientists or others with a lot of python expertise it allows one to do things that would

Re: Anaconda installation with Pyspark/Pyarrow (2.3.0+) on cloudera managed server

2019-05-05 Thread Patrick McCarthy
ackage it according to these instructions, it >> got distributed on the cluster, however the same spark program that takes 5 mins without a pandas UDF has started to take 25 mins... >> Have you experienced anything like this? Also, is PyArrow 0.12 supported with Spark

Re: Anaconda installation with Pyspark/Pyarrow (2.3.0+) on cloudera managed server

2019-04-30 Thread Patrick McCarthy
his directory doesn't >> include all the packages to form a proper parcel for distribution. >> Any help is much appreciated! >> Regards, Rishi Shah

Re: Spark ML with null labels

2019-01-10 Thread Patrick McCarthy
s need to do 1:1 mapping. > On Thu, Jan 10, 2019, 7:53 AM Patrick McCarthy wrote: >> I'm trying to implement an algorithm on the MNIST digits that runs like so: >> - for every pair of digits (0,1), (0,2), (0,3)... assign a 0/1 label to t

Spark ML with null labels

2019-01-10 Thread Patrick McCarthy
I'm trying to implement an algorithm on the MNIST digits that runs like so:

- for every pair of digits (0,1), (0,2), (0,3)... assign a 0/1 label to the digits and build a LogisticRegression Classifier -- 45 in total
- Fit every classifier on the test set separately
- Aggregate the
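
A hedged sketch of the pairwise (one-vs-one) training loop described above; the thread's actual code isn't shown, and the column names here are assumed:

    from itertools import combinations
    from pyspark.ml.classification import LogisticRegression
    from pyspark.sql import functions as F

    models = {}
    for a, b in combinations(range(10), 2):  # 45 digit pairs
        pair = df.filter(F.col("digit").isin(a, b))  # rows of other digits stay out
        pair = pair.withColumn("label", (F.col("digit") == b).cast("double"))
        models[(a, b)] = LogisticRegression(featuresCol="features").fit(pair)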

Re: Need help with SparkSQL Query

2018-12-17 Thread Patrick McCarthy
Untested, but something like the below should work:

    from pyspark.sql import functions as F
    from pyspark.sql import window as W

    (record
     .withColumn('ts_rank',
                 F.dense_rank().over(W.Window.partitionBy("id").orderBy('timestamp')))
     .filter(F.col('ts_rank') == 1)
     .drop('ts_rank')
    )

On Mon, Dec 17,

Re: Questions on Python support with Spark

2018-11-12 Thread Patrick McCarthy
I've never tried to run a stand-alone cluster alongside Hadoop, but why not run Spark as a YARN application? That way it can absolutely (in fact preferably) use the distributed file system. On Fri, Nov 9, 2018 at 5:04 PM, Arijit Tarafdar wrote: > Hello All, > We have a requirement to run

Re: Python Dependencies Issue on EMR

2018-09-14 Thread Patrick McCarthy
You didn't say how you're zipping the dependencies, but I'm guessing you either include .egg files or zipped up a virtualenv. In either case, the extra C stuff that scipy and pandas rely upon doesn't get included. An approach like this solved the last problem I had that seemed like this -

Re: How to make pyspark use custom python?

2018-09-06 Thread Patrick McCarthy
It looks like for whatever reason your cluster isn't using the python you distributed, or said distribution doesn't contain what you think. I've used the following with success to deploy a conda environment to my cluster at runtime:
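
A sketch of the general shape of such a deployment, assuming a conda environment packed into an archive beforehand; all paths and names here are hypothetical:

    # First, on the gateway host (shell): conda pack -n my_env -o my_env.tar.gz
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("yarn")
             # Ship the packed env to every container, unpacked under ./env
             .config("spark.yarn.dist.archives", "hdfs:///tmp/my_env.tar.gz#env")
             # Point the Python workers at the shipped interpreter
             .config("spark.pyspark.python", "./env/bin/python")
             .getOrCreate())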

Re: Pitfalls of partitioning by host?

2018-08-28 Thread Patrick McCarthy
I'm not 100% sure, but a naive repartition() seems to cause a shuffle. If this is actually happening, it's just wasteful overhead. The ambition is to say "divide the data into partitions, but make sure you don't move it in doing so". On Tue, Aug 28, 2018 at 2:06 PM, Patrick McCar

Re: Pitfalls of partitioning by host?

2018-08-28 Thread Patrick McCarthy
oyal> > On Tue, Aug 28, 2018 at 8:29 PM, Patrick McCarthy <pmccar...@dstillery.com.invalid> wrote: >> Mostly I'm guessing that it adds efficiency to a job where partitioning is required but shuffling is not. >> For example, if I want

Re: Pitfalls of partitioning by host?

2018-08-28 Thread Patrick McCarthy
t. My question is, is there anything else that you would expect to gain, except for enforcing maybe a dataset that is already bucketed? Like you could enforce that data is where it is supposed to be, but what else would you avoid? > On Aug 27, 2018, at 1

Re: Pitfalls of partitioning by host?

2018-08-28 Thread Patrick McCarthy
totally balanced, then I'd hope that I could save a lot of overhead with

    foo = (df.withColumn('randkey', F.floor(1000*F.rand()))
             .repartition(5000, 'randkey', 'host')
             .apply(udf))

On Tue, Aug 28, 2018 at 10:28 AM, Patrick McCarthy wrote: > Mostly I'm guessing that it adds efficiency to a job wh

Pitfalls of partitioning by host?

2018-08-27 Thread Patrick McCarthy
When debugging some behavior on my YARN cluster I wrote the following PySpark UDF to figure out what host was operating on what row of data:

    @F.udf(T.StringType())
    def add_hostname(x):
        import socket
        return str(socket.gethostname())

It occurred to me that I could use this to enforce
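
A self-contained sketch of how such a UDF can be put to use, counting rows per executor host; the imports are spelled out and the column name is assumed:

    from pyspark.sql import functions as F, types as T

    @F.udf(T.StringType())
    def add_hostname(x):
        import socket
        return str(socket.gethostname())

    (df.withColumn("host", add_hostname(F.col("id")))
       .groupBy("host")
       .count()
       .show())  # rows handled per host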

Re: How to merge multiple rows

2018-08-22 Thread Patrick McCarthy
You didn't specify which API, but in pyspark you could do

    import pyspark.sql.functions as F
    df.groupBy('ID').agg(F.sort_array(F.collect_set('DETAILS')).alias('DETAILS')).show()

    +---+------------+
    | ID|     DETAILS|
    +---+------------+
    |  1|[A1, A2, A3]|
    |  3|        [B2]|
    |  2|        [B1]|
    +---+------------+

Re: Arrow type issue with Pandas UDF

2018-07-24 Thread Patrick McCarthy
fraction of the data? If the problem persists could you make a jira? At the very least a better exception would be nice. > Bryan > On Thu, Jul 19, 2018, 7:07 AM Patrick McCarthy wrote: >> PySpark 2.3.1 on YARN, Python 3.6, PyArrow 0.8.

Arrow type issue with Pandas UDF

2018-07-19 Thread Patrick McCarthy
PySpark 2.3.1 on YARN, Python 3.6, PyArrow 0.8. I'm trying to run a pandas UDF, but I seem to get nonsensical exceptions in the last stage of the job regardless of my output type. The problem I'm trying to solve: I have a column of scalar values, and each value on the same row has a sorted
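
The thread's actual UDF isn't shown; for context, a minimal scalar pandas UDF of that era, illustrating the requirement that the declared return type match what the function really produces (names and data assumed):

    from pyspark.sql import types as T
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    # If the declared type and the returned values disagree, Arrow raises
    # an often-cryptic error in the final stage.
    @pandas_udf(T.DoubleType(), PandasUDFType.SCALAR)
    def rescale(v):
        # v is a pandas Series holding a whole batch of rows
        return (v - v.mean()) / v.std()  # per-batch, just to show some work

    df = spark.createDataFrame([(1.0,), (2.0,), (5.0,)], ["val"])
    df.withColumn("scaled", rescale("val")).show()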

Re: DataTypes of an ArrayType

2018-07-11 Thread Patrick McCarthy
Arrays need to be a single type, I think you're looking for a Struct column. See: https://medium.com/@mrpowers/adding-structtype-columns-to-spark-dataframes-b44125409803 On Wed, Jul 11, 2018 at 6:37 AM, dimitris plakas wrote: > Hello everyone, > I am new to Pyspark and I would like to ask if
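
A small sketch of the distinction with made-up columns: an array column must hold one element type, while a struct can mix them.

    from pyspark.sql import functions as F

    df = spark.createDataFrame([(1, "a", 2.0)], ["id", "s", "x"])
    df.withColumn("packed", F.struct("s", "x")).printSchema()
    # packed comes out as struct<s:string, x:double>, mixed types, unlike an array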

Re: Building SparkML vectors from long data

2018-07-03 Thread Patrick McCarthy
"collected_val"))
      .withColumn("collected_val",
        toVector(col("collected_val")).as[Row](Encoders.javaSerialization(classOf[Row])))

at least works. The indices still aren't in order in the vector - I don't know if this matters much, but if it does,

Building SparkML vectors from long data

2018-06-12 Thread Patrick McCarthy
I work with a lot of data in a long format, cases in which an ID column is repeated, followed by a variable and a value column like so:

    +---+-----+-------+
    |ID | var | value |
    +---+-----+-------+
    | A | v1  | 1.0   |
    | A | v2  | 2.0   |
    | B | v1  | 1.5   |
    | B | v3  | -1.0  |
    +---+-----+-------+
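
A hedged sketch of one way to assemble such long data into per-ID ML vectors, assuming the distinct variables can be given a fixed 0..n-1 indexing up front (the mapping below is hard-coded for the toy table):

    from pyspark.ml.linalg import Vectors, VectorUDT
    from pyspark.sql import functions as F

    var_idx = {"v1": 0, "v2": 1, "v3": 2}  # assumed fixed indexing
    n_vars = len(var_idx)

    @F.udf(VectorUDT())
    def to_vector(pairs):
        # pairs: list of (var, value) structs collected per ID
        return Vectors.sparse(
            n_vars,
            sorted((var_idx[p["var"]], float(p["value"])) for p in pairs))

    vectors = (df.groupBy("ID")
                 .agg(F.collect_list(F.struct("var", "value")).alias("pairs"))
                 .withColumn("features", to_vector("pairs")))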

Poor performance reading Hive table made of sequence files

2018-05-01 Thread Patrick McCarthy
I recently ran a query with the following form:

    select a.*, b.*
    from some_small_table a
    inner join (
      select things from someother table
      lateral view explode(s) ss as sss
      where a_key is in (x,y,z)
    ) b on a.key = b.key
    where someothercriterion

On hive, this query took about five minutes. In

Reservoir sampling in parallel

2018-02-23 Thread Patrick McCarthy
I have a large dataset composed of scores for several thousand segments, and the timestamps at which time those scores occurred. I'd like to apply some techniques like reservoir sampling[1], where for every segment I process records in order of their timestamps, generate a sample, and then at
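
For reference, a minimal single-pass reservoir sample of the kind described (Algorithm R), which would be applied per segment over records sorted by timestamp; the driver-side application shown in the comment is an assumption, not the thread's code:

    import random

    def reservoir_sample(records, k):
        """Keep a uniform random sample of size k from a stream in one pass."""
        sample = []
        for i, rec in enumerate(records):
            if i < k:
                sample.append(rec)
            else:
                j = random.randint(0, i)
                if j < k:
                    sample[j] = rec
        return sample

    # Hypothetical per-segment use, rows pre-sorted by timestamp:
    # samples = {seg: reservoir_sample(rows, 100)
    #            for seg, rows in records_by_segment.items()}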

Re: Type Casting Error in Spark Data Frame

2018-01-29 Thread Patrick McCarthy
You can't select from an array like that; try instead using 'lateral view explode' in the query for that element, or before the SQL stage, (py)spark.sql.functions.explode. On Mon, Jan 29, 2018 at 4:26 PM, Arnav kumar wrote: > Hello Experts, > I would need your advice in
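
Both routes in miniature, with toy data and assumed names:

    from pyspark.sql import functions as F

    df = spark.createDataFrame([(1, ["a", "b"])], ["id", "arr"])

    # DataFrame API
    df.select("id", F.explode("arr").alias("elem")).show()

    # Or the SQL form
    df.createOrReplaceTempView("t")
    spark.sql("select id, elem from t lateral view explode(arr) x as elem").show()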

Re: Spark vs Snowflake

2018-01-22 Thread Patrick McCarthy
Last I heard of them a year or two ago, they basically repackage AWS services behind their own API/service layer for convenience. There's probably a value-add if you're not familiar with optimizing AWS, but if you already have that expertise I don't expect they would add much extra performance if

Re: StreamingLogisticRegressionWithSGD : Multiclass Classification : Options

2018-01-19 Thread Patrick McCarthy
using OneVsRest with StreamingLogisticRegressionWithSGD? > Regards, Sundeep > On Thu, Jan 18, 2018 at 8:18 PM, Patrick McCarthy <pmccar...@dstillery.com> wrote: >> As a hack, you could perform a number of 1 vs. all classifiers and then post-hoc select

Re: StreamingLogisticRegressionWithSGD : Multiclass Classification : Options

2018-01-18 Thread Patrick McCarthy
As a hack, you could perform a number of 1 vs. all classifiers and then post-hoc select among the highest prediction probability to assign class. On Thu, Jan 18, 2018 at 12:17 AM, Sundeep Kumar Mehta wrote: > Hi, > I was looking for Logistic Regression with Multi Class
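
A hedged sketch of that hack in batch form (a streaming variant would follow the same pattern); column names are assumed:

    from pyspark.ml.classification import LogisticRegression
    from pyspark.sql import functions as F

    classes = [r[0] for r in df.select("label").distinct().collect()]
    models = {}
    for c in classes:
        binarized = df.withColumn("bin", (F.col("label") == c).cast("double"))
        models[c] = (LogisticRegression(labelCol="bin", featuresCol="features")
                     .fit(binarized))
    # Score with every model and assign the class whose model reports the
    # highest positive-class probability.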

Re: PySpark - Expand rows into dataframes via function

2017-10-03 Thread Patrick McCarthy
anges; you need to dynamically construct structs based on the range values. Hope this helps. > Thanks, Sathish > On Mon, Oct 2, 2017 at 9:01 AM Patrick McCarthy <pmccar...@dstillery.com> wrote: >> Hello, >> I'm trying to map ARIN regis

PySpark - Expand rows into dataframes via function

2017-10-02 Thread Patrick McCarthy
Hello, I'm trying to map ARIN registry files into more explicit IP ranges. They provide a number of IPs in the range (here it's 8192) and a starting IP, and I'm trying to map it into all the included /24 subnets. For example, Input: array(['arin', 'US', 'ipv4', '23.239.160.0', 8192, 20131104.0,
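
The expansion itself can be sketched with the standard library: 8192 addresses starting at 23.239.160.0 is exactly a /19, which contains 32 /24s. A hedged illustration, not the thread's code:

    import ipaddress

    start, count = "23.239.160.0", 8192
    prefix = 33 - count.bit_length()  # 8192 = 2**13 -> /19
    net = ipaddress.ip_network("{}/{}".format(start, prefix))
    subnets = [str(s) for s in net.subnets(new_prefix=24)]
    # ['23.239.160.0/24', '23.239.161.0/24', ..., '23.239.191.0/24'], 32 blocks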

Re: Apache Spark: Parallelization of Multiple Machine Learning ALgorithm

2017-09-05 Thread Patrick McCarthy
You might benefit from watching this JIRA issue - https://issues.apache.org/jira/browse/SPARK-19071 On Sun, Sep 3, 2017 at 5:50 PM, Timsina, Prem wrote: > Is there a way to parallelize multiple ML algorithms in Spark? My use case is something like this: > A) Run

Re: Memory problems with simple ETL in Pyspark

2017-04-16 Thread Patrick McCarthy
rease number of executors >> Ayan >> On Sat, 15 Apr 2017 at 2:10 am, Patrick McCarthy <pmccar...@dstillery.com> wrote: >>> Hello, I'm trying to build an ETL job which takes in 30-100gb of text data a

Memory problems with simple ETL in Pyspark

2017-04-14 Thread Patrick McCarthy
Hello, I'm trying to build an ETL job which takes in 30-100gb of text data and prepares it for SparkML. I don't speak Scala so I've been trying to implement in PySpark on YARN, Spark 2.1. Despite the transformations being fairly simple, the job always fails by running out of executor memory.

Re: Twitter4J streaming question

2015-07-23 Thread Patrick McCarthy
How can I tell if it's the sample stream or full stream? Thanks. On Jul 23, 2015, at 4:17 PM, Enno Shioji <eshi...@gmail.com> wrote: You are probably listening to the sample stream, and THEN filtering. This means you listen to 1% of the twitter

Re: Twitter4J streaming question

2015-07-23 Thread Patrick McCarthy
, Patrick McCarthy <pmccar...@eatonvance.com> wrote: How can I tell if it's the sample stream or full stream? Thanks. On Jul 23, 2015, at 4:17 PM, Enno Shioji <eshi...@gmail.com> wrote: You are probably listening to the sample