unsubscribe

2023-11-09 Thread Duflot Patrick
unsubscribe

Re: Spark join produce duplicate rows in resultset

2023-10-22 Thread Patrick Tucci
.* to select I.*. This will show you the records from item that the join produces. If the first part of the code only returns one record, I expect you will see 4 distinct records returned here. Thanks, Patrick On Sun, Oct 22, 2023 at 1:29 AM Meena Rajani wrote: > Hello all: > > I am using

Re: Spark stand-alone mode

2023-09-19 Thread Patrick Tucci
Multiple applications can run at once, but you need to either configure Spark or your applications to allow that. In stand-alone mode, each application attempts to take all resources available by default. This section of the documentation has more details:
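
A minimal sketch of the per-application cap that lets several standalone applications run side by side (master URL and values are hypothetical):

    from pyspark.sql import SparkSession

    # Without spark.cores.max, a standalone application claims every
    # available core and blocks other applications from starting.
    spark = (SparkSession.builder
             .master("spark://master-host:7077")   # hypothetical master URL
             .appName("shared-cluster-app")
             .config("spark.cores.max", "4")       # total cores for this app
             .config("spark.executor.memory", "2g")
             .getOrCreate())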

Re: Spark stand-alone mode

2023-09-15 Thread Patrick Tucci
I use Spark in standalone mode. It works well, and the instructions on the site are accurate for the most part. The only thing that didn't work for me was the start_all.sh script. Instead, I use a simple script that starts the master node, then uses SSH to connect to the worker machines and start

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Patrick Tucci
> such loss, damage or destruction. > > > > > On Thu, 17 Aug 2023 at 21:01, Patrick Tucci > wrote: > >> Hi Mich, >> >> Here are my config values from spark-defaults.conf: >> >> spark.eventLog.enabled true >> spark.eventLog.dir hdfs://10.0

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Patrick Tucci
acquires all available cluster resources when it starts. This is okay; as of right now, I am the only user of the cluster. If I add more users, they will also be SQL users, submitting queries through the Thrift server. Let me know if you have any other questions or thoughts. Thanks, Patrick On Thu

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Patrick Tucci
to this thread if the issue comes up again (hopefully it doesn't!). Thanks again, Patrick On Thu, Aug 17, 2023 at 1:54 PM Mich Talebzadeh wrote: > Hi Patrick, > > glad that you have managed to sort this problem out. Hopefully it will go > away for good. > > Still we are in the dark abou

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Patrick Tucci
that the driver didn't have enough memory to broadcast objects. After increasing the driver memory, the query runs without issue. I hope this can be helpful to someone else in the future. Thanks again for the support, Patrick On Sun, Aug 13, 2023 at 7:52 AM Mich Talebzadeh wrote: > OK I use H
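
For context, the driver heap must be sized before the driver JVM launches, so a fix like the one described would go in spark-defaults.conf (or the equivalent spark-submit flag); the value below is illustrative, not the author's actual setting:

    spark.driver.memory 8g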

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-13 Thread Patrick Tucci
loss, damage or destruction of data or any other property which may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss, damage or destruction. > > > > > On

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-12 Thread Patrick Tucci
ll responsibility for any > loss, damage or destruction of data or any other property which may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss, damage or destruction. >

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-12 Thread Patrick Tucci
to Delta Lake and see if that solves the issue. Thanks again for your feedback. Patrick On Fri, Aug 11, 2023 at 10:09 AM Mich Talebzadeh wrote: > Hi Patrick, > > There is not anything wrong with Hive On-premise it is the best data > warehouse there is > > Hive handles both ORC and P

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-11 Thread Patrick Tucci
-to-delta-using-jdbc Thanks again to everyone who replied for their help. Patrick On Fri, Aug 11, 2023 at 2:14 AM Mich Talebzadeh wrote: > Steve may have a valid point. You raised an issue with concurrent writes > before, if I recall correctly. Since this limitation may be due to Hive >

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Patrick Tucci
of the reason why I chose it. Thanks again for the reply, I truly appreciate your help. Patrick On Thu, Aug 10, 2023 at 3:43 PM Mich Talebzadeh wrote: > sorry host is 10.0.50.1 > > Mich Talebzadeh, > Solutions Architect/Engineering Lead > London > United Kingdom > > >view m

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Patrick Tucci
hadoop -f command.sql Thanks again for your help. Patrick On Thu, Aug 10, 2023 at 2:24 PM Mich Talebzadeh wrote: > Can you run this sql query through hive itself? > > Are you using this command or similar for your thrift server? > > beeline -u jdbc:hive2:/

Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Patrick Tucci
, but no stages or tasks are executing or pending: [image: image.png] I've let the query run for as long as 30 minutes with no additional stages, progress, or errors. I'm not sure where to start troubleshooting. Thanks for your help, Patrick

Re: Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-30 Thread Patrick Tucci
, Patrick On Sun, Jul 30, 2023 at 5:30 AM Pol Santamaria wrote: > Hi Patrick, > > You can have multiple writers simultaneously writing to the same table in > HDFS by utilizing an open table format with concurrency control. Several > formats, such as Apache Hudi, Apache Iceb
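
A minimal sketch of the open-table-format suggestion using Delta Lake (assumes the delta-spark package is configured; the table path is hypothetical):

    # Appends from multiple writers are coordinated by Delta's optimistic
    # concurrency control, unlike plain Hive tables on HDFS.
    (df.write                     # df: the batch of rows to insert
       .format("delta")
       .mode("append")
       .save("hdfs:///user/spark/warehouse/eventclaims_delta"))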

Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-29 Thread Patrick Tucci
/user/spark/warehouse/eventclaims. Is it possible to have multiple concurrent writers to the same table with Spark SQL? Is there any way to make this work? Thanks for the help. Patrick

Re: Spark-Sql - Slow Performance With CTAS and Large Gzipped File

2023-06-26 Thread Patrick Tucci
. The same CTAS query only took about 45 minutes. This is still a bit slower than I had hoped, but the import from bzip fully utilized all available cores. So we can give the cluster more resources if we need the process to go faster. Patrick On Mon, Jun 26, 2023 at 12:52 PM Mich Talebzadeh wrote

Spark-Sql - Slow Performance With CTAS and Large Gzipped File

2023-06-26 Thread Patrick Tucci
d take more than 24x longer than a simple SELECT COUNT(*) statement. Thanks for any help. Please let me know if I can provide any additional information. Patrick Create Table.sql Description: Binary data - To unsubscribe e-mail

Re: [PySpark] Getting the best row from each group

2022-12-19 Thread Patrick Tucci
Window functions don't work like traditional GROUP BYs. They allow you to partition data and pull any relevant column, whether it's used in the partition or not. I'm not sure what the syntax is for PySpark, but the standard SQL would be something like this: WITH InputData AS ( SELECT 'USA'
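
For the PySpark syntax the reply was unsure of, a minimal sketch of the same best-row-per-group pattern (schema and data are invented):

    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("USA", "a", 10), ("USA", "b", 7), ("CAN", "c", 3)],
        ["country", "id", "score"])

    # Rank rows within each group, then keep the top row per group.
    w = Window.partitionBy("country").orderBy(F.col("score").desc())
    best = (df.withColumn("rn", F.row_number().over(w))
              .filter(F.col("rn") == 1)
              .drop("rn"))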

RE: Re: [Spark Sql] Global Setting for Case-Insensitive String Compare

2022-11-22 Thread Patrick Tucci
Thanks. How would I go about formally submitting a feature request for this? On 2022/11/21 23:47:16 Andrew Melo wrote: > I think this is the right place, just a hard question :) As far as I > know, there's no "case insensitive flag", so YMMV > > On Mon, Nov 21, 2022 at

RE: [Spark Sql] Global Setting for Case-Insensitive String Compare

2022-11-21 Thread Patrick Tucci
Is this the wrong list for this type of question? On 2022/11/12 16:34:48 Patrick Tucci wrote: > Hello, > > Is there a way to set string comparisons to be case-insensitive globally? I > understand LOWER() can be used, but my codebase contains 27k lines of SQL > and many string
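
Absent a global setting, the usual workaround is to normalize case explicitly on both sides of the comparison; a minimal PySpark sketch (column and literal are made up):

    from pyspark.sql import functions as F

    # Equivalent to SQL: SELECT * FROM t WHERE LOWER(name) = 'smith'
    df.filter(F.lower(F.col("name")) == "smith")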

[Spark Sql] Global Setting for Case-Insensitive String Compare

2022-11-12 Thread Patrick Tucci
row(s) Desired behavior would be true for all of the above with the proposed case-insensitive flag set. Thanks, Patrick

Profiling options for PandasUDF (2.4.7 on yarn)

2021-05-28 Thread Patrick McCarthy
of (count, row_id, column_id). It works at small scale but gets unstable as I scale up. Is there a way to profile this function in a spark session or am I limited to profiling on pandas data frames without spark? -- *Patrick McCarthy * Senior Data Scientist, Machine Learning Engineering

Re: Issue while installing dependencies Python Spark

2020-12-18 Thread Patrick McCarthy
that risk? In either case you move about the same number of bytes around. On Fri, Dec 18, 2020 at 3:04 PM Sachit Murarka wrote: > Hi Patrick/Users, > > I am exploring wheel file form packages for this , as this seems simple:- > > > https://bytes.grubhub.com/managing-dependen

Re: Getting error message

2020-12-17 Thread Patrick McCarthy
ing code in a local machine that is single node machine. > > Getting into logs, it looked like the host is killed. This is happening > very frequently and I am unable to find the reason of this. > > Could low memory be the reason? > > On Fri, 18 Dec 2020, 00:11 Patrick McCar

Re: Getting error message

2020-12-17 Thread Patrick McCarthy
gram starts running fine. > This error goes away on > > On Thu, 17 Dec 2020, 23:50 Patrick McCarthy, > wrote: > >> my-domain.com/192.168.166.8:63534 probably isn't a valid address on your >> network, is it? >> >> On Thu, Dec 17, 2020 at 3:03 AM Vikas Garg wr

Re: Issue while installing dependencies Python Spark

2020-12-17 Thread Patrick McCarthy
path/to/venv/bin/python3 > > This did not help too.. > > Kind Regards, > Sachit Murarka > -- *Patrick McCarthy * Senior Data Scientist, Machine Learning Engineering Dstillery 470 Park Ave South, 17th Floor, NYC 10016

Re: [Spark Core] Vectorizing very high-dimensional data sourced in long format

2020-10-30 Thread Patrick McCarthy
there other Spark patterns that I should attempt in order to achieve > my end goal of a vector of attributes for every entity? > > Thanks, Daniel > -- *Patrick McCarthy * Senior Data Scientist, Machine Learning Engineering Dstillery 470 Park Ave South, 17th Floor, NYC 10016

Re: Hive using Spark engine vs native spark with hive integration.

2020-10-07 Thread Patrick McCarthy
ome of the > performance features, for example things like caching/evicting etc. > > > > > > Any advice on this is much appreciated. > > > > > > Thanks, > > -Manu > > > -- *Patrick McCarthy * Senior Data Scientist, Machine Learning Engineering Dstillery 470 Park Ave South, 17th Floor, NYC 10016

Re: regexp_extract regex for extracting the columns from string

2020-08-10 Thread Patrick McCarthy
> apart from udf,is there any way to achieved it. > > > Thanks > > > > -- > Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ > > ----- > To unsubscribe e-mail: user-unsubscr...@spark.apache.org >

Re: [Spark SQL]: Can't write DataFrame after using explode function on multiple columns.

2020-08-03 Thread Patrick McCarthy
columns with list comprehensions forming a single select() statement makes for a smaller DAG. On Mon, Aug 3, 2020 at 10:06 AM Henrique Oliveira wrote: > Hi Patrick, thank you for your quick response. > That's exactly what I think. Actually, the result of this processing is an > int

Re: [Spark SQL]: Can't write DataFrame after using explode function on multiple columns.

2020-08-03 Thread Patrick McCarthy
rk-user-list.1001560.n3.nabble.com/ > > - > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > > -- *Patrick McCarthy * Senior Data Scientist, Machine Learning Engineering Dstillery 470 Park Ave South, 17th Floor, NYC 10016

Re: Issue in parallelization of CNN model using spark

2020-07-14 Thread Patrick McCarthy
>> > Mukhtaj >> > >> > >> > >> > >> >> - >> To unsubscribe e-mail: user-unsubscr...@spark.apache.org >> >> -- *Patrick McCarthy * Senior Data Scientist, Machine Learning Engineering Dstillery 470 Park Ave South, 17th Floor, NYC 10016

Building Spark 3.0.0 for Hive 1.2

2020-07-10 Thread Patrick McCarthy
ConfString(key, value)
  File "/home/pmccarthy/custom-spark-3/python/lib/py4j-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/home/pmccarthy/custom-spark-3/python/pyspark/sql/utils.py", line 137, in deco
    raise_from(converted)
  File "", line 3, in

Re: Reading TB of JSON file

2020-06-18 Thread Patrick McCarthy
fford having 50 GB on driver memory. In general, what > is the best practice to read large JSON file like 50 GB? > > Thanks > -- *Patrick McCarthy * Senior Data Scientist, Machine Learning Engineering Dstillery 470 Park Ave South, 17th Floor, NYC 10016

Re: Add python library

2020-06-08 Thread Patrick McCarthy
low is am >>> example: >>> >>> def do_something(p): >>> ... >>> >>> rdd = sc.parallelize([ >>> {"x": 1, "y": 2}, >>> {"x": 2, "y": 3}, >>> {"x": 3,

Re: Using existing distribution for join when subset of keys

2020-05-31 Thread Patrick Woody
You can use bucketBy to avoid shuffling in your scenario. This test suite > has some examples: > https://github.com/apache/spark/blob/45cf5e99503b00a6bd83ea94d6d92761db1a00ab/sql/core/src/test/scala/org/apache/spark/sql/sources/BucketedReadSuite.scala#L343 > > Thanks, > Terry > > On S
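
A minimal sketch of the bucketBy suggestion (bucket count and table names are hypothetical; both sides must be bucketed and sorted identically for the shuffle to be avoided):

    # Persist both join sides bucketed on the join keys...
    (df_a.write.bucketBy(64, "x", "y").sortBy("x", "y")
         .mode("overwrite").saveAsTable("a_bucketed"))
    (df_b.write.bucketBy(64, "x", "y").sortBy("x", "y")
         .mode("overwrite").saveAsTable("b_bucketed"))

    # ...so the sort-merge join can reuse the on-disk distribution.
    joined = spark.table("a_bucketed").join(spark.table("b_bucketed"), ["x", "y"])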

Using existing distribution for join when subset of keys

2020-05-31 Thread Patrick Woody
Hey all, I have one large table, A, and two medium sized tables, B & C, that I'm trying to complete a join on efficiently. The result is multiplicative on A join B, so I'd like to avoid shuffling that result. For this example, let's just assume each table has three columns, x, y, z. The below is

[Structured Streaming] Connecting to Kafka via a Custom Consumer / Producer

2020-04-22 Thread Patrick McGloin
, copy the Spark code base and swap in our custom Consumer for the KafkaConsumer used in that function (and a few other changes). This leaves us with a codebase to maintain that will be out of sync over time. Or we can build and maintain our own custom connector. Best regards, Patrick

Re: Save Spark dataframe as dynamic partitioned table in Hive

2020-04-16 Thread Patrick McCarthy
ION (broadcastId = > broadcastValue, brand = dummy) > > -^^^ > SELECT > ocis_party_id AS partyId > , target_mobile_no AS phoneNumber > FROM tmp > > It fails passing part

Design pattern to invert a large map

2020-03-31 Thread Patrick McCarthy
is to restage the data in a partitioned, bucketed flat table as an intermediary step but that too is costly in terms of disk space and transform time. Thanks, Patrick

Spark Mllib logistic regression setWeightCol illegal argument exception

2020-01-09 Thread Patrick
Hi Spark Users, I am trying to solve a class imbalance problem, I figured out, spark supports setting weight in its API but I get IIlegal Argument exception weight column do not exist, but it do exists in the dataset. Any recommedation to go about this problem ? I am using Pipeline API with
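
For reference, a sketch of wiring a weight column into the estimator (column names assumed; a common cause of this error is an upstream pipeline stage dropping the column before the estimator sees it):

    from pyspark.ml.classification import LogisticRegression

    # train_df must still contain 'classWeight' when this stage runs.
    lr = LogisticRegression(featuresCol="features", labelCol="label",
                            weightCol="classWeight")
    model = lr.fit(train_df)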

Re: Using Percentile in Spark SQL

2019-11-11 Thread Patrick McCarthy
oyee or agent responsible for delivering it to the intended recipient, >> you are hereby notified that any use, dissemination, distribution or >> copying of this communication and/or its content is strictly prohibited. If >> you are not the intended recipient, please immediately notify us by repl

Best practices for data like file storage

2019-11-01 Thread Patrick McCarthy
Hi List, I'm looking for resources to learn about how to store data on disk for later access. For a while my team has been using Spark on top of our existing hdfs/Hive cluster without much agency as far as what format is used to store the data. I'd like to learn more about how to re-stage my

Re: [Spark SQL]: Does Union operation followed by drop duplicate follows "keep first"

2019-09-13 Thread Patrick McCarthy
licate based on > the value of a specific column. But, I want to make sure that while > dropping duplicates, the rows from first data frame are kept. > > Example: > df1 = df1.union(df2).dropDuplicates(['id']) > > > -- *Patrick McCarthy * Senior Data Scientist, Machine

Re: script running in jupyter 6-7x faster than spark submit

2019-09-11 Thread Patrick McCarthy
>>> dhruba.w...@gmail.com>: >>>> >>>>> No, i checked for that, hence written "brand new" jupyter notebook. >>>>> Also the time taken by both are 30 mins and ~3hrs as i am reading a 500 >>>>> gigs compressed base64 encoded tex

Re: script running in jupyter 6-7x faster than spark submit

2019-09-10 Thread Patrick McCarthy
executed are also the same and from same user. > > What i found is the the quantile values for median for one ran with > jupyter was 1.3 mins and one ran with spark-submit was ~8.5 mins. I am not > able to figure out why this is happening. > > Any one faced this kind of issue b

Re: Hive external table not working in sparkSQL when subdirectories are present

2019-08-07 Thread Patrick McCarthy
count(*) from ExtTable* via the Hive CLI, it successfully gives me the >>>> expected count of records in the table. >>>> However, when i fire the same query via sparkSQL, i get count = 0. >>>> >>>> I think the sparkSQL isn't able to descend into the subdirectories for >>>> getting the data while hive is able to do so. >>>> Are there any configurations needed to be set on the spark side so that >>>> this works as it does via hive cli? >>>> I am using Spark on YARN. >>>> >>>> Thanks, >>>> Rishikesh >>>> >>>> Tags: subdirectories, subdirectory, recursive, recursion, hive external >>>> table, orc, sparksql, yarn >>>> >>> -- *Patrick McCarthy * Senior Data Scientist, Machine Learning Engineering Dstillery 470 Park Ave South, 17th Floor, NYC 10016

Re: Spark Image resizing

2019-07-31 Thread Patrick McCarthy
("image").load(imageDir) >> >> Can you please help me with this? >> >> Nick >> > -- *Patrick McCarthy * Senior Data Scientist, Machine Learning Engineering Dstillery 470 Park Ave South, 17th Floor, NYC 10016

Re: [Beginner] Run compute on large matrices and return the result in seconds?

2019-07-17 Thread Patrick McCarthy
and their > corresponding row keys need to be returned in under 5 seconds. > > 4. Users will eventually request random row/column subsets to run > calculations on, so precomputing our coefficients is not an option. This > needs to be done on request. > > > > I've been l

Re: spark python script importError problem

2019-07-16 Thread Patrick McCarthy
ror: No module named feature.user.user_feature > > The script also run well in "sbin/start-master.sh sbin/start-slave.sh",but > it has the same importError problem in "sbin/start-master.sh > sbin/start-slaves.sh".The conf/slaves contents is 'localhost'. > > W

Re: Anaconda installation with Pyspark/Pyarrow (2.3.0+) on cloudera managed server

2019-05-06 Thread Patrick McCarthy
looked back :) (The common memory model via Arrow is a nice boost too!) On Mon, May 6, 2019 at 11:13 AM Gourav Sengupta wrote: > The proof is in the pudding > > :) > > > > On Mon, May 6, 2019 at 2:46 PM Gourav Sengupta > wrote: > >> Hi Patrick, >> &

Re: Anaconda installation with Pyspark/Pyarrow (2.3.0+) on cloudera managed server

2019-05-06 Thread Patrick McCarthy
prove it? > > On Sun, May 5, 2019 at 9:24 PM Patrick McCarthy > wrote: > >> I disagree that it's hype. Perhaps not 1:1 with pure scala >> performance-wise, but for python-based data scientists or others with a lot >> of python expertise it allows one to do things that would

Re: Anaconda installation with Pyspark/Pyarrow (2.3.0+) on cloudera managed server

2019-05-05 Thread Patrick McCarthy
d it is up to the user to ensure that the grouped > data will fit into the available memory. > > Let me know about your used case in case possible > > > Regards, > Gourav > > On Sun, May 5, 2019 at 3:59 AM Rishi Shah > wrote: > >> Thanks Patrick! I tried to p

Re: Anaconda installation with Pyspark/Pyarrow (2.3.0+) on cloudera managed server

2019-04-30 Thread Patrick McCarthy
his directory doesn't >> include all the packages to form a proper parcel for distribution. >> >> Any help is much appreciated! >> >> -- >> Regards, >> >> Rishi Shah >> > > > -- > Regards, > > Rishi Shah > -- *Patrick McCarthy * Senior Data Scientist, Machine Learning Engineering Dstillery 470 Park Ave South, 17th Floor, NYC 10016

Re: Spark ML with null labels

2019-01-10 Thread Patrick McCarthy
s need to do 1:1 mapping. > > On Thu, Jan 10, 2019, 7:53 AM Patrick McCarthy > >> I'm trying to implement an algorithm on the MNIST digits that runs like >> so: >> >> >>- for every pair of digits (0,1), (0,2), (0,3)... assign a 0/1 label >>to t

Spark ML with null labels

2019-01-10 Thread Patrick McCarthy
I'm trying to implement an algorithm on the MNIST digits that runs like so: - for every pair of digits (0,1), (0,2), (0,3)... assign a 0/1 label to the digits and build a LogisticRegression Classifier -- 45 in total - Fit every classifier on the test set separately - Aggregate the

Re: Need help with SparkSQL Query

2018-12-17 Thread Patrick McCarthy
Untested, but something like the below should work:

from pyspark.sql import functions as F
from pyspark.sql import window as W

(record
 .withColumn('ts_rank',
             F.dense_rank().over(W.Window.orderBy('timestamp').partitionBy('id')))
 .filter(F.col('ts_rank') == 1)
 .drop('ts_rank')
)

On Mon, Dec 17,

Re: Questions on Python support with Spark

2018-11-12 Thread Patrick McCarthy
I've never tried to run a stand-alone cluster alongside hadoop, but why not run Spark as a yarn application? That way it can absolutely (in fact preferably) use the distributed file system. On Fri, Nov 9, 2018 at 5:04 PM, Arijit Tarafdar wrote: > Hello All, > > > > We have a requirement to run

Re: [Spark UI] Spark 2.3.1 UI no longer respects spark.ui.retainedJobs

2018-10-25 Thread Patrick Brown
Done: https://issues.apache.org/jira/browse/SPARK-25837 On Thu, Oct 25, 2018 at 10:21 AM Marcelo Vanzin wrote: > Ah that makes more sense. Could you file a bug with that information > so we don't lose track of this? > > Thanks > On Wed, Oct 24, 2018 at 6:13 PM Patrick

Re: [Spark UI] Spark 2.3.1 UI no longer respects spark.ui.retainedJobs

2018-10-23 Thread Patrick Brown
in > memory (checked with jvisualvm). > > On Sat, Oct 20, 2018 at 6:45 PM Marcelo Vanzin > wrote: > > > > On Tue, Oct 16, 2018 at 9:34 AM Patrick Brown > > wrote: > > > I recently upgraded to spark 2.3.1 I have had these same settings in > my spark submit script, whi

[Spark UI] Spark 2.3.1 UI no longer respects spark.ui.retainedJobs

2018-10-16 Thread Patrick Brown
I recently upgraded to spark 2.3.1 I have had these same settings in my spark submit script, which worked on 2.0.2, and according to the documentation appear to not have changed: spark.ui.retainedTasks=1 spark.ui.retainedStages=1 spark.ui.retainedJobs=1 However in 2.3.1 the UI doesn't seem to

Re: Spark Structured Streaming resource contention / memory issue

2018-10-15 Thread Patrick McGloin
Hi Jungtaek, Thanks, we thought that might be the issue but haven't tested yet as building against an unreleased version of Spark is tough for us, due to network restrictions. We will try though. I will report back if we find anything. Best regards, Patrick On Fri, Oct 12, 2018, 2:57 PM

Spark Structured Streaming resource contention / memory issue

2018-10-12 Thread Patrick McGloin
dump from one of the executors as this issue is happening but I cannot see any resource they are blocked on: Are we hitting a GC problem and why is it manifesting in this way? Is there another resource that is blocking and what is it? Thanks, Patrick

Re: Python Dependencies Issue on EMR

2018-09-14 Thread Patrick McCarthy
You didn't say how you're zipping the dependencies, but I'm guessing you either include .egg files or zipped up a virtualenv. In either case, the extra C stuff that scipy and pandas rely upon doesn't get included. An approach like this solved the last problem I had that seemed like this -

Re: How to make pyspark use custom python?

2018-09-06 Thread Patrick McCarthy
It looks like for whatever reason your cluster isn't using the python you distributed, or said distribution doesn't contain what you think. I've used the following with success to deploy a conda environment to my cluster at runtime:
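
The command itself is cut off in this preview; one common runtime-deployment pattern (a sketch under assumed paths, not necessarily the author's exact approach) ships a packed conda environment and points the Python workers at it:

    from pyspark.sql import SparkSession

    # Assumes the environment was packed (e.g. with conda-pack) and uploaded
    # beforehand; all paths are hypothetical.
    spark = (SparkSession.builder
             .config("spark.yarn.dist.archives", "hdfs:///envs/myenv.tar.gz#env")
             .config("spark.yarn.appMasterEnv.PYSPARK_PYTHON", "./env/bin/python")
             .config("spark.executorEnv.PYSPARK_PYTHON", "./env/bin/python")
             .getOrCreate())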

Re: Pitfalls of partitioning by host?

2018-08-28 Thread Patrick McCarthy
I'm not 100% sure, but a naive repartition() seems to cause a shuffle. If this is actually happening, it's just wasteful overhead. The ambition is to say "divide the data into partitions, but make sure you don't move it in doing so". On Tue, Aug 28, 2018 at 2:06 PM, Patrick McCar

Re: Pitfalls of partitioning by host?

2018-08-28 Thread Patrick McCarthy
I'm not 100% sure, but a naive repartition() seems to cause a shuffle. If this is actually happening, it's just wasteful overhead. On Tue, Aug 28, 2018 at 1:03 PM, Sonal Goyal wrote: > Hi Patrick, > > Sorry is there something here that helps you beyond repartition(number of &g

Re: Pitfalls of partitioning by host?

2018-08-28 Thread Patrick McCarthy
t. My question is, is there anything else > that you would expect to gain, except for enforcing maybe a dataset that is > already bucketed? Like you could enforce that data is where it is supposed > to be, but what else would you avoid? > > Sent from my iPhone > > > On Aug 27, 2018, at 1

Re: Pitfalls of partitioning by host?

2018-08-28 Thread Patrick McCarthy
totally balanced, then I'd hope that I could save a lot of overhead with foo = df.withColumn('randkey',F.floor(1000*F.rand())).repartition(5000, 'randkey','host').apply(udf) On Tue, Aug 28, 2018 at 10:28 AM, Patrick McCarthy wrote: > Mostly I'm guessing that it adds efficiency to a job wh

Pitfalls of partitioning by host?

2018-08-27 Thread Patrick McCarthy
When debugging some behavior on my YARN cluster I wrote the following PySpark UDF to figure out what host was operating on what row of data:

@F.udf(T.StringType())
def add_hostname(x):
    import socket
    return str(socket.gethostname())

It occurred to me that I could use this to enforce

Re: How to merge multiple rows

2018-08-22 Thread Patrick McCarthy
You didn't specify which API, but in pyspark you could do

import pyspark.sql.functions as F
df.groupBy('ID').agg(F.sort_array(F.collect_set('DETAILS')).alias('DETAILS')).show()

+---+------------+
| ID|     DETAILS|
+---+------------+
|  1|[A1, A2, A3]|
|  3|        [B2]|
|  2|        [B1]|
+---+------------+

Re: Two different Hive instances running

2018-08-17 Thread Patrick Alwell
You probably need to take a look at your hive-site.xml and see what the location is for the Hive Metastore. As for beeline, you can explicitly use an instance of Hive server by passing in the JDBC url to the hiveServer when you launch the client; e.g. beeline -u "jdbc://example.com:5432" Try

Re: How to Create one DB connection per executor and close it after the job is done?

2018-07-30 Thread Patrick McGloin
You could use an object in Scala, of which only one instance will be created on each JVM / Executor. E.g. object MyDatabseSingleton { var dbConn = ??? } On Sat, 28 Jul 2018, 08:34 kant kodali, wrote: > Hi All, > > I understand creating a connection forEachPartition but I am wondering can >
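
A rough PySpark analogue of the singleton idea (a sketch; connect_to_db is a hypothetical helper) opens one connection per partition rather than per record:

    def write_partition(rows):
        conn = connect_to_db()   # hypothetical helper: one connection per partition
        try:
            for row in rows:
                conn.write(row)
        finally:
            conn.close()

    df.foreachPartition(write_partition)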

Re: Arrow type issue with Pandas UDF

2018-07-24 Thread Patrick McCarthy
Thanks Byran. I think it was ultimately groupings that were too large - after setting spark.sql.shuffle.partitions to a much higher number I was able to get the UDF to execute. On Fri, Jul 20, 2018 at 12:45 AM, Bryan Cutler wrote: > Hi Patrick, > > It looks like it's failing in Sca

Arrow type issue with Pandas UDF

2018-07-19 Thread Patrick McCarthy
PySpark 2.3.1 on YARN, Python 3.6, PyArrow 0.8. I'm trying to run a pandas UDF, but I seem to get nonsensical exceptions in the last stage of the job regardless of my output type. The problem I'm trying to solve: I have a column of scalar values, and each value on the same row has a sorted

Spark accessing fakes3

2018-07-11 Thread Patrick Roemer
(), but I'm at loss how to get it to work with a local fakes3. The only reference I've found so far is this issue, where somebody seems to have gotten close, but unfortunately he's forgotten about the details: https://github.com/jubos/fake-s3/issues/108 Thanks and best regards, Patrick

Re: DataTypes of an ArrayType

2018-07-11 Thread Patrick McCarthy
Arrays need to be a single type, I think you're looking for a Struct column. See: https://medium.com/@mrpowers/adding-structtype-columns-to-spark-dataframes-b44125409803 On Wed, Jul 11, 2018 at 6:37 AM, dimitris plakas wrote: > Hello everyone, > > I am new to Pyspark and i would like to ask if
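
A minimal sketch of the struct suggestion (column names invented):

    from pyspark.sql import functions as F

    # An ArrayType column holds one element type; a struct can mix types.
    df = spark.createDataFrame([(1, "a", 2.5)], ["id", "s", "d"])
    mixed = df.select(F.struct("s", "d").alias("mixed"))
    # mixed's schema: mixed: struct<s: string, d: double>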

Re: Building SparkML vectors from long data

2018-07-03 Thread Patrick McCarthy
quot;collected_val")) > .withColumn("collected_val", > toVector(col("collected_val")).as[Row](Encoders.javaSerialization(classOf[Row]))) > > > at least works. The indices still aren't in order in the vector - I don't > know if this matters much, but if it does,

Re: How to handle java.sql.Date inside Maps with to_json / from_json

2018-06-28 Thread Patrick McGloin
Hi all, I tested this with a Date outside a map and it works fine so I think the issue is simply for Dates inside Maps. I will create a Jira for this unless there are objections. Best regards, Patrick On Thu, 28 Jun 2018, 11:53 Patrick McGloin, wrote: > Consider the following test, wh

How to handle java.sql.Date inside Maps with to_json / from_json

2018-06-28 Thread Patrick McGloin
g the options to to_json / from_json but it hasn't helped. Am I using the wrong options? Is there another way to do this? Best regards, Patrick This message has been sent by ABN AMRO Bank N.V., which has its seat at Gustav Mahlerlaan 10 (1082 PP) Amsterdam, the Netherlands <https://maps.google.

Building SparkML vectors from long data

2018-06-12 Thread Patrick McCarthy
I work with a lot of data in a long format, cases in which an ID column is repeated, followed by a variable and a value column like so:

+---+-----+-------+
|ID | var | value |
+---+-----+-------+
| A | v1  |   1.0 |
| A | v2  |   2.0 |
| B | v1  |   1.5 |
| B | v3  |  -1.0 |
+---+-----+-------+
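
One way to get from this long format to per-ID vectors is to pivot wide and assemble; a sketch (the zero fill for missing variables is an assumption):

    from pyspark.ml.feature import VectorAssembler
    from pyspark.sql import functions as F

    # Pivot long -> wide (one column per variable), then assemble a vector.
    wide = df.groupBy("ID").pivot("var").agg(F.first("value")).na.fill(0.0)
    assembler = VectorAssembler(
        inputCols=[c for c in wide.columns if c != "ID"], outputCol="features")
    features = assembler.transform(wide).select("ID", "features")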

Poor performance reading Hive table made of sequence files

2018-05-01 Thread Patrick McCarthy
I recently ran a query with the following form: select a.*, b.* from some_small_table a inner join ( select things from someother table lateral view explode(s) ss as sss where a_key is in (x,y,z) ) b on a.key = b.key where someothercriterion On hive, this query took about five minutes. In

Re: [Beginner] How to save Kafka Dstream data to parquet ?

2018-02-28 Thread Patrick Alwell
I don’t think sql context is “deprecated” in this sense. It’s still accessible by earlier versions of Spark. But yes, at first glance it looks like you are correct. I don’t see a recordWriter method for parquet outside of the SQL package.

Re: Spark EMR executor-core vs Vcores

2018-02-26 Thread Patrick Alwell
+1 AFAIK, vCores are not the same as Cores in AWS. https://samrueby.com/2015/01/12/what-are-amazon-aws-vcpus/ I’ve always understood it as cores = num concurrent threads These posts might help you with your research and why exceeding 5 cores per executor doesn’t make sense.

Out of memory Error when using Collection Accumulator Spark 2.2

2018-02-26 Thread Patrick
Hi, We were getting OOM error when we are accumulating the results of each worker. We were trying to avoid collecting data to driver node instead used accumulator as per below code snippet, Is there any spark config to set the accumulator settings Or am i doing the wrong way to collect the huge

Reservoir sampling in parallel

2018-02-23 Thread Patrick McCarthy
? Thanks, Patrick [1] https://en.wikipedia.org/wiki/Reservoir_sampling

Re: Spark Dataframe and HIVE

2018-02-09 Thread Patrick Alwell
Might sound silly, but are you using a Hive context? What errors do the Hive query results return? spark = SparkSession.builder.enableHiveSupport().getOrCreate() The second part of your questions, you are creating a temp table and then subsequently creating another table from that temp view.

Re: Type Casting Error in Spark Data Frame

2018-01-29 Thread Patrick McCarthy
You can't select from an array like that, try instead using 'lateral view explode' in the query for that element, or before the sql stage (py)spark.sql.functions.explode. On Mon, Jan 29, 2018 at 4:26 PM, Arnav kumar wrote: > Hello Experts, > > I would need your advice in
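
A minimal sketch of both forms (names invented):

    from pyspark.sql import functions as F

    # DataFrame API: one output row per array element.
    df.select("id", F.explode("tags").alias("tag"))

    # SQL equivalent:
    spark.sql("SELECT id, tag FROM events LATERAL VIEW explode(tags) t AS tag")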

Re: I can't save DataFrame from running Spark locally

2018-01-23 Thread Patrick Alwell
Spark cannot read locally from S3 without an S3a protocol; you’ll more than likely need a local copy of the data or you’ll need to utilize the proper jars to enable S3 communication from the edge to the datacenter.
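
A minimal sketch of reading through the s3a connector (requires the hadoop-aws jar and its dependencies on the classpath; bucket and credentials are placeholders):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")
             .config("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY")
             .getOrCreate())

    df = spark.read.parquet("s3a://some-bucket/some/path/")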

Re: Spark vs Snowflake

2018-01-22 Thread Patrick McCarthy
Last I heard of them a year or two ago, they basically repackage AWS services behind their own API/service layer for convenience. There's probably a value-add if you're not familiar with optimizing AWS, but if you already have that expertise I don't expect they would add much extra performance if

Re: StreamingLogisticRegressionWithSGD : Multiclass Classification : Options

2018-01-19 Thread Patrick McCarthy
multiclass evaluation. On Fri, Jan 19, 2018 at 11:29 AM, Sundeep Kumar Mehta <sunnyjai...@gmail.com > wrote: > Thanks a lot Patrick, I do see a class OneVsRest classifier which only > takes classifier instance of ml package and not mlib package, do you see > any alternative for
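
For reference, a minimal sketch of the ml-package OneVsRest wrapper discussed here (train_df/test_df with 'features' and 'label' columns are assumed):

    from pyspark.ml.classification import LogisticRegression, OneVsRest

    # Fits one binary LogisticRegression per class, then picks the highest score.
    ovr = OneVsRest(classifier=LogisticRegression(maxIter=10))
    model = ovr.fit(train_df)
    preds = model.transform(test_df)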

Re: StreamingLogisticRegressionWithSGD : Multiclass Classification : Options

2018-01-18 Thread Patrick McCarthy
As a hack, you could perform a number of 1 vs. all classifiers and then post-hoc select among the highest prediction probability to assign class. On Thu, Jan 18, 2018 at 12:17 AM, Sundeep Kumar Mehta wrote: > Hi, > > I was looking for Logistic Regression with Multi Class

Re: Spark on EMR suddenly stalling

2017-12-28 Thread Patrick Alwell
Joren, Anytime there is a shuffle in the network, Spark moves to a new stage. It seems like you are having issues either pre or post shuffle. Have you looked at a resource management tool like ganglia to determine if this is a memory or thread related issue? The spark UI? You are using

Re: Spark based Data Warehouse

2017-11-12 Thread Patrick Alwell
Alcon, You can most certainly do this. I’ve done benchmarking with Spark SQL and the TPCDS queries using S3 as the filesystem. Zeppelin and Livy server work well for the dash boarding and concurrent query issues: https://hortonworks.com/blog/livy-a-rest-interface-for-apache-spark/ Livy
