Unsubscribe

2024-01-13 Thread Andrew Redd
Unsubscribe

Re: [spark.local.dir] comma separated list does not work

2024-01-12 Thread Andrew Petersen
e: > try it without spaces? > export SPARK_LOCAL_DIRS="/tmp,/share/" > > On Fri, Jan 12, 2024 at 5:00 PM Andrew Petersen > wrote: > >> Hello Spark community >> >> SPARK_LOCAL_DIRS or >> spark.local.dir >> is supposed to accept a list.

Re: [spark.local.dir] comma separated list does not work

2024-01-12 Thread Andrew Petersen
Without spaces was the first thing I tried. The information in the pdf file inspired me to try the space. On Fri, Jan 12, 2024 at 10:23 PM Koert Kuipers wrote: > try it without spaces? > export SPARK_LOCAL_DIRS="/tmp,/share/" > > On Fri, Jan 12, 2024 at 5:00 PM An

[spark.local.dir] comma separated list does not work

2024-01-12 Thread Andrew Petersen
rs. However, for me, Spark is only considering the 1st directory on the list: export SPARK_LOCAL_DIRS="/tmp, /share/" I am using Spark 3.4.1. Does anyone have any experience getting this to work? If so can you suggest a simple example I can try and tell me which version of Spark you are u
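
A minimal sketch of the no-spaces form discussed in this thread, assuming the setting is applied before the session starts (directory paths are hypothetical; on YARN or standalone the cluster manager's own local-dir setting can override spark.local.dir):

    from pyspark.sql import SparkSession

    # Each comma-separated entry is taken literally as a directory name,
    # so "/tmp, /share" would point at a directory named " /share".
    spark = (
        SparkSession.builder
        .appName("local-dirs-example")
        .config("spark.local.dir", "/tmp,/share")
        .getOrCreate()
    )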

Unsubscribe

2023-12-16 Thread Andrew Milkowski

APACHE Spark adoption/growth chart

2023-09-12 Thread Andrew Petersen
, but found no plots. Regards -- Andrew Petersen, PhD Advanced Computing, Office of Information Technology 2620 Hillsborough Street datascience.oit.ncsu.edu

Re: Troubleshooting ArrayIndexOutOfBoundsException in long running Spark application

2023-04-09 Thread Andrew Redd
remove On Wed, Apr 5, 2023 at 8:06 AM Mich Talebzadeh wrote: > OK Spark Structured Streaming. > > How are you getting messages into Spark? Is it Kafka? > > This to me index that the message is incomplete or having another value in > Json > > HTH > > Mich Talebzadeh, > Lead Solutions

Re: [Spark Sql] Global Setting for Case-Insensitive String Compare

2022-11-21 Thread Andrew Melo
I think this is the right place, just a hard question :) As far as I know, there's no "case insensitive flag", so YMMV On Mon, Nov 21, 2022 at 5:40 PM Patrick Tucci wrote: > > Is this the wrong list for this type of question? > > On 2022/11/12 16:34:48 Patrick Tucci wrote: > > Hello, > > > >
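
Since there is no session-wide case-insensitivity flag for string comparison, a minimal sketch of the usual workaround is to normalize both sides explicitly (the column and literal below are made up for illustration):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("Apple",), ("APPLE",), ("banana",)], ["fruit"])

    # Lower-case the column so the comparison ignores case.
    matches = df.filter(F.lower(F.col("fruit")) == "apple")
    matches.show()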

CVE-2022-33891 mitigation

2022-11-21 Thread Andrew Pomponio
at position I am not able to find this in VisualVM or MAT to determine what that is set to. Any thoughts? Andrew Pomponio | Associate Enterprise Architect, OpenLogic<https://www.openlogic.com/?utm_leadsource=email-signature_source=outlook-direct-email_medium=email_campaign=2019-common_cont

Re: Profiling PySpark Pandas UDF

2022-08-25 Thread Andrew Melo
Hi Gourav, Since Koalas needs the same round-trip to/from JVM and Python, I expect that the performance should be nearly the same for UDFs in either API Cheers Andrew On Thu, Aug 25, 2022 at 11:22 AM Gourav Sengupta wrote: > > Hi, > > May be I am jumping to conclusions and m

PySpark cores

2022-07-28 Thread Andrew Melo
Hello, Is there a way to tell Spark that PySpark (arrow) functions use multiple cores? If we have an executor with 8 cores, we would like to have a single PySpark function use all 8 cores instead of having 8 single core python functions run. Thanks! Andrew
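
One knob that addresses this, sketched below under the assumption that you control the session configuration, is spark.task.cpus: it reserves that many cores per task, so a single Arrow/pandas function task can use the whole 8-core executor instead of sharing it with seven other tasks. The trade-off is that only one such task runs per executor at a time.

    from pyspark.sql import SparkSession

    # Reserve all 8 executor cores for each task, so one Python worker
    # (e.g. a multi-threaded Arrow/pandas function) gets the full machine.
    spark = (
        SparkSession.builder
        .config("spark.executor.cores", "8")
        .config("spark.task.cpus", "8")
        .getOrCreate()
    )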

Re: [EXTERNAL] RDD.pipe() for binary data

2022-07-16 Thread Andrew Melo
I'm curious about using shared memory to speed up the JVM->Python round trip. Is there any sane way to do anonymous shared memory in Java/scale? On Sat, Jul 16, 2022 at 16:10 Sebastian Piu wrote: > Other alternatives are to look at how PythonRDD does it in spark, you > could also try to go for

Re: How is union() implemented? Need to implement column bind

2022-04-20 Thread Andrew Davidson
at 5:34 PM To: Andrew Davidson Cc: Andrew Melo , Bjørn Jørgensen , "user @spark" Subject: Re: How is union() implemented? Need to implement column bind Wait, how is all that related to cbind -- very different from what's needed to insert. BigQuery is unrelated to MR or Spark. It

Re: How is union() implemented? Need to implement column bind

2022-04-20 Thread Andrew Davidson
. Kind regards Andy From: Sean Owen Date: Wednesday, April 20, 2022 at 2:31 PM To: Andrew Melo Cc: Andrew Davidson , Bjørn Jørgensen , "user @spark" Subject: Re: How is union() implemented? Need to implement column bind I don't think there's fundamental disapproval (it is implemented i

Re: How is union() implemented? Need to implement column bind

2022-04-20 Thread Andrew Melo
oin. I don't know if >>> there's a better option in a SQL engine, as SQL doesn't have anything to >>> offer except join and pivot either (? right?) >>> Certainly, the dominant data storage paradigm is wide tables, whereas >>> you're starting with effec

Re: How is union() implemented? Need to implement column bind

2022-04-20 Thread Andrew Davidson
it spark will work well for our need Kind regards Andy From: Sean Owen Date: Monday, April 18, 2022 at 6:58 PM To: Andrew Davidson Cc: "user @spark" Subject: Re: How is union() implemented? Need to implement column bind A join is the natural answer, but this is a 10114-way join, whic

How is union() implemented? Need to implement column bind

2022-04-18 Thread Andrew Davidson
Hi have a hard problem I have 10114 column vectors each in a separate file. The file has 2 columns, the row id, and numeric values. The row ids are identical and in sort order. All the column vectors have the same number of rows. There are over 5 million rows. I need to combine them into a
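
A minimal sketch of the join-based approach debated in this thread (file paths and column names are hypothetical); with roughly 10,000 inputs the resulting plan becomes extremely deep, which is the crux of the discussion:

    from functools import reduce
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    paths = ["/data/vec_0001.csv", "/data/vec_0002.csv"]  # ... thousands more

    # Each file: (row_id, value); rename the value column so it stays unique.
    dfs = [
        spark.read.csv(p, header=True).withColumnRenamed("value", "value_%d" % i)
        for i, p in enumerate(paths)
    ]

    # Equi-join everything on the shared, sorted row id.
    wide = reduce(lambda left, right: left.join(right, on="row_id"), dfs)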

Re: Grabbing the current MemoryManager in a plugin

2022-04-13 Thread Andrew Melo
Hello, Any wisdom on the question below? Thanks Andrew On Fri, Apr 8, 2022 at 16:04 Andrew Melo wrote: > Hello, > > I've implemented support for my DSv2 plugin to back its storage with > ArrowColumnVectors, which necessarily means using off-heap memory. Is > it possible to some

Re: cannot access class sun.nio.ch.DirectBuffer

2022-04-13 Thread Andrew Melo
Gotcha. Seeing as there's a lot of large projects who used the unsafe API either directly or indirectly (via netty, etc..) it's a bit surprising that it was so thoroughly closed off without an escape hatch, but I'm sure there was a lively discussion around it... Cheers Andrew On Wed, Apr 13

Re: cannot access class sun.nio.ch.DirectBuffer

2022-04-13 Thread Andrew Melo
Hi Sean, Out of curiosity, will Java 11+ always require special flags to access the unsafe direct memory interfaces, or is this something that will either be addressed by the spec (by making an "approved" interface) or by libraries (with some other workaround)? Thanks Andrew On T

Grabbing the current MemoryManager in a plugin

2022-04-08 Thread Andrew Melo
inadvertently OOM-ing the system? Thanks Andrew - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: pivoting panda dataframe

2022-03-15 Thread Andrew Davidson
: Tuesday, March 15, 2022 at 2:19 PM To: Andrew Davidson Cc: Mich Talebzadeh , "user @spark" Subject: Re: pivoting panda dataframe Hi Andrew. Mitch asked, and I answered transpose() https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.tran

Re: pivoting panda dataframe

2022-03-15 Thread Andrew Davidson
Hi Bjorn I have been looking for spark transform for a while. Can you send me a link to the pyspark function? I assume pandas transform is not really an option. I think it will try to pull the entire dataframe into the drivers memory. Kind regards Andy p.s. My real problem is that spark

Re: Does spark have something like rowsum() in R?

2022-02-09 Thread Andrew Davidson
at 8:19 AM To: Andrew Davidson Cc: "user @spark" Subject: Re: Does spark have something like rowsum() in R? It really depends on what is running out of memory. You can have all the workers in the world but if something is blowing up the driver, won't do anything. You can have a hu

Re: Does spark have something like rowsum() in R?

2022-02-09 Thread Andrew Davidson
took about 40 min. Spark would crash after about 12 hrs. column bind and row sums are common operations. Seem like there should be an easy solution? Maybe I should submit a RFE (request for enhancement) Kind regards Andy From: Sean Owen Date: Tuesday, February 8, 2022 at 8:57 AM To: Andrew

Does spark have something like rowsum() in R?

2022-02-08 Thread Andrew Davidson
As part of my data normalization process I need to calculate row sums. The following code works on smaller test data sets. It does not work on my big tables. When I run on a table with over 10,000 columns I get an OOM on a cluster with 2.8 TB. Is there a better way to implement this Kind
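
A minimal sketch of the single-expression pattern (the same reduce(add, ...) idiom that appears later in this thread), which avoids looping withColumn once per column; the toy frame stands in for the real 10,000-column table:

    from functools import reduce
    from operator import add
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, 2, 3), (4, 5, 6)], ["c1", "c2", "c3"])

    # One sum expression over all columns keeps the logical plan shallow.
    row_sums = df.na.fill(0).withColumn(
        "row_sum", reduce(add, [col(c) for c in df.columns])
    )
    row_sums.show()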

Does spark support something like the bind function in R?

2022-02-08 Thread Andrew Davidson
I need to create a single table by selecting one column from thousands of files. The columns are all of the same type, have the same number of rows and rows names. I am currently using join. I get OOM on mega-mem cluster with 2.8 TB. Does spark have something like cbind() “Take a sequence of

Re: What are your experiences using google cloud platform

2022-01-24 Thread Andrew Davidson
/a/54283997/4586180 retDF = countsSparkDF.na.fill( 0 ).withColumn( newColName , reduce( add, [col( x ) for x in columnNames] ) ) self.logger.warn( "rowSumsImpl END\n" ) return retDF From: Mich Talebzadeh Date: Monday, January 24, 2022 at 12:54 AM To: Andrew Dav

What are your experiences using google cloud platform

2022-01-23 Thread Andrew Davidson
Hi recently started using GCP dataproc spark. Seem to have trouble getting big jobs to complete. I am using check points. I am wondering if maybe I should look for another cloud solution Kind regards Andy

Re: How to configure log4j in pyspark to get log level, file name, and line number

2022-01-21 Thread Andrew Davidson
regards Andy From: Andrew Davidson Date: Thursday, January 20, 2022 at 2:32 PM To: "user @spark" Subject: How to configure log4j in pyspark to get log level, file name, and line number Hi When I use python logging for my unit test. I am able to control the output format. I get the

Is user@spark indexed by google?

2022-01-21 Thread Andrew Davidson
There is a ton of great info in this archive. I noticed when I do a google search it does not seem to find results from this source Kind regards Andy

How to configure log4j in pyspark to get log level, file name, and line number

2022-01-20 Thread Andrew Davidson
Hi When I use python logging for my unit test. I am able to control the output format. I get the log level, the file and line number, then the msg [INFO testEstimatedScalingFactors.py:166 - test_B_convertCountsToInts()] BEGIN In my spark driver I am able to get the log4j logger spark

java.lang.StackOverflow Error How to sum across rows in a data frame with a large number of columns

2022-01-20 Thread Andrew Davidson
Hi I have a dataframe of integers. It has 10409 columns. How can I sum across each row? I get a very long stack trace rowSums BEGIN 2022-01-20 22:11:24 ERROR __main__:? - An error occurred while calling o93935.withColumn. : java.lang.StackOverflowError at

Re: How to add a row number column with out reordering my data frame

2022-01-11 Thread Andrew Davidson
Thanks! I will take a look Andy From: Gourav Sengupta Date: Tuesday, January 11, 2022 at 8:42 AM To: Andrew Davidson Cc: Andrew Davidson , "user @spark" Subject: Re: How to add a row number column with out reordering my data frame Hi, I do not think we need to do any of that.

Re: How to add a row number column with out reordering my data frame

2022-01-11 Thread Andrew Davidson
functions.monotonically_increasing_id.html The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. Comments and suggestions appreciated Andy From: Gourav Sengupta Date: Monday, January 10, 2022 at 11:03 AM To: Andrew Davidson Cc: "user @spark" Su

How to add a row number column with out reordering my data frame

2022-01-06 Thread Andrew Davidson
Hi I am trying to work through a OOM error. I have 10411 files. I want to select a single column from each file and then join them into a single table. The files have a row unique id. However it is a very long string. The data file with just the name and column of interest is about 470 M. The
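
A minimal sketch of one way to get consecutive row numbers while preserving the existing order, using zipWithIndex (monotonically_increasing_id(), mentioned later in the thread, is increasing and unique but not consecutive); the toy frame is illustrative:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import LongType, StructField

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("a",), ("b",), ("c",)], ["value"])

    # zipWithIndex keeps partition order and assigns consecutive indices.
    indexed = spark.createDataFrame(
        df.rdd.zipWithIndex().map(lambda pair: pair[0] + (pair[1],)),
        df.schema.add(StructField("row_num", LongType(), False)),
    )
    indexed.show()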

Re: Newbie pyspark memory mgmt question

2022-01-05 Thread Andrew Davidson
Thanks Sean Andy From: Sean Owen Date: Wednesday, January 5, 2022 at 3:38 PM To: Andrew Davidson , Nicholas Gustafson Cc: "user @spark" Subject: Re: Newbie pyspark memory mgmt question There is no memory leak, no. You can .cache() or .persist() DataFrames, and that can use me

Newbie pyspark memory mgmt question

2022-01-05 Thread Andrew Davidson
Hi I am running into OOM problems. My cluster should be much bigger than I need. I wonder if it has to do with the way I am writing my code. Below are three style cases. I wonder if they cause memory to be leaked? Case 1 : df1 = spark.read.load( cvs file) df1 = df1.someTransform() df1 =
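
A short sketch of the point made in the reply: reassigning a variable to a new transformation does not leak anything, because a DataFrame is only a query plan; memory is pinned only by cache()/persist() and released with unpersist():

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = spark.range(1000000)
    df = df.withColumn("doubled", df["id"] * 2)  # just builds a new plan

    df.cache()
    df.count()      # materializes the cached blocks
    df.unpersist()  # frees them explicitly when no longer needed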

Joining many tables Re: Pyspark debugging best practices

2022-01-03 Thread Andrew Davidson
#rawCountsSDF.explain() self.logger.info( "END\n" ) return retNumReadsDF From: David Diebold Date: Monday, January 3, 2022 at 12:39 AM To: Andrew Davidson , "user @spark" Subject: Re: Pyspark debugging best practices Hello Andy, Are you sure you wa

Re: Pyspark debugging best practices

2021-12-30 Thread Andrew Davidson
: > Hi Andrew, > > Any chance you might give Databricks a try in GCP? > > The above transformations look complicated to me, why are you adding > dataframes to a list? > > > Regards, > Gourav Sengupta > > > > On Sun, Dec 26, 2021 at 7:00 PM Andrew Davidson

Pyspark debugging best practices

2021-12-26 Thread Andrew Davidson
Hi I am having trouble debugging my driver. It runs correctly on smaller data set but fails on large ones. It is very hard to figure out what the bug is. I suspect it may have something do with the way spark is installed and configured. I am using google cloud platform dataproc pyspark The

Pyspark garbage collection and cache management best practices

2021-12-26 Thread Andrew Davidson
Hi Below is typical pseudo code I find myself writing over and over again. There is only a single action at the very end of the program. The early narrow transformations potentially hold on to a lot of needless data. I have a for loop over join. (ie wide transformation). Followed by a bunch
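
For the loop-over-join pattern described here, one commonly suggested mitigation (a sketch under the assumption that lineage growth is the problem, not a confirmed fix from this thread) is to checkpoint periodically so the plan does not grow without bound:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")  # hypothetical path

    # Toy stand-ins for the many small frames joined in the loop.
    frames = [
        spark.range(10).withColumnRenamed("id", "row_id").withColumn("c%d" % i, F.lit(i))
        for i in range(40)
    ]

    result = frames[0]
    for i, f in enumerate(frames[1:], start=1):
        result = result.join(f, on="row_id")
        if i % 10 == 0:
            result = result.checkpoint()  # truncate lineage every few joins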

Re: About some Spark technical help

2021-12-24 Thread Andrew Davidson
à 11:58, Gourav Sengupta > a écrit : > >> Hi, >> >> out of sheer and utter curiosity, why JAVA? >> >> Regards, >> Gourav Sengupta >> >> On Thu, Dec 23, 2021 at 5:10 PM sam smith >> wrote: >> >>> Hi Andrew, >>>

OOM Joining thousands of dataframes Was: AnalysisException: Trouble using select() to append multiple columns

2021-12-24 Thread Andrew Davidson
te: Friday, December 24, 2021 at 8:30 AM To: Gourav Sengupta Cc: Andrew Davidson , Nicholas Gustafson , User Subject: Re: AnalysisException: Trouble using select() to append multiple columns (that's not the situation below we are commenting on) On Fri, Dec 24, 2021, 9:28 AM Gourav Sengupta mail

Re: About some Spark technical help

2021-12-23 Thread Andrew Davidson
Hi Sam Can you tell us more? What is the algorithm? Can you send us the URL the publication Kind regards Andy From: sam smith Date: Wednesday, December 22, 2021 at 10:59 AM To: "user@spark.apache.org" Subject: About some Spark technical help Hello guys, I am replicating a paper's

Re: ??? INFO CreateViewCommand:57 - Try to uncache `rawCounts` before replacing.

2021-12-21 Thread Andrew Davidson
ew with name "rawCounts", spark3 would uncache the > previous "rawCounts". > > Correct me if I'm wrong. > > Regards > > > On Tue, Dec 21, 2021 at 10:05 PM Andrew Davidson > wrote: > >> Happy Holidays >> >> >> >>

??? INFO CreateViewCommand:57 - Try to uncache `rawCounts` before replacing.

2021-12-20 Thread Andrew Davidson
Happy Holidays I am a newbie I have 16,000 data files, all files have the same number of rows and columns. The row ids are identical and are in the same order. I want to create a new data frame that contains the 3rd column from each data file. My pyspark script runs correctly when I test on

Re: AnalysisException: Trouble using select() to append multiple columns

2021-12-18 Thread Andrew Davidson
Thanks Nicholas Andy From: Nicholas Gustafson Date: Friday, December 17, 2021 at 6:12 PM To: Andrew Davidson Cc: "user@spark.apache.org" Subject: Re: AnalysisException: Trouble using select() to append multiple columns Since df1 and df2 are different DataFrames, you will need to

AnalysisException: Trouble using select() to append multiple columns

2021-12-17 Thread Andrew Davidson
Hi I am a newbie I have 16,000 data files, all files have the same number of rows and columns. The row ids are identical and are in the same order. I want to create a new data frame that contains the 3rd column from each data file I wrote a test program that uses a for loop and Join. It works

Re: Merge two dataframes

2021-05-17 Thread Andrew Melo
ch. A big goal of mine is to make it so that what was changed is recomputed, and no more, which will speed up the rate at which we can find new physics. Cheers Andrew > > On 5/17/21, 2:56 PM, "Andrew Melo" wrote: > > CAUTION: This email originated from outside of t

Re: Merge two dataframes

2021-05-17 Thread Andrew Melo
explicitly compute them themselves. Cheers Andrew On Mon, May 17, 2021 at 1:10 PM Sean Owen wrote: > > Why join here - just add two columns to the DataFrame directly? > > On Mon, May 17, 2021 at 1:04 PM Andrew Melo wrote: >> >> Anyone have ideas about the below Q? >> &g

Re: Merge two dataframes

2021-05-17 Thread Andrew Melo
t merge join Cheers Andrew On Wed, May 12, 2021 at 11:32 AM Andrew Melo wrote: > > Hi, > > In the case where the left and right hand side share a common parent like: > > df = spark.read.someDataframe().withColumn('rownum', row_number()) > df1 = df.withColumn('c1', expensive_

Re: Merge two dataframes

2021-05-12 Thread Andrew Melo
? Thanks Andrew On Wed, May 12, 2021 at 11:07 AM Sean Owen wrote: > > Yeah I don't think that's going to work - you aren't guaranteed to get 1, 2, > 3, etc. I think row_number() might be what you need to generate a join ID. > > RDD has a .zip method, but (unless I'm forgetting!)

unsubscribe

2021-01-24 Thread Andrew Milkowski

Re: [Pyspark 3 Debug] Date values reset to Unix epoch

2020-09-24 Thread Andrew Mullins
t_date " "SELECT * FROM test_date" ) print("Temp Table:") print(spark.table("test_date").collect()) print("Final Table:") print(spark.table("test_db.test_date").collect()) Output: Temp Table: [Row(id='1234', date=dateti

[Pyspark 3 Debug] Date values reset to Unix epoch

2020-09-24 Thread Andrew Mullins
. Does anyone have any ideas what might be causing this? Best, Andrew Mullins -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: Spark DataFrame Creation

2020-07-22 Thread Andrew Melo
e Spark worker. Why would some elements be executing on the > driver? Looking at the code, it appears that your sftp plugin downloads the file to a local location and opens from there. https://github.com/springml/spark-sftp/blob/090917547001574afa93cddaf2a022151a3f4260/src/main/scala/com/springml/s

PySpark aggregation w/pandas_udf

2020-07-15 Thread Andrew Melo
ion ID, then do the groupBy against that column to make smaller groups, but that's not great -- it's a lot of extra data movement. Am I missing something obvious? Or is this simply a part of the API that's not fleshed out yet? Thanks Andrew * Unfortunately, Pandas' data model is less rich than spa
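
A sketch of the per-partition workaround described here, assuming the Spark 3 type-hint style of pandas_udf: aggregate within each partition first, then combine the small partial results.

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1000000).withColumn("x", F.rand())

    @F.pandas_udf(DoubleType())
    def partial_sum(x: pd.Series) -> float:  # grouped-aggregate pandas UDF
        return float(x.sum())

    # Group by the physical partition id to keep groups small, then combine.
    partials = (
        df.groupBy(F.spark_partition_id().alias("pid"))
          .agg(partial_sum("x").alias("partial"))
    )
    total = partials.agg(F.sum("partial")).collect()[0][0]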

Re: REST Structured Steaming Sink

2020-07-01 Thread Andrew Melo
On Wed, Jul 1, 2020 at 8:13 PM Burak Yavuz wrote: > > I'm not sure having a built-in sink that allows you to DDOS servers is the > best idea either. foreachWriter is typically used for such use cases, not > foreachBatch. It's also pretty hard to guarantee exactly-once, rate limiting, > etc.

Re: Can I collect Dataset[Row] to driver without converting it to Array [Row]?

2020-04-22 Thread Andrew Melo
s, not so great if pandas isn't what you operate on. The JIRA above would let you receive the arrow buffers (that already exist) directly. Cheers, Andrew [1] https://issues.apache.org/jira/browse/SPARK-30153 [2] https://github.com/apache/spark/pull/26783 > > I tried to use toLocalI
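
For reference, a minimal sketch of the toLocalIterator() route mentioned here, which streams partitions to the driver one at a time instead of materializing the whole result as collect() does:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1000000)

    # Only roughly one partition is resident on the driver at a time.
    for row in df.toLocalIterator():
        pass  # process each Row incrementally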

Re: Scala version compatibility

2020-04-06 Thread Andrew Melo
not included in my artifact except for this single callsite. Thanks Andrew > On Mon, Apr 6, 2020 at 4:16 PM Andrew Melo wrote: > >> >> >> On Mon, Apr 6, 2020 at 3:08 PM Koert Kuipers wrote: >> >>> yes it will >>> >>> >> Ooof, I

Re: Scala version compatibility

2020-04-06 Thread Andrew Melo
) Thanks for your help, Andrew > On Mon, Apr 6, 2020 at 3:50 PM Andrew Melo wrote: > >> Hello all, >> >> I'm aware that Scala is not binary compatible between revisions. I have >> some Java code whose only Scala dependency is the transitive dependency >> thr

Scala version compatibility

2020-04-06 Thread Andrew Melo
! Andrew

Optimizing LIMIT in DSv2

2020-03-30 Thread Andrew Melo
artition which holds the remainder of rows in a file). Is there some sort of hinting that can done from the datasource side to better inform the optimizer or, alternately, am I missing an interface in the PushDown filters that would let me elide transferring/decompressing unnecessary partitions?

Supporting Kryo registration in DSv2

2020-03-26 Thread Andrew Melo
add the registration manually either. Any thoughts? Thanks Andrew

Re: Reading 7z file in spark

2020-01-14 Thread Andrew Melo
It only makes sense if the underlying file is also splittable, and even then, it doesn't really do anything for you if you don't explicitly tell spark about the split boundaries On Tue, Jan 14, 2020 at 7:36 PM Someshwar Kale wrote: > I would suggest to use other compression technique which is

Reading Dataset from DB2 over JDBC

2020-01-14 Thread Andrew A
rts ";N*.N*" like "my column name;N*.N*" (column name contains whitespaces). How can i prevent this kind of error? Thank you. Regards, Andrew. - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: Recover RFormula Column Names

2019-10-29 Thread Andrew Redd
Thanks Alessandro! That did the trick. I all of the indices and interactions are in the metadata. I also wanted to confirm that this solution works in pyspark as the metadata is carried over. Andrew On Tue, Oct 29, 2019 at 5:26 AM Alessandro Solimando < alessandro.solima...@gmail.com>

Fwd: Recover RFormula Column Names

2019-10-28 Thread Andrew Redd
back post regression to map to coefficient values? Do I need to basically rebuild the RFormula logic in if this isn't already implemented? Would be happy to use a different Spark language (Scala/Java etc. ) if implemented there. Thanks in advance Andrew rform = RFormula(formula="log_ou

Re: Driver vs master

2019-10-07 Thread Andrew Melo
ot;local" master and "client mode" then yes tasks execute in the same JVM as the driver". The answer depends on the exact setup Amit has and how the application is configured > HTH... > > Ayan > > > > On Tue, Oct 8, 2019 at 12:11 PM Andrew Melo wrote: > &

Re: Driver vs master

2019-10-07 Thread Andrew Melo
Hi On Mon, Oct 7, 2019 at 19:20 Amit Sharma wrote: > Thanks Andrew but I am asking specific to driver memory not about > executors memory. We have just one master and if each jobs driver.memory=4g > and master nodes total memory is 16gb then we can not execute more than 4 > jobs at

Re: Driver vs master

2019-10-07 Thread Andrew Melo
Hi Amit On Mon, Oct 7, 2019 at 18:33 Amit Sharma wrote: > Can you please help me understand this. I believe driver programs runs on > master node If we are running 4 spark job and driver memory config is 4g then total 16 > 6b would be used of master node. This depends on what master/deploy

Re: Custom aggregations: modular and lightweight solutions?

2019-08-13 Thread Andrew Leverentz
t this example working (for arbitrary values of rowSize), I suspect that it would also give me a solution to the custom-aggregation issue I outlined in my previous email. Any suggestions would be much appreciated. Thanks, ~ Andrew On Mon, Aug 12, 2019 at 5:31 PM Andrew Leverentz < andrew

Custom aggregations: modular and lightweight solutions?

2019-08-12 Thread Andrew Leverentz
runtime error messages, respectively: - "TestCase3 is not a valid external type for schema of struct" - "scala.Some is not a valid external type for schema of string" Would anyone be able to help me diagnose the runtime errors with approach (2), or to suggest a better alternative? Thanks, ~ Andrew

Re: Anaconda installation with Pyspark/Pyarrow (2.3.0+) on cloudera managed server

2019-05-06 Thread Andrew Melo
Hi, On Mon, May 6, 2019 at 11:59 AM Gourav Sengupta wrote: > > Hence, what I mentioned initially does sound correct ? I don't agree at all - we've had a significant boost from moving to regular UDFs to pandas UDFs. YMMV, of course. > > On Mon, May 6, 2019 at 5:43 PM Andrew

Re: Anaconda installation with Pyspark/Pyarrow (2.3.0+) on cloudera managed server

2019-05-06 Thread Andrew Melo
y in Spark should also add some good performance in the future. Cheers Andrew > > On Mon, May 6, 2019 at 11:13 AM Gourav Sengupta > wrote: >> >> The proof is in the pudding >> >> :) >> >> >> >> On Mon, May 6, 2019 at 2:46 PM Gourav Sengup

Re: can't download 2.4.1 sourcecode

2019-04-22 Thread Andrew Melo
On Mon, Apr 22, 2019 at 10:54 PM yutaochina wrote: > > >

Re: Connecting to Spark cluster remotely

2019-04-22 Thread Andrew Melo
Hi Rishkesh On Mon, Apr 22, 2019 at 4:26 PM Rishikesh Gawade wrote: > > To put it simply, what are the configurations that need to be done on the > client machine so that it can run driver on itself and executors on > spark-yarn cluster nodes? TBH, if it were me, I would simply SSH to the

Re: Where does the Driver run?

2019-03-25 Thread Andrew Melo
driver, you need to package it up and send it to the cluster manager. You can't start spark one place and then later migrate it to the cluster. It's also why you can't use spark-shell in cluster mode either, I think. Cheers Andrew On Mon, Mar 25, 2019 at 11:22 AM Pat Ferrel wrote: > In the

Re: Where does the Driver run?

2019-03-24 Thread Andrew Melo
wonder if > the Driver only runs in the cluster if we use spark-submit > Where/how are you starting "./sbin/start-master.sh"? Cheers Andrew > > > > From: Akhil Das > Reply: Akhil Das > Date: March 23, 2019 at 9:26:50 PM > To: Pat Ferrel &g

Seemingly wasteful memory duplication in LDAModel getTopicDistributionMethod()

2019-02-22 Thread Andrew Mathis
, Andrew

Please stop asking to unsubscribe

2019-01-31 Thread Andrew Melo
The correct way to unsubscribe is to mail user-unsubscr...@spark.apache.org Just mailing the list with "unsubscribe" doesn't actually do anything... Thanks Andrew - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

unsubscribe

2019-01-30 Thread Andrew Milkowski
unsubscribe

Re: What are the alternatives to nested DataFrames?

2018-12-28 Thread Andrew Melo
Could you join() the DFs on a common key? On Fri, Dec 28, 2018 at 18:35 wrote: > Shabad , I am not sure what you are trying to say. Could you please give > me an example? The result of the Query is a Dataframe that is created after > iterating, so I am not sure how could I map that to a column

Dataset experimental interfaces

2018-12-18 Thread Andrew Old
We are running Spark 2.2.0 in a hadoop cluster and I worked on a proof of concept to read event based data into Spark Datasets and operating over those sets to calculate differences between the event data. More specifically, ordered position data with odometer values and wanting to calculate the

Questions about caching

2018-12-11 Thread Andrew Melo
er to manually cache their dataframes by doing save/loads to external files using "alluxio://" URIs. Is there no way around this behavior now? Sorry for the long email, and thanks! Andrew - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: Palantir replease under org.apache.spark?

2018-01-09 Thread Andrew Ash
That source repo is at https://github.com/palantir/spark/ with artifacts published to Palantir's bintray at https://palantir.bintray.com/releases/org/apache/spark/ If you're seeing any of them in Maven Central please flag, as that's a mistake! Andrew On Tue, Jan 9, 2018 at 10:10 AM, Sean Owen

Re: json_tuple fails to parse string with emoji

2017-01-26 Thread Andrew Ehrlich
It looks like I'm hitting this bug in jackson-core 2.2.3 which is included in the version of CDH I'm on: https://github.com/FasterXML/jackson-core/issues/115 Jackson-core 2.3.0 has the fix. On Tue, Jan 24, 2017 at 5:14 PM, Andrew Ehrlich <and...@aehrlich.com> wrote: > On Spark 1.6.0

Re: freeing up memory occupied by processed Stream Blocks

2017-01-25 Thread Andrew Milkowski
/spark/streaming/dstream/DStream.scala#L463). > > You can control this behaviour by StreamingContext#remember to some extent. > > // maropu > > > On Fri, Jan 20, 2017 at 3:17 AM, Andrew Milkowski <amgm2...@gmail.com> > wrote: > >> hello >> >> using

json_tuple fails to parse string with emoji

2017-01-24 Thread Andrew Ehrlich
On Spark 1.6.0, calling json_tuple() with an emoji character in one of the values returns nulls: Input: """ "myJsonBody": { "field1": "" } """ Query: """ ... LATERAL VIEW JSON_TUPLE(e.myJsonBody,'field1') k AS field1, ... """ This looks like a platform-dependent issue; the parsing

freeing up memory occupied by processed Stream Blocks

2017-01-19 Thread Andrew Milkowski
hello using spark 2.0.2 and while running sample streaming app with kinesis noticed (in admin ui Storage tab) "Stream Blocks" for each worker keeps climbing up then also (on same ui page) in Blocks section I see blocks such as below input-0-1484753367056 that are marked as Memory Serialized

Re: Spark ANSI SQL Support

2017-01-17 Thread Andrew Ash
Rishabh, Have you come across any ANSI SQL queries that Spark SQL didn't support? I'd be interested to hear if you have. Andrew On Tue, Jan 17, 2017 at 8:14 PM, Deepak Sharma <deepakmc...@gmail.com> wrote: > From spark documentation page: > Spark SQL can now run all 99 TP

Re: Running Spark on EMR

2017-01-15 Thread Andrew Holway
use yarn :) "spark-submit --master yarn" On Sun, Jan 15, 2017 at 7:55 PM, Darren Govoni <dar...@ontrenet.com> wrote: > So what was the answer? > > > > Sent from my Verizon, Samsung Galaxy smartphone > > ---- Original message > From: Andre

Re: Running Spark on EMR

2017-01-15 Thread Andrew Holway
Darn. I didn't respond to the list. Sorry. On Sun, Jan 15, 2017 at 5:29 PM, Marco Mistroni wrote: > thanks Neil. I followed original suggestion from Andrw and everything is > working fine now > kr > > On Sun, Jan 15, 2017 at 4:27 PM, Neil Jonkers

python environments with "local" and "yarn-client" - Boto failing on HDP2.5

2016-11-29 Thread Andrew Holway
e on Mesos and EMR. Does HDP PySpark2 run within a virtualenv or something? Cheers, Andrew -- Otter Networks UG http://otternetworks.de Gotenstraße 17 10829 Berlin

Re: createDataFrame causing a strange error.

2016-11-29 Thread Andrew Holway
Hi Marco, I was not able to find out what was causing the problem but a "git stash" seems to have fixed it :/ Thanks for your help... :) On Mon, Nov 28, 2016 at 10:50 PM, Marco Mistroni <mmistr...@gmail.com> wrote: > Hi Andrew, > sorry but to me it seems s3 is th

Re: createDataFrame causing a strange error.

2016-11-28 Thread Andrew Holway
; extra complexity which you dont need > > If you send a snippet ofyour json content, then everyone on the list can > run the code and try to reproduce > > > hth > > Marco > > > On 27 Nov 2016 7:33 pm, "Andrew Holway" <andrew.hol...@otternetworks.de

Re: createDataFrame causing a strange error.

2016-11-27 Thread Andrew Holway
.protocol.Py4JError: An error occurred while calling o33.__getnewargs__. Trace: py4j.Py4JException: Method __getnewargs__([]) does not exist at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318) at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326) at py4j.Ga

createDataFrame causing a strange error.

2016-11-27 Thread Andrew Holway
Hi, Can anyone tell me what is causing this error Spark 2.0.0 Python 2.7.5 df = sqlContext.createDataFrame(foo, schema) https://gist.github.com/mooperd/368e3453c29694c8b2c038d6b7b4413a Traceback (most recent call last): File "/home/centos/fun-functions/spark-parrallel-read-from-s3/tick.py",

javac - No such file or directory

2016-11-09 Thread Andrew Holway
park"): error=2, No such file or directory Any ideas? Thanks, Andrew Full output: [success] created output: /home/spark/spark/external/kafka-0-10/target [info] Compiling 2 Scala sources and 8 Java sources to /home/spark/spark/common/tags/target/scala-2.11/classes... java.io.IOException: Cannot r
