Re: Spark writing API

2023-08-16 Thread Andrew Melo
like with arrow's off-heap storage), it's crazy inefficient to try and do the equivalent of realloc() to grow the buffer size. Thanks Andrew > On Mon, Aug 7, 2023 at 8:27 PM Steve Loughran > wrote: > >> >> >> On Thu, 1 Jun 2023 at 00:58, Andrew Melo wrote: >>

Re: Spark writing API

2023-08-02 Thread Andrew Melo
Hello Spark Devs Could anyone help me with this? Thanks, Andrew On Wed, May 31, 2023 at 20:57 Andrew Melo wrote: > Hi all > > I've been developing for some time a Spark DSv2 plugin "Laurelin" ( > https://github.com/spark-root/laurelin > ) to read the ROOT (https

Spark writing API

2023-05-31 Thread Andrew Melo
Hi all I've been developing for some time a Spark DSv2 plugin "Laurelin" ( https://github.com/spark-root/laurelin ) to read the ROOT (https://root.cern) file format (which is used in high energy physics). I've recently presented my work in a conference (

Re: [Java] Constructing a list from ArrowBufs

2023-04-27 Thread Andrew Melo
dFieldBuffers(innerNode, Arrays.asList(null, contentBuf)); On Thu, Apr 27, 2023 at 2:20 PM Andrew Melo wrote: > > Hi all, > > I am working on a Spark datasource plugin that reads a (custom) file > format and outputs arrow-backed columns. I'm having difficulty > figuring out how

[Java] Constructing a list from ArrowBufs

2023-04-27 Thread Andrew Melo
Hi all, I am working on a Spark datasource plugin that reads a (custom) file format and outputs arrow-backed columns. I'm having difficulty figuring out how to construct a ListVector if I have an ArrowBuf with the contents and know the width of each list. I've tried constructing the buffer with

Re: Spark on Kube (virtua) coffee/tea/pop times

2023-02-07 Thread Andrew Melo
I'm Central US time (AKA UTC -6:00) On Tue, Feb 7, 2023 at 5:32 PM Holden Karau wrote: > > Awesome, I guess I should have asked folks for timezones that they’re in. > > On Tue, Feb 7, 2023 at 3:30 PM Andrew Melo wrote: >> >> Hello Holden, >> >> We are inter

Re: Spark on Kube (virtua) coffee/tea/pop times

2023-02-07 Thread Andrew Melo
Hello Holden, We are interested in Spark on k8s and would like the opportunity to speak with devs about what we're looking for slash better ways to use spark. Thanks! Andrew On Tue, Feb 7, 2023 at 5:24 PM Holden Karau wrote: > > Hi Folks, > > It seems like we could maybe use some additional

Re: [Spark Sql] Global Setting for Case-Insensitive String Compare

2022-11-21 Thread Andrew Melo
I think this is the right place, just a hard question :) As far as I know, there's no "case insensitive flag", so YMMV On Mon, Nov 21, 2022 at 5:40 PM Patrick Tucci wrote: > > Is this the wrong list for this type of question? > > On 2022/11/12 16:34:48 Patrick Tucci wrote: > > Hello, > > > >

Re: Profiling PySpark Pandas UDF

2022-08-25 Thread Andrew Melo
Hi Gourav, Since Koalas needs the same round-trip to/from JVM and Python, I expect that the performance should be nearly the same for UDFs in either API Cheers Andrew On Thu, Aug 25, 2022 at 11:22 AM Gourav Sengupta wrote: > > Hi, > > May be I am jumping to conclusions and making stupid

PySpark cores

2022-07-28 Thread Andrew Melo
Hello, Is there a way to tell Spark that PySpark (arrow) functions use multiple cores? If we have an executor with 8 cores, we would like to have a single PySpark function use all 8 cores instead of having 8 single core python functions run. Thanks! Andrew

Re: [EXTERNAL] RDD.pipe() for binary data

2022-07-16 Thread Andrew Melo
I'm curious about using shared memory to speed up the JVM->Python round trip. Is there any sane way to do anonymous shared memory in Java/Scala? On Sat, Jul 16, 2022 at 16:10 Sebastian Piu wrote: > Other alternatives are to look at how PythonRDD does it in spark, you > could also try to go for
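On the Python half of that round trip, the standard library does offer named (not anonymous) shared memory since Python 3.8; the segment name and payload below are invented for illustration, and the JVM side would still need something like a memory-mapped file or JNI to attach:

```python
from multiprocessing import shared_memory

def write_segment(name: str, payload: bytes) -> shared_memory.SharedMemory:
    # "Writer" side: create a named segment and copy the payload in.
    shm = shared_memory.SharedMemory(name=name, create=True, size=len(payload))
    shm.buf[: len(payload)] = payload
    return shm

def read_segment(name: str, size: int) -> bytes:
    # "Reader" side: attach to an existing segment by name only -- no copy
    # of the data itself crosses a socket, which is the point of the question.
    shm = shared_memory.SharedMemory(name=name)
    data = bytes(shm.buf[:size])
    shm.close()
    return data

writer = write_segment("pipe_demo_seg", b"row-bytes")
assert read_segment("pipe_demo_seg", 9) == b"row-bytes"
writer.close()
writer.unlink()  # release the OS resource
```

Since the segment is named rather than anonymous, both processes only need to agree on the name, which sidesteps the fd-passing that true anonymous shared memory would require.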

Re: How is union() implemented? Need to implement column bind

2022-04-20 Thread Andrew Melo
It would certainly be useful for our domain to have some sort of native cbind(). Is there a fundamental disapproval of adding that functionality, or is it just a matter of nobody implementing it? On Wed, Apr 20, 2022 at 16:28 Sean Owen wrote: > Good lead, pandas on Spark concat() is worth

Re: Grabbing the current MemoryManager in a plugin

2022-04-13 Thread Andrew Melo
Hello, Any wisdom on the question below? Thanks Andrew On Fri, Apr 8, 2022 at 16:04 Andrew Melo wrote: > Hello, > > I've implemented support for my DSv2 plugin to back its storage with > ArrowColumnVectors, which necessarily means using off-heap memory. Is > it possible to some

Re: cannot access class sun.nio.ch.DirectBuffer

2022-04-13 Thread Andrew Melo
. > > On Wed, Apr 13, 2022 at 9:05 AM Andrew Melo wrote: > >> Hi Sean, >> >> Out of curiosity, will Java 11+ always require special flags to access >> the unsafe direct memory interfaces, or is this something that will either >> be addressed by the spec (by

Re: cannot access class sun.nio.ch.DirectBuffer

2022-04-13 Thread Andrew Melo
Hi Sean, Out of curiosity, will Java 11+ always require special flags to access the unsafe direct memory interfaces, or is this something that will either be addressed by the spec (by making an "approved" interface) or by libraries (with some other workaround)? Thanks Andrew On Tue, Apr 12,

Grabbing the current MemoryManager in a plugin

2022-04-08 Thread Andrew Melo
Hello, I've implemented support for my DSv2 plugin to back its storage with ArrowColumnVectors, which necessarily means using off-heap memory. Is it possible to somehow grab either a reference to the current MemoryManager so that the off-heap memory usage is properly accounted for and to prevent

ArrowBuf.nioBuffer() reference counting

2022-04-06 Thread Andrew Melo
Hello, When using (on Java) ArrowBuf.nioBuffer(), does care need to be taken so that the underlying ArrowBuf doesn't go out of scope? Or does it increment the reference count somewhere behind the scenes? Thanks Andrew

Re: Apache Spark 3.3 Release

2022-03-16 Thread Andrew Melo
Hello, I've been trying for a bit to get the following two PRs merged and into a release, and I'm having some difficulty moving them forward: https://github.com/apache/spark/pull/34903 - This passes the current python interpreter to spark-env.sh to allow some currently-unavailable customization

Re: Time to start publishing Spark Docker Images?

2021-08-17 Thread Andrew Melo
HTH Andrew On Tue, Aug 17, 2021 at 2:29 PM Mich Talebzadeh wrote: > Hi Andrew, > > Can you please elaborate on blowing pip cache before committing the layer? > > Thanks, > > Mich > > On Tue, 17 Aug 2021 at 16:57, Andrew Melo wrote: > >> Silly Q, did

Re: Time to start publishing Spark Docker Images?

2021-08-17 Thread Andrew Melo
Silly Q, did you blow away the pip cache before committing the layer? That always trips me up. Cheers Andrew On Tue, Aug 17, 2021 at 10:56 Mich Talebzadeh wrote: > With no additional python packages etc we get 1.4GB compared to 2.19GB > before > > REPOSITORY TAG
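For context, "blowing away the pip cache before committing the layer" usually means either installing with the cache disabled or deleting it inside the same RUN step; a hedged Dockerfile sketch (base image and packages are placeholders, not the actual Spark image):

```dockerfile
FROM python:3.9-slim

# Option 1: never write pip's download/wheel cache into the layer at all.
RUN pip install --no-cache-dir pyspark

# Option 2: if a step already populated the cache, delete it in the SAME
# RUN command, so no committed layer ever contains it:
# RUN pip install pyspark && rm -rf /root/.cache/pip
```

A cleanup in a *later* RUN does not help, since the earlier layer still carries the cached files.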

Re: WholeStageCodeGen + DSv2

2021-05-19 Thread Andrew Melo
reproduce the issue you described? >> >> Bests, >> Takeshi >> >> On Wed, May 19, 2021 at 11:38 AM Andrew Melo wrote: >>> >>> Hello, >>> >>> When reading a very wide (> 1000 cols) input, WholeStageCodeGen blows >>> past

WholeStageCodeGen + DSv2

2021-05-18 Thread Andrew Melo
Hello, When reading a very wide (> 1000 cols) input, WholeStageCodeGen blows past the 64kB source limit and fails. Looking at the generated code, a big part of the code is simply the DSv2 convention that the codegen'd variable names are the same as the columns instead of something more compact

Secrets store for DSv2

2021-05-18 Thread Andrew Melo
Hello, When implementing a DSv2 datasource, where is an appropriate place to store/transmit secrets from the driver to the executors? Is there built-in spark functionality for that, or is my best bet to stash it as a member variable in one of the classes that gets sent to the executors? Thanks!

Re: Merge two dataframes

2021-05-17 Thread Andrew Melo
ch. A big goal of mine is to make it so that what was changed is recomputed, and no more, which will speed up the rate at which we can find new physics. Cheers Andrew > > On 5/17/21, 2:56 PM, "Andrew Melo" wrote: > > CAUTION: This email originated from outside of t

Re: Merge two dataframes

2021-05-17 Thread Andrew Melo
explicitly compute them themselves. Cheers Andrew On Mon, May 17, 2021 at 1:10 PM Sean Owen wrote: > > Why join here - just add two columns to the DataFrame directly? > > On Mon, May 17, 2021 at 1:04 PM Andrew Melo wrote: >> >> Anyone have ideas about the below Q? >> &g

Re: Merge two dataframes

2021-05-17 Thread Andrew Melo
t merge join Cheers Andrew On Wed, May 12, 2021 at 11:32 AM Andrew Melo wrote: > > Hi, > > In the case where the left and right hand side share a common parent like: > > df = spark.read.someDataframe().withColumn('rownum', row_number()) > df1 = df.withColumn('c1', expensive_

Re: Merge two dataframes

2021-05-12 Thread Andrew Melo
Hi, In the case where the left and right hand side share a common parent like: df = spark.read.someDataframe().withColumn('rownum', row_number()) df1 = df.withColumn('c1', expensive_udf1('foo')).select('c1', 'rownum') df2 = df.withColumn('c2', expensive_udf2('bar')).select('c2', 'rownum')

Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-16 Thread Andrew Melo
Hi, Integrating Koalas with pyspark might help enable a richer integration between the two. Something that would be useful with a tighter integration is support for custom column array types. Currently, Spark takes dataframes, converts them to arrow buffers then transmits them over the socket to

Re: Java dataframe library for arrow suggestions

2021-03-16 Thread Andrew Melo
I can't speak to how complete it is, but I looked earlier for something similar and ran across https://github.com/deeplearning4j/nd4j .. it's probably not an exact fit, but it does appear to be able to consume arrow buffers and expose them to java. Cheers Andrew On Tue, Mar 16, 2021 at 6:36 PM

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-16 Thread Andrew Melo
Hello Ryan, This proposal looks very interesting. Would future goals for this functionality include both support for aggregation functions, as well as support for processing ColumnBatch-es (instead of Row/InternalRow)? Thanks Andrew On Mon, Feb 15, 2021 at 12:44 PM Ryan Blue wrote: > > Thanks

Re: Spark DataFrame Creation

2020-07-22 Thread Andrew Melo
Hi Mark, On Wed, Jul 22, 2020 at 4:49 PM Mark Bidewell wrote: > > Sorry if this is the wrong place for this. I am trying to debug an issue > with this library: > https://github.com/springml/spark-sftp > > When I attempt to create a dataframe: > > spark.read. >

PySpark aggregation w/pandas_udf

2020-07-15 Thread Andrew Melo
Hi all, For our use case, we would like to perform an aggregation using a pandas_udf with dataframes that have O(100m) rows and a few 10s of columns. Conceptually, this looks a bit like pyspark.RDD.aggregate, where the user provides: * A "seqOp" which accepts pandas series(*) and outputs an
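The seqOp/combOp contract referenced here is the one from RDD.aggregate: seqOp folds values into a per-partition accumulator, combOp merges accumulators. A plain-Python sketch of those semantics (no Spark; a counting dict stands in for the histogram accumulator, and all names are invented for illustration):

```python
from functools import reduce

def aggregate(partitions, zero, seq_op, comb_op):
    # Fold each partition down with seq_op, then merge the partials.
    partials = [reduce(seq_op, part, zero) for part in partitions]
    return reduce(comb_op, partials, zero)

def seq_op(acc, value):
    # Accumulate one value into a copy of the running histogram.
    acc = dict(acc)
    acc[value] = acc.get(value, 0) + 1
    return acc

def comb_op(a, b):
    # Merge two partial histograms bucket-by-bucket.
    merged = dict(a)
    for key, count in b.items():
        merged[key] = merged.get(key, 0) + count
    return merged

hist = aggregate([[1, 2, 2], [2, 3]], {}, seq_op, comb_op)
assert hist == {1: 1, 2: 3, 3: 1}
```

The pandas_udf version would receive whole series per partition instead of single values, but the two-operator shape is the same.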

Re: REST Structured Steaming Sink

2020-07-01 Thread Andrew Melo
On Wed, Jul 1, 2020 at 8:13 PM Burak Yavuz wrote: > > I'm not sure having a built-in sink that allows you to DDOS servers is the > best idea either. foreachWriter is typically used for such use cases, not > foreachBatch. It's also pretty hard to guarantee exactly-once, rate limiting, > etc.

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-24 Thread Andrew Melo
Hello, On Wed, Jun 24, 2020 at 2:13 PM Holden Karau wrote: > > So I thought our theory for the pypi packages was it was for local > developers, they really shouldn't care about the Hadoop version. If you're > running on a production cluster you ideally pip install from the same release >

Re: Can I collect Dataset[Row] to driver without converting it to Array [Row]?

2020-04-22 Thread Andrew Melo
Hi Maqy On Wed, Apr 22, 2020 at 3:24 AM maqy <454618...@qq.com> wrote: > > I will traverse this Dataset to convert it to Arrow and send it to Tensorflow > through Socket. (I presume you're using the python tensorflow API, if you're not, please ignore) There is a JIRA/PR ([1] [2]) which

Re: DSv2 & DataSourceRegister

2020-04-16 Thread Andrew Melo
Hi again, Does anyone have thoughts on either the idea or the implementation? Thanks, Andrew On Thu, Apr 9, 2020 at 11:32 PM Andrew Melo wrote: > > Hi all, > > I've opened a WIP PR here https://github.com/apache/spark/pull/28159 > I'm a novice at Scala, so I'm sure the code

Re: DSv2 & DataSourceRegister

2020-04-09 Thread Andrew Melo
Thanks again, Andrew On Wed, Apr 8, 2020 at 10:27 AM Andrew Melo wrote: > > On Wed, Apr 8, 2020 at 8:35 AM Wenchen Fan wrote: > > > > It would be good to support your use case, but I'm not sure how to > > accomplish that. Can you open a PR so that we can di

Re: DSv2 & DataSourceRegister

2020-04-08 Thread Andrew Melo
On Wed, Apr 8, 2020 at 1:12 PM Andrew Melo wrote: >> >> Hello >> >> On Tue, Apr 7, 2020 at 23:16 Wenchen Fan wrote: >>> >>> Are you going to provide a single artifact for Spark 2.4 and 3.0? I'm not >>> sure this is possible as the DS V2 AP

Re: DSv2 & DataSourceRegister

2020-04-07 Thread Andrew Melo
n from META-INF and pass in the full class name to the DataFrameReader. Thanks Andrew > On Wed, Apr 8, 2020 at 6:58 AM Andrew Melo wrote: > >> Hi Ryan, >> >> On Tue, Apr 7, 2020 at 5:21 PM Ryan Blue wrote: >> > >> > Hi Andrew, >> > >>

Re: DSv2 & DataSourceRegister

2020-04-07 Thread Andrew Melo
oth interfaces. Thanks again, Andrew > > On Tue, Apr 7, 2020 at 12:26 PM Andrew Melo wrote: >> >> Hi all, >> >> I posted an improvement ticket in JIRA and Hyukjin Kwon requested I >> send an email to the dev list for discussion. >> >> As the DSv2

DSv2 & DataSourceRegister

2020-04-07 Thread Andrew Melo
Hi all, I posted an improvement ticket in JIRA and Hyukjin Kwon requested I send an email to the dev list for discussion. As the DSv2 API evolves, some breaking changes are occasionally made to the API. It's possible to split a plugin into a "common" part and multiple version-specific parts and

Re: Scala version compatibility

2020-04-06 Thread Andrew Melo
not included in my artifact except for this single callsite. Thanks Andrew > On Mon, Apr 6, 2020 at 4:16 PM Andrew Melo wrote: > >> >> >> On Mon, Apr 6, 2020 at 3:08 PM Koert Kuipers wrote: >> >>> yes it will >>> >>> >> Ooof, I

Re: Scala version compatibility

2020-04-06 Thread Andrew Melo
) Thanks for your help, Andrew > On Mon, Apr 6, 2020 at 3:50 PM Andrew Melo wrote: > >> Hello all, >> >> I'm aware that Scala is not binary compatible between revisions. I have >> some Java code whose only Scala dependency is the transitive dependency >> thr

Scala version compatibility

2020-04-06 Thread Andrew Melo
Hello all, I'm aware that Scala is not binary compatible between revisions. I have some Java code whose only Scala dependency is the transitive dependency through Spark. This code calls a Spark API which returns a Seq, which I then convert into a List with JavaConverters.seqAsJavaListConverter.

Optimizing LIMIT in DSv2

2020-03-30 Thread Andrew Melo
Hello, Executing "SELECT Muon_Pt FROM rootDF LIMIT 10", where "rootDF" is a temp view backed by a DSv2 reader yields the attached plan [1]. It appears that the initial stage is run over every partition in rootDF, even though each partition has 200k rows (modulo the last partition which holds the

Supporting Kryo registration in DSv2

2020-03-26 Thread Andrew Melo
Hello all, Is there a way to register classes within a datasourcev2 implementation in the Kryo serializer? I've attempted the following in both the constructor and static block of my toplevel class: SparkContext context = SparkContext.getOrCreate(); SparkConf conf =

Re: [DISCUSS] Remove multiple workers on the same host support from Standalone backend

2020-03-14 Thread Andrew Melo
dle and the desire to increase utilization. Thanks Andrew Sean > > On Fri, Mar 13, 2020 at 6:33 PM Andrew Melo wrote: > > > > Hi Xingbo, Sean, > > > > On Fri, Mar 13, 2020 at 12:31 PM Xingbo Jiang > wrote: > >> > >> Andrew, could you provide mor

Re: [DISCUSS] Remove multiple workers on the same host support from Standalone backend

2020-03-13 Thread Andrew Melo
dicated k8s/mesos/yarn clusters we use for prototyping > Thanks, > > Xingbo > > On Fri, Mar 13, 2020 at 10:23 AM Sean Owen wrote: > >> You have multiple workers in one Spark (standalone) app? this wouldn't >> prevent N apps from each having a worker on a machine. >>

Re: [DISCUSS] Remove multiple workers on the same host support from Standalone backend

2020-03-13 Thread Andrew Melo
Hello, On Fri, Feb 28, 2020 at 13:21 Xingbo Jiang wrote: > Hi all, > > Based on my experience, there is no scenario that necessarily requires > deploying multiple Workers on the same node with Standalone backend. A > worker should book all the resources reserved to Spark on the host it is >

Re: (java) Producing an in-memory Arrow buffer from a file

2020-01-24 Thread Andrew Melo
https://github.com/apache/arrow/tree/master/docs/source/java > > On Thu, Jan 23, 2020 at 5:02 AM Andrew Melo wrote: > >> Hello all, >> >> I work in particle physics, which has standardized on the ROOT ( >> http://root.cern) file format to store/process our data. The

(java) Producing an in-memory Arrow buffer from a file

2020-01-23 Thread Andrew Melo
Hello all, I work in particle physics, which has standardized on the ROOT ( http://root.cern) file format to store/process our data. The format itself is quite complicated, but the relevant part here is that after parsing/decompression, we end up with value and offset buffers holding our data.

Re: Reading 7z file in spark

2020-01-14 Thread Andrew Melo
It only makes sense if the underlying file is also splittable, and even then, it doesn't really do anything for you if you don't explicitly tell spark about the split boundaries On Tue, Jan 14, 2020 at 7:36 PM Someshwar Kale wrote: > I would suggest to use other compression technique which is

Re: how to get partition column info in Data Source V2 writer

2019-12-17 Thread Andrew Melo
Hi Aakash On Tue, Dec 17, 2019 at 12:42 PM aakash aakash wrote: > Hi Spark dev folks, > > First of all kudos on this new Data Source v2, API looks simple and it > makes easy to develop a new data source and use it. > > With my current work, I am trying to implement a new data source V2 writer >

Re: DSv2 reader lifecycle

2019-11-06 Thread Andrew Melo
they are created. > That's good to know, I'll search around JIRA for docs describing that functionality. Thanks again, Andrew > > rb > > On Tue, Nov 5, 2019 at 4:58 PM Andrew Melo wrote: > >> Hello, >> >> During testing of our DSv2 implementation (on 2.4.3 FW

DSv2 reader lifecycle

2019-11-05 Thread Andrew Melo
Hello, During testing of our DSv2 implementation (on 2.4.3 FWIW), it appears that our DataSourceReader is being instantiated multiple times for the same dataframe. For example, the following snippet Dataset df = spark .read()

Re: Exposing functions to pyspark

2019-10-08 Thread Andrew Melo
:48 PM Andrew Melo wrote: > > Hello, > > I'm working on a DSv2 implementation with a userbase that is 100% pyspark > based. > > There's some interesting additional DS-level functionality I'd like to > expose from the Java side to pyspark -- e.g. I/O metrics, which source

Re: Driver vs master

2019-10-07 Thread Andrew Melo
"local" master and "client mode" then yes tasks execute in the same JVM as the driver. The answer depends on the exact setup Amit has and how the application is configured > HTH... > > Ayan > > > > On Tue, Oct 8, 2019 at 12:11 PM Andrew Melo wrote:

Re: Driver vs master

2019-10-07 Thread Andrew Melo
a time. I understand that. I think there's a misunderstanding with the terminology, though. Are you running multiple separate spark instances on a single machine or one instance with multiple jobs inside. > > On Monday, October 7, 2019, Andrew Melo wrote: > >> Hi Amit >>

Re: Driver vs master

2019-10-07 Thread Andrew Melo
Hi Amit On Mon, Oct 7, 2019 at 18:33 Amit Sharma wrote: > Can you please help me understand this. I believe driver programs runs on master node. If we are running 4 spark jobs and driver memory config is 4g then total 16 gb would be used of master node. This depends on what master/deploy

Exposing functions to pyspark

2019-09-30 Thread Andrew Melo
Hello, I'm working on a DSv2 implementation with a userbase that is 100% pyspark based. There's some interesting additional DS-level functionality I'd like to expose from the Java side to pyspark -- e.g. I/O metrics, which source site provided the data, etc... Does someone have an example of

Re: Thoughts on Spark 3 release, or a preview release

2019-09-13 Thread Andrew Melo
Hi Spark Aficionados- On Fri, Sep 13, 2019 at 15:08 Ryan Blue wrote: > +1 for a preview release. > > DSv2 is quite close to being ready. I can only think of a couple issues > that we need to merge, like getting a fix for stats estimation done. I'll > have a better idea once I've caught up from

DSV2 API Question

2019-06-25 Thread Andrew Melo
Hello, I've (nearly) implemented a DSV2-reader interface to read particle physics data stored in the ROOT (https://root.cern.ch/) file format. You can think of these ROOT files as roughly parquet-like: column-wise and nested (i.e. a column can be of type "float[]", meaning each row in the column

Re: Detect executor core count

2019-06-18 Thread Andrew Melo
> >> case _: NoSuchElementException => >> >> // If spark.executor.cores is not defined, get the cores per JVM >> >> import spark.implicits._ >> >> val numMachineCores = spark.range(0, 1) >> >>
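The quoted Scala implements a "configured value, else ask the machine" fallback. The same shape in plain Python for illustration (the dict standing in for SparkConf is an assumption, not Spark's API):

```python
import os

def executor_cores(conf: dict) -> int:
    # Prefer the explicitly configured core count; fall back to asking
    # the machine, as the quoted Scala does when the key is missing.
    try:
        return int(conf["spark.executor.cores"])
    except KeyError:
        return os.cpu_count() or 1

assert executor_cores({"spark.executor.cores": "8"}) == 8
assert executor_cores({}) >= 1  # whatever the local machine reports
```

The caveat from the thread still applies: the machine-level count is the cores the JVM sees, not necessarily what the scheduler allocated to the executor.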

Detect executor core count

2019-06-18 Thread Andrew Melo
Hello, Is there a way to detect the number of cores allocated for an executor within a java-based InputPartitionReader? Thanks! Andrew

Re: DataSourceV2Reader Q

2019-05-21 Thread Andrew Melo
hich I improperly passing in instead of Metadata.empty() Thanks again, Andrew > > On Tue, May 21, 2019 at 11:39 AM Andrew Melo wrote: >> >> Hello, >> >> I'm developing a DataSourceV2 reader for the ROOT (https://root.cern/) >> file format to replace a previous DSV

DataSourceV2Reader Q

2019-05-21 Thread Andrew Melo
Hello, I'm developing a DataSourceV2 reader for the ROOT (https://root.cern/) file format to replace a previous DSV1 source that was in use before. I have a bare skeleton of the reader, which can properly load the files and pass their schema into Spark 2.4.3, but any operation on the resulting

Re: Anaconda installation with Pyspark/Pyarrow (2.3.0+) on cloudera managed server

2019-05-06 Thread Andrew Melo
Hi, On Mon, May 6, 2019 at 11:59 AM Gourav Sengupta wrote: > > Hence, what I mentioned initially does sound correct ? I don't agree at all - we've had a significant boost from moving to regular UDFs to pandas UDFs. YMMV, of course. > > On Mon, May 6, 2019 at 5:43 PM Andrew

Re: Anaconda installation with Pyspark/Pyarrow (2.3.0+) on cloudera managed server

2019-05-06 Thread Andrew Melo
Hi, On Mon, May 6, 2019 at 11:41 AM Patrick McCarthy wrote: > > Thanks Gourav. > > Incidentally, since the regular UDF is row-wise, we could optimize that a bit > by taking the convert() closure and simply making that the UDF. > > Since there's that MGRS object that we have to create too, we

Re: can't download 2.4.1 sourcecode

2019-04-22 Thread Andrew Melo
On Mon, Apr 22, 2019 at 10:54 PM yutaochina wrote:

Re: Connecting to Spark cluster remotely

2019-04-22 Thread Andrew Melo
Hi Rishkesh On Mon, Apr 22, 2019 at 4:26 PM Rishikesh Gawade wrote: > > To put it simply, what are the configurations that need to be done on the > client machine so that it can run driver on itself and executors on > spark-yarn cluster nodes? TBH, if it were me, I would simply SSH to the

DataSourceV2 exceptions

2019-04-08 Thread Andrew Melo
Hello, I'm developing a (java) DataSourceV2 to read a columnar fileformat popular in a number of physical sciences (https://root.cern.ch/). (I also understand that the API isn't fixed and subject to change). My question is -- what is the expected way to transmit exceptions from the DataSource up

Re: [ANNOUNCE] Announcing Apache Spark 2.4.1

2019-04-05 Thread Andrew Melo
On Fri, Apr 5, 2019 at 9:41 AM Jungtaek Lim wrote: > > Thanks Andrew for reporting this. I just submitted the fix. > https://github.com/apache/spark/pull/24304 Thanks! > > On Fri, Apr 5, 2019 at 3:21 PM Andrew Melo wrote: >> >> Hello, >> >> I'm not sur

Re: [ANNOUNCE] Announcing Apache Spark 2.4.1

2019-04-05 Thread Andrew Melo
Hello, I'm not sure if this is the proper place to report it, but the 2.4.1 version of the config docs apparently didn't render right into HTML (scroll down to "Compression and Serialization") https://spark.apache.org/docs/2.4.1/configuration.html#available-properties By comparison, the 2.4.0

Re: Where does the Driver run?

2019-03-25 Thread Andrew Melo
Hi Pat, Indeed, I don't think that it's possible to use cluster mode w/o spark-submit. All the docs I see appear to always describe needing to use spark-submit for cluster mode -- it's not even compatible with spark-shell. But it makes sense to me -- if you want Spark to run your application's

Re: Where does the Driver run?

2019-03-24 Thread Andrew Melo
Hi Pat, On Sun, Mar 24, 2019 at 1:03 PM Pat Ferrel wrote: > Thanks, I have seen this many times in my research. Paraphrasing docs: “in > deployMode ‘cluster' the Driver runs on a Worker in the cluster” > > When I look at logs I see 2 executors on the 2 slaves (executor 0 and 1 > with addresses

Re: SPIP: Accelerator-aware Scheduling

2019-03-01 Thread Andrew Melo
Hi, On Fri, Mar 1, 2019 at 9:48 AM Xingbo Jiang wrote: > > Hi Sean, > > To support GPU scheduling with YARN cluster, we have to update the hadoop > version to 3.1.2+. However, if we decide to not upgrade hadoop to beyond that > version for Spark 3.0, then we just have to disable/fallback the

Re: Feature request: split dataset based on condition

2019-02-04 Thread Andrew Melo
e'll need to calculate the sum of their 4-d momenta, while samples with <2 electrons will need subtract two different physical quantities -- several more steps before we get to the point where we'll histogram the different subsamples for the outputs. Cheers Andrew > > On Mon, Feb 4, 2019 at

Re: Feature request: split dataset based on condition

2019-02-04 Thread Andrew Melo
so it's possible we're not using it correctly. Cheers Andrew > rb > > On Mon, Feb 4, 2019 at 8:33 AM Andrew Melo wrote: >> >> Hello >> >> On Sat, Feb 2, 2019 at 12:19 AM Moein Hosseini wrote: >> > >> > I've seen many application need to split data

Re: Feature request: split dataset based on condition

2019-02-04 Thread Andrew Melo
Hello On Sat, Feb 2, 2019 at 12:19 AM Moein Hosseini wrote: > > I've seen many application need to split dataset to multiple datasets based > on some conditions. As there is no method to do it in one place, developers > use filter method multiple times. I think it can be useful to have method

Please stop asking to unsubscribe

2019-01-31 Thread Andrew Melo
The correct way to unsubscribe is to mail user-unsubscr...@spark.apache.org Just mailing the list with "unsubscribe" doesn't actually do anything... Thanks Andrew

Re: What are the alternatives to nested DataFrames?

2018-12-28 Thread Andrew Melo
Could you join() the DFs on a common key? On Fri, Dec 28, 2018 at 18:35 wrote: > Shabad , I am not sure what you are trying to say. Could you please give > me an example? The result of the Query is a Dataframe that is created after > iterating, so I am not sure how could I map that to a column

Questions about caching

2018-12-11 Thread Andrew Melo
Greetings, Spark Aficionados- I'm working on a project to (ab-)use PySpark to do particle physics analysis, which involves iterating with a lot of transformations (to apply weights and select candidate events) and reductions (to produce histograms of relevant physics objects). We have a basic

Re: SparkContext singleton get w/o create?

2018-08-27 Thread Andrew Melo
just getting started). > > On Mon, Aug 27, 2018 at 12:18 PM Andrew Melo wrote: >> >> Hi Holden, >> >> I'm agnostic to the approach (though it seems cleaner to have an >> explicit API for it). If you would like, I can take that JIRA and >> implement it

Re: SparkContext singleton get w/o create?

2018-08-27 Thread Andrew Melo
bly add `getActiveSession` to the PySpark > API (filed a starter JIRA https://issues.apache.org/jira/browse/SPARK-25255 > ) > > On Mon, Aug 27, 2018 at 12:09 PM Andrew Melo wrote: >> >> Hello Sean, others - >> >> Just to confirm, is it OK for client

Re: SparkContext singleton get w/o create?

2018-08-27 Thread Andrew Melo
, 2018 at 5:52 PM, Andrew Melo wrote: > Hi Sean, > > On Tue, Aug 7, 2018 at 5:44 PM, Sean Owen wrote: >> Ah, python. How about SparkContext._active_spark_context then? > > Ah yes, that looks like the right member, but I'm a bit wary about > depending on functionality

Re: SparkContext singleton get w/o create?

2018-08-07 Thread Andrew Melo
; and subject to change. Is that something I should be unconcerned about. The other thought is that the accesses with SparkContext are protected by "SparkContext._lock" -- should I also use that lock? Thanks for your help! Andrew > > On Tue, Aug 7, 2018 at 5:34 PM Andr

Re: SparkContext singleton get w/o create?

2018-08-07 Thread Andrew Melo
ion and causing a JVM to start. Is there an easy way to call getActiveSession that doesn't start a JVM? Cheers Andrew > > On Tue, Aug 7, 2018 at 5:11 PM Andrew Melo wrote: >> >> Hello, >> >> One pain point with various Jupyter extensions [1][2] that provide >&

SparkContext singleton get w/o create?

2018-08-07 Thread Andrew Melo
Hello, One pain point with various Jupyter extensions [1][2] that provide visual feedback about running spark processes is the lack of a public API to introspect the web URL. The notebook server needs to know the URL to find information about the current SparkContext. Simply looking for
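The ask here boils down to a read-only accessor on a class-level singleton slot, mirroring the shape of SparkContext._active_spark_context discussed later in the thread. A toy sketch of that pattern (the class and URL are invented, not Spark's actual API):

```python
class Context:
    """Illustrates 'get without create': a class-level slot that the
    constructor fills and an accessor merely inspects."""

    _active = None

    def __init__(self, url):
        self.web_url = url
        Context._active = self

    @classmethod
    def get_active(cls):
        # Introspect only: never constructs, so (in the Spark analogy)
        # never starts a JVM as a side effect.
        return cls._active

assert Context.get_active() is None        # nothing running yet
ctx = Context("http://localhost:4040")
assert Context.get_active() is ctx         # now visible without creating
```

A notebook extension would call only get_active(), treating None as "no context yet" rather than eagerly creating one.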

[slurm-users] Stagein/Stageout

2018-01-05 Thread Andrew Melo
, but that doesn't seem to be the same. Any suggestions? Andrew -- -- Andrew Melo

Re: [foreman-users] Strange behavior connecting to proxy

2017-02-13 Thread Andrew Melo
Hello, Sorry for the delay responding. That was just it. I was confused because telnet on the same machine worked. Cheers, Andrew On Thu, Jan 26, 2017 at 2:22 AM, Dominic Cleal <domi...@cleal.org> wrote: > On 24/01/17 16:57, Andrew Melo wrote: >> Hi all, >> >> I have

[foreman-users] Strange behavior connecting to proxy

2017-01-24 Thread Andrew Melo
Hi all, I have a proof-of-concept foreman instance running in a VM with a proxy on the machine connected to our IPMI network. For some reason, trying to add the proxy to my foreman instance yields: Unable to communicate with the proxy: ERF12-2530 [ProxyAPI::ProxyException]: Unable to detect

Re: [ANNOUNCE] New Parquet PMC member - Wes McKinney

2016-08-23 Thread Andrew Melo
Congrats, Wes! On Tue, Aug 23, 2016 at 11:12 AM, Julien Le Dem <jul...@dremio.com> wrote: > On behalf of the Apache Parquet PMC I'm pleased to announce that > Wes McKinney has accepted to join the PMC. > > Welcome Wes! > > -- > Julien

Re: [CMake] Does Makefile generated by CMake support make -jN?

2016-07-12 Thread Andrew Melo

Re: Custom metrics

2016-07-11 Thread Andrew Melo
rg/display/JENKINS/Plot+Plugin (though > I've never used it) > This works great! Thanks. > > --- Matt > > On Sunday, May 22, 2016 at 9:22:27 PM UTC-7, Andrew Melo wrote: >> >> Hello, >> >> I had two short questions about what was possible with pip

New NPE with ghprb?

2016-07-11 Thread Andrew Melo
merging of code. What can be done to get more information about this? Everything else works up until that point, so I assume there's not a connectivity problem between my host and GH, so I'm unsure of where to start searching. Thanks! Andrew

Re: [CMake] CPack: setting multiple RPM pre/post (un)install scripts

2016-05-29 Thread Andrew Melo
Hi On Sunday, May 29, 2016, Thomas Teo wrote: > Hi All, > In building an RPM package, I'd like to set multiple (un)install scripts - > an application specific one which starts services that are installed by the > RPM, and also to call ldconfig so that the shared library

Custom metrics

2016-05-22 Thread Andrew Melo
ks! Andrew

Re: Pipeline: "Branch not mergable"

2016-04-26 Thread Andrew Melo
ment to GH for the branch source and the GH Api plugins -- hopefully that will close the loop. Thanks, Andrew On Monday, April 25, 2016 at 12:40:36 PM UTC-5, Andrew Melo wrote: > > Hi all, > > Fairly often (~50% of the time), when I push a feature branch up to > GH, my multibranch c

Re: Trigger arbitrary build/post-build steps from pipeline

2016-04-25 Thread Andrew Melo
ping? Is the shiny new pipeline functionality simply incompatible with all of the existing plugins? On Thursday, April 21, 2016 at 2:34:00 PM UTC-5, Andrew Melo wrote: > > Hi, > > I've dug a bit more, perhaps that will help find a solution. It appears > the "step" groo

Pipeline: "Branch not mergable"

2016-04-25 Thread Andrew Melo
master (I'm the only one commiting to master currently, no rebases of the feature branch, etc..), and there's no other status tests that would be keeping it from building. It seems like the actual commit notification is getting scrambled. Has anyone else seen this? Cheers, Andrew
