[GraphFrames Spark Package]: Why is there not a distribution for Spark 3.3?

2024-01-12 Thread Boileau, Brad

[Spark Web UI] Integrating Keycloak SSO

2022-04-18 Thread Solomon, Brad
As outlined at https://issues.apache.org/jira/browse/SPARK-38693 and https://stackoverflow.com/q/71667296/7954504, we are attempting to integrate Keycloak Single Sign On with the Spark Web UI. However, Spark errors
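
For reference, Web UI authentication in Spark typically hangs off the generic spark.ui.filters hook. Below is a minimal sketch of wiring that up, assuming Keycloak's servlet-filter adapter is on the driver classpath; the filter class and its keycloak.config.file parameter come from Keycloak's adapter, and this is not a verified working configuration (the thread reports errors with exactly this kind of setup).

    from pyspark.sql import SparkSession

    # Route Web UI requests through a servlet filter; filter parameters use
    # Spark's spark.<filter class>.param.<name> convention. The class and
    # config path below are assumptions to verify against your Keycloak version.
    spark = (
        SparkSession.builder
        .config("spark.ui.filters",
                "org.keycloak.adapters.servlet.KeycloakOIDCFilter")
        .config("spark.org.keycloak.adapters.servlet.KeycloakOIDCFilter"
                ".param.keycloak.config.file",
                "/etc/keycloak/keycloak.json")
        .getOrCreate()
    )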

Proper use of map(..., Encoder)

2016-12-14 Thread Brad Cox
ctors.dense(v); return featureVec; } }, Encoders.bean(Vector.class));

NoClassDefFoundError: org/apache/spark/Logging in SparkSession.getOrCreate

2016-10-15 Thread Brad Cox
spark-sql_2.11 2.0.1

Re: spark.executor.cores

2016-07-15 Thread Brad Cox
a bunch of random experimenting around.

Re: Infinite recursion in createDataFrame for avro types

2016-04-11 Thread Brad Cox
() { return SCHEMA$; }. Does anybody have any idea what's causing this and how to get around it? > On Apr 10, 2016, at 12:51 PM, Brad Cox <bradj...@gmail.com> wrote: > > I'm getting a StackOverflowError from inside the

Infinite recursion in createDataFrame for avro types

2016-04-10 Thread Brad Cox
an", "doc": "empty, always unset"}, { "name": "missedbytes", "type": "int", "doc": "Number of missing bytes in content gaps"}, { "name": "history", "type": "stri

Re: Cluster launch

2015-02-13 Thread Brad
I'm playing around with Spark on Windows and have two worker nodes running by starting them manually using a script that contains the following:

    set SPARK_HOME=C:\dev\programs\spark-1.2.0
    set SPARK_MASTER_IP=master.brad.com
    spark-class org.apache.spark.deploy.worker.Worker

Re: Spark nature of file split

2015-02-10 Thread Brad
Have you been able to confirm this behaviour since posting? Have you tried this out on multiple workers and viewed their memory consumption? I'm new to Spark and don't have a cluster to play with at present, and want to do similar loading from NFS files. My understanding is that calls to
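
For readers following this thread: loading from a shared filesystem like NFS goes through the same textFile path as HDFS, with the partition count controlled by the minPartitions hint. A minimal sketch, assuming an NFS mount visible at the same path on every worker and a live SparkContext sc as in the pyspark shell (the path below is illustrative):

    # Each partition is read lazily by whichever worker runs the task, so no
    # single machine holds the whole file; minPartitions is only a hint.
    rdd = sc.textFile("file:///mnt/nfs/data/big.log", minPartitions=64)
    print(rdd.getNumPartitions())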

Repartition Memory Leak

2015-01-04 Thread Brad Willard
I have a 10-node cluster with 600 GB of RAM. I'm loading a fairly large dataset from json files. When I load the dataset it is about 200 GB, however it only creates 60 partitions. I'm trying to repartition to 256 to increase cpu utilization, however when I do that it balloons in memory to way over 2x
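
A minimal sketch of the workflow described, using illustrative paths and the later DataFrame API for clarity; repartitioning to a higher partition count forces a full shuffle, which is the step where the memory blow-up reported here would occur:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.json("hdfs:///data/events/")  # loads with few, large partitions
    df = df.repartition(256)                      # full shuffle to 256 partitions
    df.count()                                    # materializes the shuffle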

PySpark Loading Json Followed by groupByKey seems broken in spark 1.1.1

2014-12-06 Thread Brad Willard
When I run a groupByKey it seems to create a single task after the groupByKey that never stops executing. I'm loading a smallish json dataset that is 4 million. This is the code I'm running:

    rdd = sql_context.jsonFile(uri)
    rdd = rdd.cache()
    grouped = rdd.map(lambda row: (row.id,
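
The lambda above is cut off in the archive. A hedged reconstruction of the pattern it appears to implement (the key choice and the follow-up aggregation are assumptions for illustration):

    # Key each row by id and group. With a skewed or low-cardinality key, a
    # single post-shuffle task can receive most of the data, which matches
    # the "one task that never finishes" symptom reported above.
    grouped = rdd.map(lambda row: (row.id, row)).groupByKey()
    print(grouped.mapValues(lambda rows: sum(1 for _ in rows)).take(5))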

driver memory management

2014-09-28 Thread Brad Miller
am using PySpark (so much of my processing happens outside of the allocated memory in Java) and running the Spark 1.1.0 release binaries. best, -Brad

Re: java.io.IOException Error in task deserialization

2014-09-26 Thread Brad Miller
I've had multiple jobs crash due to java.io.IOException: unexpected exception type; I've been running the 1.1 branch for some time and am now running the 1.1 release binaries. Note that I only use PySpark. I haven't kept detailed notes or the tracebacks around since there are other problems that

Re: java.io.IOException Error in task deserialization

2014-09-26 Thread Brad Miller
26, 2014 at 1:32 PM, Brad Miller bmill...@eecs.berkeley.edu wrote: I've had multiple jobs crash due to java.io.IOException: unexpected exception type; I've been running the 1.1 branch for some time and am now running the 1.1 release binaries. Note that I only use PySpark. I haven't kept

Re: java.lang.NegativeArraySizeException in pyspark

2014-09-26 Thread Brad Miller
dav...@databricks.com wrote: On Thu, Sep 25, 2014 at 11:25 AM, Brad Miller bmill...@eecs.berkeley.edu wrote: Hi Davies, Thanks for your help. I ultimately re-wrote the code to use broadcast variables, and then received an error when trying to broadcast self.all_models that the size did

Re: java.lang.NegativeArraySizeException in pyspark

2014-09-25 Thread Brad Miller
for a while since there was no way to unpersist them in pyspark, but now that there is you're completely right that using broadcast is the correct way to code this. best, -Brad On Tue, Sep 23, 2014 at 12:16 PM, Davies Liu dav...@databricks.com wrote: Or maybe there is a bug related to the base64 in py4j
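
A minimal sketch of the broadcast-then-unpersist pattern the thread converges on (variable names and data are illustrative; assumes a live SparkContext sc). Broadcast.unpersist() is the then-new PySpark capability being referenced:

    # Ship a large read-only object to the workers once, use it in tasks,
    # then free the copies on the workers when it is no longer needed.
    all_models = {"m1": [0.1, 0.2], "m2": [0.3]}   # stand-in for the real models
    bc = sc.broadcast(all_models)
    scores = sc.parallelize(range(100)).map(lambda x: (x, len(bc.value))).collect()
    bc.unpersist()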

java.util.NoSuchElementException: key not found

2014-09-16 Thread Brad Miller

Re: When does Spark switch from PROCESS_LOCAL to NODE_LOCAL or RACK_LOCAL?

2014-09-14 Thread Brad Miller
somewhere and it would be great to persist it somewhere besides the mailing list. best, -Brad On Fri, Sep 12, 2014 at 12:12 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Andrew, This email was pretty helpful. I feel like this stuff should be summarized in the docs somewhere, or perhaps

Re: coalesce on SchemaRDD in pyspark

2014-09-12 Thread Brad Miller
._jschema_rdd.coalesce(N, False, None), sqlCtx) Note: _schema_rdd → _jschema_rdd, false → False. That workaround seems to work fine (I've observed the correct number of partitions in the web-ui, although I haven't tested it beyond that). Thanks! -Brad On Thu, Sep 11, 2014 at 11:30 PM, Davies Liu dav
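
The head of the suggested workaround is cut off above; a hedged reconstruction of how it would be applied under the Spark 1.1 SchemaRDD API (data and N are illustrative; assumes a live SparkContext sc):

    from pyspark.sql import SQLContext, SchemaRDD

    # Coalesce the underlying Java SchemaRDD directly, then re-wrap it in a
    # Python SchemaRDD. The False flag disables shuffle, as in RDD.coalesce.
    sqlCtx = SQLContext(sc)
    srdd = sqlCtx.jsonRDD(sc.parallelize(['{"a": 1}', '{"a": 2}']))
    N = 16
    coalesced = SchemaRDD(srdd._jschema_rdd.coalesce(N, False, None), sqlCtx)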

coalesce on SchemaRDD in pyspark

2014-09-11 Thread Brad Miller
. Unfortunately each new table has the same number of partitions as the original (despite being much smaller). Hence my interest in coalesce and repartition. Has anybody else encountered this bug? Is there an alternate workflow I should consider? I am running the 1.1.0 binaries released today. best, -Brad

Re: TimeStamp selection with SparkSQL

2014-09-05 Thread Brad Miller
My approach may be partly influenced by my limited experience with SQL and Hive, but I just converted all my dates to seconds-since-epoch and then selected samples from specific time ranges using integer comparisons. On Thu, Sep 4, 2014 at 6:38 PM, Cheng, Hao hao.ch...@intel.com wrote: There
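
A minimal sketch of that epoch-seconds approach (table and column names are illustrative, and assume a table already registered with sqlCtx):

    # Select a time window with plain integer comparisons against a column
    # that was pre-converted to epoch seconds.
    rows = sqlCtx.sql(
        "SELECT * FROM events "
        "WHERE ts_epoch >= 1409875200 AND ts_epoch < 1409961600"  # Sep 5-6, 2014 UTC
    )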

Re: TimeStamp selection with SparkSQL

2014-09-05 Thread Brad Miller
to back a table in a SQLContext. On Fri, Sep 5, 2014 at 9:53 AM, Benjamin Zaitlen quasi...@gmail.com wrote: Hi Brad, When you do the conversion is this a Hive/Spark job or is it a pre-processing step before loading into HDFS? ---Ben On Fri, Sep 5, 2014 at 10:29 AM, Brad Miller bmill

Re: Spark webUI - application details page

2014-08-29 Thread Brad Miller
. : java.io.IOException: Call to crosby.research.intel-research.net/10.212.84.53:54310 failed on local exception: java.io.EOFException -Brad On Thu, Aug 28, 2014 at 12:26 PM, SK skrishna...@gmail.com wrote: I was able to recently solve this problem for standalone mode. For this mode, I did not use a history

/tmp/spark-events permissions problem

2014-08-29 Thread Brad Miller
with permissions on the spark.eventLog.dir directory? best, -Brad

Re: Spark webUI - application details page

2014-08-28 Thread Brad Miller
running pyspark. Do you know what may be causing this? When you attempt to reproduce locally, who do you observe owns the files in /tmp/spark-events? best, -Brad On Tue, Aug 26, 2014 at 8:51 AM, SK skrishna...@gmail.com wrote: I have already tried setting the history server and accessing

Re: Spark webUI - application details page

2014-08-15 Thread Brad Miller
? -Brad On Thu, Aug 14, 2014 at 7:33 PM, Andrew Or and...@databricks.com wrote: Hi SK, Not sure if I understand you correctly, but here is how the user normally uses the event logging functionality: After setting spark.eventLog.enabled and optionally spark.eventLog.dir, the user runs his/her
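
The settings Andrew refers to, in a minimal sketch (the directory is illustrative; it must exist and be writable by the submitting user, which is the permissions problem raised elsewhere in this archive):

    from pyspark import SparkConf, SparkContext

    # Enable event logging so finished applications show details in the UI.
    conf = (SparkConf()
            .set("spark.eventLog.enabled", "true")
            .set("spark.eventLog.dir", "file:///tmp/spark-events"))
    sc = SparkContext(conf=conf)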

SPARK_DRIVER_MEMORY

2014-08-14 Thread Brad Miller
, or somewhere to look in the source for an exhaustive list of environment variable configuration options? best, -Brad

SPARK_LOCAL_DIRS

2014-08-14 Thread Brad Miller
, or if I am perhaps using it incorrectly? best, -Brad

trouble with saveAsParquetFile

2014-08-07 Thread Brad Miller
on the documentation and the examples that work, it seems like the failing examples are probably meant to be supported features. I was unable to find an open issue for this. Does anybody know if there is an open issue, or whether an issue should be created? best, -Brad
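
For context, the call under discussion, sketched against the 1.0-era PySpark API (data and path are illustrative; assumes a live SparkContext sc and SQLContext sqlCtx as in the pyspark shell):

    # Infer a schema from an RDD of dicts, then write the result as Parquet.
    srdd = sqlCtx.inferSchema(sc.parallelize([{"name": "a", "n": 1}]))
    srdd.saveAsParquetFile("/tmp/out.parquet")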

Re: trouble with saveAsParquetFile

2014-08-07 Thread Brad Miller
Thanks Yin! best, -Brad On Thu, Aug 7, 2014 at 1:39 PM, Yin Huai yh...@databricks.com wrote: Hi Brad, It is a bug. I have filed https://issues.apache.org/jira/browse/SPARK-2908 to track it. It will be fixed soon. Thanks, Yin On Thu, Aug 7, 2014 at 10:55 AM, Brad Miller bmill

pyspark inferSchema

2014-08-05 Thread Brad Miller
to this problem in PySpark? I'm running the 1.0.1 release. -Brad
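
For context, a minimal sketch of the API in question (illustrative data; assumes sc and sqlCtx from the pyspark shell). The thread turns on how inferSchema deduces column types from the records it examines:

    # Records whose fields disagree in type are exactly where inference
    # behavior matters; the second record's "baz" is a string, not an int.
    rdd = sc.parallelize([{"foo": "bar", "baz": 1},
                          {"foo": "boom", "baz": "x"}])
    srdd = sqlCtx.inferSchema(rdd)
    srdd.printSchema()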

Re: pyspark inferSchema

2014-08-05 Thread Brad Miller
the difference in the schema. Are you running the 1.0.1 release, or a more bleeding-edge version from the repository? best, -Brad On Tue, Aug 5, 2014 at 11:01 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: I was just about to ask about this. Currently, there are two methods

trouble with jsonRDD and jsonFile in pyspark

2014-08-05 Thread Brad Miller
as well as the full traceback below. Note that I don't have any problem when I parse the JSON myself and use inferSchema. Is anybody able to reproduce this bug? -Brad

    srdd = sqlCtx.jsonRDD(sc.parallelize(['{"foo": "bar", "baz": [1,2,3]}',
                                          '{"foo": "boom", "baz": [1,2,3]}']))
    srdd.printSchema()
    root
     |-- baz

Re: pyspark inferSchema

2014-08-05 Thread Brad Miller
Got it. Thanks! On Tue, Aug 5, 2014 at 11:53 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Notice the difference in the schema. Are you running the 1.0.1 release, or a more bleeding-edge version from the repository? Yep, my bad. I’m running off master at commit

Re: trouble with jsonRDD and jsonFile in pyspark

2014-08-05 Thread Brad Miller
Nick: Thanks for both the original JIRA bug report and the link. Michael: This is on the 1.0.1 release. I'll update to master and follow-up if I have any problems. best, -Brad On Tue, Aug 5, 2014 at 12:04 PM, Michael Armbrust mich...@databricks.com wrote: Is this on 1.0.1? I'd suggest

Re: pyspark inferSchema

2014-08-05 Thread Brad Miller
It sounds like updating to master may help address my issue (and may also make the sample argument available), so I'm going to go ahead and do that. best, -Brad On Tue, Aug 5, 2014 at 12:01 PM, Davies Liu dav...@databricks.com wrote: On Tue, Aug 5, 2014 at 11:01 AM, Nicholas Chammas nicholas.cham

Re: pyspark inferSchema

2014-08-05 Thread Brad Miller
-2376 best, -brad On Tue, Aug 5, 2014 at 12:18 PM, Davies Liu dav...@databricks.com wrote: This sample argument of inferSchema is still not in master; I will try to add it if it makes sense. On Tue, Aug 5, 2014 at 12:14 PM, Brad Miller bmill...@eecs.berkeley.edu wrote: Hi Davies, Thanks

Re: trouble with jsonRDD and jsonFile in pyspark

2014-08-05 Thread Brad Miller
anybody else verify that the second example still crashes (and is meant to work)? If so, would it be best to modify JIRA-2376 or start a new bug? https://issues.apache.org/jira/browse/SPARK-2376 best, -Brad On Tue, Aug 5, 2014 at 12:10 PM, Brad Miller bmill...@eecs.berkeley.edu wrote: Nick

Re: trouble with jsonRDD and jsonFile in pyspark

2014-08-05 Thread Brad Miller
: [["item0", "item1"], ["item2", "item3"]]}'])).collect() [Row(key0=None), Row(key0=[[u'item0', u'item1'], [u'item2', u'item3']])] Is anyone able to replicate this behavior? -Brad On Tue, Aug 5, 2014 at 6:11 PM, Michael Armbrust mich...@databricks.com wrote: We try to keep master very stable

Re: pyspark inferSchema

2014-08-05 Thread Brad Miller
the proper schema. We will take a look at it. Thanks, Yin On Tue, Aug 5, 2014 at 12:20 PM, Brad Miller bmill...@eecs.berkeley.edu wrote: Assuming updating to master fixes the bug I was experiencing with jsonRDD and jsonFile, then pushing sample to master will probably not be necessary

Re: trouble with jsonRDD and jsonFile in pyspark

2014-08-05 Thread Brad Miller
I concur that printSchema works; trouble just seems to arise in operations that actually use the data. Thanks for posting the bug. -Brad On Tue, Aug 5, 2014 at 10:05 PM, Yin Huai yh...@databricks.com wrote: I tried jsonRDD(...).printSchema() and it worked. Seems the problem is when we

Re: Announcing Spark 1.0.1

2014-07-12 Thread Brad Miller
to get confused and miss out. http://spark.apache.org/docs/1.0.1/api/python/index.html best, -Brad On Fri, Jul 11, 2014 at 8:44 PM, Henry Saputra henry.sapu...@gmail.com wrote: Congrats to the Spark community ! On Friday, July 11, 2014, Patrick Wendell pwend...@gmail.com wrote: I am happy

odd caching behavior or accounting

2014-06-30 Thread Brad Miller
-kkeo3mxl1qux0fnz0depku4ffbxo2w_5cdmtrzykufh...@mail.gmail.com%3e/2 and PNG http://mail-archives.apache.org/mod_mbox/spark-user/201406.mbox/raw/%3ccanr-kkeo3mxl1qux0fnz0depku4ffbxo2w_5cdmtrzykufh...@mail.gmail.com%3e/3 files originally attached are now available, linked from the Apache archive. best, -Brad

pyspark bug with unittest and scikit-learn

2014-06-19 Thread Brad Miller
or if it is significant. best, -Brad

pyspark join crash

2014-06-04 Thread Brad Miller
). Can anybody confirm if this is the behavior of pyspark? I am glad to supply additional details about my observed behavior upon request. best, -Brad

Re: pyspark join crash

2014-06-04 Thread Brad Miller
that entire blocks are loaded from the writers, then it would seem like there is some significant overhead which is chewing through lots of memory (perhaps similar to the problem with python broadcast variables chewing through memory https://spark-project.atlassian.net/browse/SPARK-1065). -Brad On Wed, Jun

Re: Hung inserts?

2014-04-21 Thread Brad Heller
that give it a go. Mayur Rustagi On Mon, Apr 21, 2014 at 3:20 AM, Brad Heller brad.hel...@gmail.com wrote: Hey list, I've got some CSV data I'm importing from S3. I can create the external

Re: Hung inserts?

2014-04-21 Thread Brad Heller
wrong?? https://gist.github.com/bradhe/11159123 On Mon, Apr 21, 2014 at 3:31 PM, Brad Heller brad.hel...@gmail.com wrote: I tried removing the CLUSTERED directive and get the same results :( I also removed SORTED, same deal. I'm going to try removing partitioning altogether for now

Re: Spark - ready for prime time?

2014-04-10 Thread Brad Miller
I would echo much of what Andrew has said. I manage a small/medium sized cluster (48 cores, 512 GB RAM, 512 GB disk space dedicated to spark, data storage in separate HDFS shares). I've been using spark since 0.7, and as with Andrew I've observed significant and consistent improvements in stability

Re: Spark - ready for prime time?

2014-04-10 Thread Brad Miller
4. Shuffle on disk Is it true - I couldn't find it in official docs, but did see this mentioned in various threads - that shuffle _always_ hits disk? (Disregarding OS caches.) Why is this the case? Are you planning to add a function to do shuffle in memory or are there some intrinsic reasons

Re: trouble with join on large RDDs

2014-04-09 Thread Brad Miller
in Soda Hall (Berkeley) so if anyone near by is interested to examine this first hand I am glad to meet up. best, -Brad On Wed, Apr 9, 2014 at 4:21 AM, Andrew Ash and...@andrewash.com wrote: A JVM can easily be limited in how much memory it uses with the -Xmx parameter, but Python doesn't have

pyspark broadcast error

2014-03-11 Thread Brad Miller
with X = 29 seemed different and I was wondering if anybody had any insight. -Brad

*Program*

    from pyspark import SparkContext
    SparkContext.setSystemProperty('spark.executor.memory', '25g')
    sc = SparkContext('spark://crosby.research.intel-research.net:7077', 'FeatureExtraction')
    meg_512 = range((2
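
The definition of meg_512 is truncated in the archive. A hedged sketch of the kind of experiment the program appears to run (sizes and names are illustrative, not the original code; assumes the SparkContext sc built above):

    # Build a large list, broadcast it, and touch it from a few tasks --
    # the operation this thread reports failing once the payload gets big.
    payload = list(range(16 * 1024 * 1024))   # a few hundred MB in memory
    bc = sc.broadcast(payload)
    print(sc.parallelize(range(4)).map(lambda _: len(bc.value)).collect())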