Re: [GraphFrames Spark Package]: Why is there not a distribution for Spark 3.3?

2024-03-15 Thread Russell Jurney
There is an implementation for Spark 3, but GraphFrames isn't released often enough to match every point version. It supports Spark 3.4. Try it - it will probably work. https://spark-packages.org/package/graphframes/graphframes Thanks, Russell Jurney @rjurney <http://twitter.com/rjur
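For anyone trying this, loading GraphFrames is usually just a matter of matching the package coordinates to your Spark and Scala versions. A sketch — the artifact version below is a placeholder, so check the spark-packages.org page linked above for the one matching your build:

```shell
# Placeholder coordinates: pick the artifact that matches your Spark/Scala
# build from https://spark-packages.org/package/graphframes/graphframes
pyspark --packages graphframes:graphframes:0.8.3-spark3.4-s_2.12
```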

Re: read a binary file and save in another location

2023-03-09 Thread Russell Jurney
Yeah, that's the right answer! Thanks, Russell Jurney

Re: read a binary file and save in another location

2023-03-09 Thread Russell Jurney
ocumented <https://spark.apache.org/docs/3.1.3/api/java/org/apache/spark/sql/DataFrameWriter.html#format-java.lang.String-> . Russell Jurney

Re: [New Project] sparksql-ml : Distributed Machine Learning using SparkSQL.

2023-02-27 Thread Russell Jurney
I think it is awesome. Brilliant interface that is missing from Spark. Would you integrate with something like MLFlow? Thanks, Russell Jurney

Re: [PySpark SQL] New column with the maximum of multiple terms?

2023-02-24 Thread Russell Jurney
Oliver, just curious: did you get a clean error message when you broke it out into separate statements? Thanks, Russell Jurney

Re: [PySpark SQL] New column with the maximum of multiple terms?

2023-02-23 Thread Russell Jurney
Usually, the solution to these problems is to do less per line, break it out and perform each minute operation as a field, then combine those into a final answer. Can you do that here? Thanks, Russell Jurney
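The decomposition idea above — name each intermediate term, then combine them in one simple step — can be sketched in plain Python (in PySpark you would give each term its own column with `withColumn` and combine with `pyspark.sql.functions.greatest`; all names below are made up for illustration):

```python
# Plain-Python sketch of "do less per line": compute each minute operation
# as its own named field first, then derive the final answer from those.
records = [{"base": 3, "bonus": 7, "penalty": 5}]

def with_max_term(rec):
    # Each intermediate term gets its own field...
    terms = {
        "term_a": rec["base"] * 2,     # 6
        "term_b": rec["bonus"] - 1,    # 6
        "term_c": rec["penalty"],      # 5
    }
    # ...then combining them is one trivial, easy-to-debug step.
    return {**rec, **terms, "max_term": max(terms.values())}

result = [with_max_term(r) for r in records]
```

If one of the terms blows up, the failure now points at a single named field instead of one giant expression.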

Re: Check if shuffle is caused for repartitioned pyspark dataframes

2022-12-23 Thread Russell Jurney
> 2. Using base dataframes itself (without explicit repartitioning) to >> perform join+aggregatio >> > >> > I have a StackOverflow post with more details regarding the same: >> > https://stackoverflow.com/q/74771971/14741697 >> > >> > Thanks in a

Re: Pipelined execution in Spark (???)

2022-09-07 Thread Russell Jurney
d benefit (a lot) from it? >> >> Thanks, >> >> --- Sungwoo >> >> On Thu, Sep 8, 2022 at 1:47 AM Russell Jurney >> wrote: >> >>> I don't think Spark can do this with its current architecture. It has to >>> wait for the step to be done, speculative exec

Re: Pipelined execution in Spark (???)

2022-09-07 Thread Russell Jurney
Oops, it has been long since Russell labored on Hadoop, speculative execution isn’t the right term - that is something else. Cascading has a declarative interface so you can plan more, whereas Spark is more imperative. Point remains :) On Wed, Sep 7, 2022 at 3:56 PM Russell Jurney wrote: >

Re: Pipelined execution in Spark (???)

2022-09-07 Thread Russell Jurney
park jobs would benefit (a lot) from it? > > Thanks, > > --- Sungwoo > > On Thu, Sep 8, 2022 at 1:47 AM Russell Jurney > wrote: > >> I don't think Spark can do this with its current architecture. It has to >> wait for the step to be done, speculative execution

Re: Pipelined execution in Spark (???)

2022-09-07 Thread Russell Jurney
I don't think Spark can do this with its current architecture. It has to wait for the step to be done, speculative execution isn't possible. Others probably know more about why that is. Thanks, Russell Jurney

Re: Profiling PySpark Pandas UDF

2022-08-25 Thread Russell Jurney
YOU know what you're talking about and aren't hacking a solution. You are my new friend :) Thank you, this is incredibly helpful! Thanks, Russell Jurney

Re: Profiling PySpark Pandas UDF

2022-08-25 Thread Russell Jurney
consistent, measurable and valid results! :) Russell Jurney On Thu, Aug 25, 2022 at 10:00 AM Sean Owen wrote: > It's important to realize that while pandas UDFs and pandas on Spark are > both related to pandas, they are not themselves directly related. The first > lets you use pandas wit

Re: Can't load a RandomForestClassificationModel in Spark job

2017-02-16 Thread Russell Jurney
/make_predictions_streaming.py I had to create a pyspark.sql.Row in a map operation in an RDD before I call spark.createDataFrame. Check out lines 92-138. Not sure if this helps, but I thought I'd give it a try ;) --- Russell Jurney

Re: Spark / Elasticsearch Error: Maybe ES was overloaded? How to throttle down Spark as it writes to ES

2017-01-18 Thread Russell Jurney
never found them too much use. > > check out these settings, maybe they are of some help: > es.batch.size.bytes > es.batch.size.entries > es.http.timeout > es.batch.write.retry.count > es.batch.write.retry.wait > > > On Tue, Jan 17, 2017 at 10:13 PM, Russell Jurney <rus
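The elasticsearch-hadoop settings named in that reply control bulk-write batch size and retry behavior, which is how you throttle Spark's writes down to what ES can absorb. The values below are illustrative only — tune them for your cluster:

```
# Smaller bulk requests, so each one is easier for ES to absorb
es.batch.size.entries=500
es.batch.size.bytes=1mb
# Retry with a pause instead of failing the task when ES pushes back
es.batch.write.retry.count=10
es.batch.write.retry.wait=30s
es.http.timeout=5m
```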

Spark / Elasticsearch Error: Maybe ES was overloaded? How to throttle down Spark as it writes to ES

2017-01-17 Thread Russell Jurney
://discuss.elastic.co/t/spark-elasticsearch-exception-maybe-es-was-overloaded/71932 Thanks! -- Russell Jurney

In PySpark ML, how can I interpret the SparseVector returned by a pyspark.ml.classification.RandomForestClassificationModel.featureImportances ?

2016-12-21 Thread Russell Jurney
on Github here <https://github.com/rjurney/Agile_Data_Code_2/blob/master/ch09/Debugging%20Prediction%20Problems.ipynb>, skip to the end. Stack Overflow post: http://stackoverflow.com/questions/41273893/in-pyspark-ml-how-can-i-interpret-the-sparsevector-returned-by-a-pyspark-ml-cla Thanks!

Re: What is the deployment model for Spark Streaming? A specific example.

2016-12-17 Thread Russell Jurney
Anyone? This is for a book, so I need to figure this out. On Fri, Dec 16, 2016 at 12:53 AM Russell Jurney <russell.jur...@gmail.com> wrote: > I have created a PySpark Streaming application that uses Spark ML to > classify flight delays into three categories: on-time, slightly late,

What is the deployment model for Spark Streaming? A specific example.

2016-12-16 Thread Russell Jurney
, maybe that is the problem? ssc.start() ssc.awaitTermination() What is the actual deployment model for Spark Streaming? All I know to do right now is to restart the PID. I'm new to Spark, and the docs don't really explain this (that I can see). Thanks! -- Russell Jurney

Re: Spark ML DataFrame API - need cosine similarity, how to convert to RDD Vectors?

2016-11-16 Thread Russell Jurney
you might convert the dataframe to > an rdd using something like this: > > df > .toJavaRDD() > .map(row -> (SparseVector)row.getAs(row.fieldIndex("columnName"))); > > On Tue, Nov 15, 2016 at 1:06 PM, Russell Jurney <russell.jur...@gmail.com> > w

Spark ML DataFrame API - need cosine similarity, how to convert to RDD Vectors?

2016-11-15 Thread Russell Jurney
, but haven't found anything. Thanks! -- Russell Jurney

Parquet compression jars not found - both snappy and lzo - PySpark 2.0.0

2016-09-27 Thread Russell Jurney
://gist.github.com/rjurney/6783d19397cf3b4b88af3603d6e256bd -- Russell Jurney

Re: Automating lengthy command to pyspark with configuration?

2016-08-29 Thread Russell Jurney
I've got most of it working through spark.jars On Sunday, August 28, 2016, ayan guha <guha.a...@gmail.com> wrote: > Best to create alias and place in your bashrc > On 29 Aug 2016 08:30, "Russell Jurney" <russell.jur...@gmail.com>
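The alias suggestion from that reply might look like the following in `~/.bashrc`. The package coordinates and conf value are placeholders (the spark-csv coordinates appear elsewhere in this archive); substitute your own:

```shell
# Illustrative ~/.bashrc entry wrapping a long pyspark invocation.
alias mypyspark='pyspark --packages com.databricks:spark-csv_2.11:1.3.0 \
  --conf spark.cassandra.connection.host=localhost'
```

After `source ~/.bashrc`, running `mypyspark` launches the shell with everything attached.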

Automating lengthy command to pyspark with configuration?

2016-08-28 Thread Russell Jurney
additions to pyspark? Thanks! -- Russell Jurney

Re: --packages configuration equivalent item name?

2016-04-05 Thread Russell Jurney
>>> com.databricks:spark-csv_2.11:1.3.0,datastax:spark-cassandra-connector:1.6.0-M1-s_2.10" >>> >>> >>> export PYSPARK_PYTHON=python3 >>> >>> export PYSPARK_DRIVER_PYTHON=python3 >>> >>> IPYTHON_OPTS=notebook $SPARK_RO

Re: --packages configuration equivalent item name?

2016-04-02 Thread Russell Jurney
xport PYSPARK_DRIVER_PYTHON=python3 > > IPYTHON_OPTS=notebook $SPARK_ROOT/bin/pyspark $extraPkgs --conf > spark.cassandra.connection.host= > ec2-54-153-102-232.us-west-1.compute.amazonaws.com $* > > > > From: Russell Jurney <russell.jur...@gmail.com> > Date: Sunday, M

What is the most efficient way to do a sorted reduce in PySpark?

2016-04-02 Thread Russell Jurney
://stackoverflow.com/questions/36376369/what-is-the-most-efficient-way-to-do-a-sorted-reduce-in-pyspark Gist: https://gist.github.com/rjurney/af27f70c76dc6c6ae05c465271331ade -- Russell Jurney

Re: PySpark saving to MongoDB: expected zero arguments for construction of ClassDict (for pyspark.sql.types._create_row)

2016-03-30 Thread Russell Jurney
Actually, I can imagine a one or two line fix for this bug: call row.asDict() inside a wrapper for DataFrame.rdd. Probably deluding myself this could be so easily resolved? :) On Wed, Mar 30, 2016 at 6:10 PM, Russell Jurney <russell.jur...@gmail.com> wrote: > Thanks to some excel

Re: PySpark saving to MongoDB: expected zero arguments for construction of ClassDict (for pyspark.sql.types._create_row)

2016-03-30 Thread Russell Jurney
to a database is a pretty common thing to do from PySpark, and lots of analysis must be happening in DataFrames in PySpark? Anyway, the workaround for this bug is easy, cast the rows as dicts: my_dataframe = my_dataframe.map(lambda row: row.asDict()) On Mon, Mar 28, 2016 at 8:08 PM, Russell Jurney

Re: PySpark saving to MongoDB: expected zero arguments for construction of ClassDict (for pyspark.sql.types._create_row)

2016-03-28 Thread Russell Jurney
btw, they can't be saved to BSON either. This seems a generic issue, can anyone else reproduce this? On Mon, Mar 28, 2016 at 8:02 PM, Russell Jurney <russell.jur...@gmail.com> wrote: > I created a JIRA: https://issues.apache.org/jira/browse/SPARK-14229 > > On Mon, Mar 28,

Re: PySpark saving to MongoDB: expected zero arguments for construction of ClassDict (for pyspark.sql.types._create_row)

2016-03-28 Thread Russell Jurney
I created a JIRA: https://issues.apache.org/jira/browse/SPARK-14229 On Mon, Mar 28, 2016 at 7:43 PM, Russell Jurney <russell.jur...@gmail.com> wrote: > Ted, I am using the .rdd method, see above, but for some reason these RDDs > can't be saved to MongoDB or ElasticSearch.

Re: PySpark saving to MongoDB: expected zero arguments for construction of ClassDict (for pyspark.sql.types._create_row)

2016-03-28 Thread Russell Jurney
DD[T] = { > > On Mon, Mar 28, 2016 at 6:30 PM, Russell Jurney <russell.jur...@gmail.com> > wrote: > >> Ok, I'm also unable to save to Elasticsearch using a dataframe's RDD. >> This seems related to DataFrames. Is there a way to convert a DataFrame's >> RDD to a 'n

Re: PySpark saving to MongoDB: expected zero arguments for construction of ClassDict (for pyspark.sql.types._create_row)

2016-03-28 Thread Russell Jurney
Ok, I'm also unable to save to Elasticsearch using a dataframe's RDD. This seems related to DataFrames. Is there a way to convert a DataFrame's RDD to a 'normal' RDD? On Mon, Mar 28, 2016 at 6:20 PM, Russell Jurney <russell.jur...@gmail.com> wrote: > I filed a JIRA <https://jira

PySpark saving to MongoDB: expected zero arguments for construction of ClassDict (for pyspark.sql.types._create_row)

2016-03-28 Thread Russell Jurney
urrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ... 1 more -- Russell Jurney

Re: DataFrame --> JSON objects, instead of un-named array of fields

2016-03-28 Thread Russell Jurney
To answer my own question, DataFrame.toJSON() does this, so there is no need to map and json.dump(): on_time_dataframe.toJSON().saveAsTextFile('../data/On_Time_On_Time_Performance_2015.jsonl') Thanks! On Mon, Mar 28, 2016 at 12:54 PM, Russell Jurney <russell.jur...@gmail.com>

DataFrame --> JSON objects, instead of un-named array of fields

2016-03-28 Thread Russell Jurney
l, 0, null, null, null, null, "", null, null, null, null, null, null, "", "", null, null, null, null, null, null, "", "", null, null, null, null, null, "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""] What I actually want is JSON objects, with a field name for each field: {"year": "2015", "month": 1, ...} How can I achieve this in PySpark? Thanks! -- Russell Jurney

--packages configuration equivalent item name?

2016-03-27 Thread Russell Jurney
If there is no way to do this, please let me know so I can make a JIRA for this feature. Thanks! -- Russell Jurney

Re: Spark JDBC connection - data writing success or failure cases

2016-02-19 Thread Russell Jurney
-- Russell Jurney

Re: GraphX can show graph?

2016-01-29 Thread Russell Jurney
ssible >>> to create and show graph (for visualization purpose) using GraphX. Any >>> pointer to tutorial or information connected to this will be really helpful >>> >>> Thanks and regards >>> Bala >>> >> >> > -- Russell Jurney

Re: processing large dataset

2015-01-22 Thread Russell Jurney
-- Russell Jurney

PySpark joins fail - please help

2014-10-17 Thread Russell Jurney
https://gist.github.com/rjurney/fd5c0110fe7eb686afc9 Any way I try to join my data fails. I can't figure out what I'm doing wrong. -- Russell Jurney

Re: PySpark joins fail - please help

2014-10-17 Thread Russell Jurney
://spark.apache.org/docs/1.1.0/sql-programming-guide.html Davies On Fri, Oct 17, 2014 at 5:01 PM, Russell Jurney russell.jur...@gmail.com wrote: https://gist.github.com/rjurney/fd5c0110fe7eb686afc9 Any way I try to join my data fails. I can't figure out what I'm doing wrong. -- Russell Jurney

Re: PySpark joins fail - please help

2014-10-17 Thread Russell Jurney
There was a bug in the devices line: dh.index('id') should have been x[dh.index('id')]. On Fri, Oct 17, 2014 at 5:52 PM, Russell Jurney russell.jur...@gmail.com wrote: Is that not exactly what I've done in j3/j4? The keys are identical strings. The k is the same, the value in both instances
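The one-character fix described above is the classic header-index mistake: keying every record by the column's *position* instead of the *value* at that position, so every row gets the same key and the join degenerates. A minimal sketch (field names and values are guesses for illustration; `dh` is the header list from the thread):

```python
# dh is the header row; x is one data record.
dh = ["id", "model", "firmware"]
x = ["device-42", "acme-9", "1.0.3"]

# The bug: this is the column's position, so EVERY record is keyed by 0
# and the join collapses onto a single key.
buggy_key = dh.index("id")

# The fix: index into the record with that position to get the join key.
fixed_key = x[dh.index("id")]
```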

Re: Update on Pig on Spark initiative

2014-08-28 Thread Russell Jurney
/mayur_rustagi

Re: hadoopRDD stalls reading entire directory

2014-06-02 Thread Russell Jurney
resource allocations they have. On Sun, Jun 1, 2014 at 6:47 PM, Russell Jurney russell.jur...@gmail.com wrote: Thanks again. Run results here: https://gist.github.com/rjurney/dc0efae486ba7d55b7d5 This time I get a port already in use exception on 4040, but it isn't fatal. Then when I run

Re: hadoopRDD stalls reading entire directory

2014-06-02 Thread Russell Jurney
:09 AM, Russell Jurney russell.jur...@gmail.com wrote: Looks like just worker and master processes are running: [hivedata@hivecluster2 ~]$ jps 10425 Jps [hivedata@hivecluster2 ~]$ ps aux|grep spark hivedata 10424 0.0 0.0 103248 820 pts/3S+ 10:05 0:00 grep spark root 10918

Re: hadoopRDD stalls reading entire directory

2014-06-02 Thread Russell Jurney
does hivecluster2:8080 look like? My guess is it says there are 2 applications registered, and one has taken all the executors. There must be two applications running, as those are the only things that keep open those 4040/4041 ports. On Mon, Jun 2, 2014 at 11:32 AM, Russell Jurney russell.jur

Re: hadoopRDD stalls reading entire directory

2014-06-01 Thread Russell Jurney
:37 PM, Russell Jurney russell.jur...@gmail.com wrote: Now I get this: scala rdd.first 14/05/31 21:36:28 INFO spark.SparkContext: Starting job: first at console:41 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Got job 4 (first at console:41) with 1 output partitions (allowLocal=true) 14

Re: hadoopRDD stalls reading entire directory

2014-06-01 Thread Russell Jurney
(avro.jar, ...) val sc = new SparkContext(conf) On Sun, Jun 1, 2014 at 2:32 PM, Russell Jurney russell.jur...@gmail.com wrote: Followup question: the docs to make a new SparkContext require that I know where $SPARK_HOME is. However, I have no idea. Any idea where that might be? On Sun, Jun 1

hadoopRDD stalls reading entire directory

2014-05-31 Thread Russell Jurney
... And never finishes. What should I do? -- Russell Jurney