Spark 3.5.x on Java 21?

2024-05-08 Thread Stephen Coy
Hi everyone, We’re about to upgrade our Spark clusters from Java 11 and Spark 3.2.1 to Spark 3.5.1. I know that 3.5.1 is supposed to be fine on Java 17, but will it run OK on Java 21? Thanks, Steve C

Re: [spark-graphframes]: Generating incorrect edges

2024-04-30 Thread Stephen Coy
Hi Mich, I was just reading random questions on the user list when I noticed that you said: On 25 Apr 2024, at 2:12 AM, Mich Talebzadeh wrote: 1) You are using monotonically_increasing_id(), which is not collision-resistant in distributed environments like Spark. Multiple hosts can
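
A minimal Scala sketch of the distinction under discussion (data and names are illustrative, not from the thread):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{monotonically_increasing_id, row_number}

    val spark = SparkSession.builder().master("local[*]").appName("ids").getOrCreate()
    import spark.implicits._

    val df = Seq("a", "b", "c").toDF("v")

    // monotonically_increasing_id() packs the partition id into the upper bits:
    // values are unique within one job but neither contiguous nor stable across
    // runs or repartitions, so they make poor join keys between DataFrames.
    df.withColumn("id", monotonically_increasing_id()).show()

    // A deterministic alternative: row_number() over an explicit ordering.
    // Caveat: without partitionBy, Spark pulls every row into one partition.
    df.withColumn("id", row_number().over(Window.orderBy($"v"))).show()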

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Stephen Coy
Hi Patrick, When this has happened to me in the past (admittedly via spark-submit) it has been because another job was still running and had already claimed some of the resources (cores and memory). I think this can also happen if your configuration tries to claim resources that will never be

Re: Dataproc serverless for Spark

2022-11-21 Thread Stephen Boesch
features. Are there specific limitations you are aware of / run into? stephen b On Mon, 21 Nov 2022 at 09:01, Mich Talebzadeh wrote: > Hi, > > I have not tested this myself but Google have brought up *Dataproc Serverless for Spark*. In a nutshell Dataproc Serverless lets you run

Re: Spark Scala Contract Opportunity @USA

2022-11-10 Thread Stephen Boesch
Please do not send advertisements on this channel. On Thu, 10 Nov 2022 at 13:40, sri hari kali charan Tummala <kali.tumm...@gmail.com> wrote: > Hi All, > > Is anyone looking for a spark scala contract role inside the USA? A > company called Maxonic has an open spark scala contract position

Re: [Building] Building with JDK11

2022-07-18 Thread Stephen Coy
of Apache Maven. Cheers, Steve C On 18 Jul 2022, at 4:12 pm, Sergey B. <sergey.bushma...@gmail.com> wrote: Hi Steve, Can you shed some light on why they need $JAVA_HOME at all if everything is already in place? Regards, - Sergey On Mon, Jul 18, 2022 at 4:31 AM Stephen Coy

Re: [Building] Building with JDK11

2022-07-17 Thread Stephen Coy
Hi Szymon, There seems to be a common misconception that setting JAVA_HOME will set the version of Java that is used. This is not true, because in most environments you need to have a PATH environment variable set up that points at the version of Java that you want to use. You can set

Re: Retrieve the count of spark nodes

2022-06-08 Thread Stephen Coy
Hi there, We use something like: /* * Force Spark to initialise the defaultParallelism by executing a dummy parallel operation and then return * the resulting defaultParallelism. */ private int getWorkerCount(SparkContext sparkContext) { sparkContext.parallelize(List.of(1, 2, 3,
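
The snippet above is truncated by the archive; a Scala sketch reconstructing its apparent intent (the exact body is an assumption):

    import org.apache.spark.SparkContext

    // Run a throwaway parallel job so the scheduler finishes registering
    // executors, then read back the resulting defaultParallelism.
    def getWorkerCount(sc: SparkContext): Int = {
      sc.parallelize(Seq(1, 2, 3, 4)).count() // dummy action; result discarded
      sc.defaultParallelism
    }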

Re: Using Avro file format with SparkSQL

2022-02-14 Thread Stephen Coy
Hi Morven, We use --packages for all of our spark jobs. Spark downloads the specified jar and all of its dependencies from a Maven repository. This means we never have to build fat or uber jars. It does mean that the Apache Ivy configuration has to be set up correctly though. Cheers, Steve C
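
For reference, a sketch of the programmatic equivalent of spark-submit --packages (the artifact coordinate is illustrative); Spark resolves it and its transitive dependencies through Ivy at startup, so no fat jar is needed:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("packages-example")
      .config("spark.jars.packages", "org.apache.spark:spark-avro_2.12:3.5.1")
      .getOrCreate()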

Re: Migration to Spark 3.2

2022-01-27 Thread Stephen Coy
Mazoyer <aurel...@aepsilon.com> wrote: Hi Stephen, Thank you for your answer! Here it is, it seems that jackson dependencies are correct, no? : Thanks, [INFO] com.krrier:spark-lib-full:jar:0.0.1-SNAPSHOT [INFO] +- com.krrier:backend:jar:0.0.1-SNAPSHOT:compile

Re: Migration to Spark 3.2

2022-01-26 Thread Stephen Coy
Hi Aurélien! Please run mvn dependency:tree and check it for Jackson dependencies. Feel free to respond with the output if you have any questions about it. Cheers, Steve C > On 22 Jan 2022, at 10:49 am, Aurélien Mazoyer wrote: > > Hello, > > I migrated my code to Spark 3.2 and I am

Re: Log4J 2 Support

2021-11-09 Thread Stephen Coy
as do other libs, and that isn't what the shims cover. Could be possible now or with more cleverness but the simple thing didn't work out IIRC. On Tue, Nov 9, 2021, 4:32 PM Stephen Coy <s...@infomedia.com.au> wrote: Hi there, It’s true that the preponderance of log4j 1.2.x in many e

Re: Log4J 2 Support

2021-11-09 Thread Stephen Coy
Hi there, It’s true that the preponderance of log4j 1.2.x in many existing live projects is kind of a pain in the butt. But there is a solution. 1. Migrate all Spark code to use slf4j APIs; 2. Exclude log4j 1.2.x from any dependencies sucking it in; 3. Include the log4j-over-slf4j bridge jar

Re: Missing module spark-hadoop-cloud in Maven central

2021-06-01 Thread Stephen Coy
I have been building Apache Spark from source just so I can get this dependency. 1. git checkout v3.1.1 2. dev/make-distribution.sh --name hadoop-cloud-3.2 --tgz -Pyarn -Phadoop-3.2 -Pyarn -Phadoop-cloud -Phive-thriftserver -Dhadoop.version=3.2.0 It is kind of a nuisance having to do

Re: pyspark sql load with path of special character

2021-04-25 Thread Stephen Coy
It probably does not like the colons in the path name “…20:04:27+00:00/…”, especially if you’re running on a Windows box. On 24 Apr 2021, at 1:29 am, Regin Quinoa <sweatr...@gmail.com> wrote: Hi, I am using pyspark sql to load files into table following ```LOAD DATA LOCAL INPATH

Re: How to make bucket listing faster while using S3 with wholeTextFile

2021-03-15 Thread Stephen Coy
Hi there, At risk of stating the obvious, the first step is to ensure that your Spark application and S3 bucket are colocated in the same AWS region. Steve C On 16 Mar 2021, at 3:31 am, Alchemist <alchemistsrivast...@gmail.com> wrote: How to optimize s3 list S3 file using

Re: Scala vs Python for ETL with Spark

2020-10-10 Thread Stephen Boesch
I agree with Wim's assessment of data engineering / ETL vs Data Science. I wrote pipelines/frameworks for large companies and scala was a much better choice. But for ad-hoc work interfacing directly with data science experiments pyspark presents less friction. On Sat, 10 Oct 2020 at 13:03, Mich

Re: Unsubscribe

2020-08-26 Thread Stephen Coy
The instructions for all Apache mail lists are in the mail headers: List-Unsubscribe: On 27 Aug 2020, at 7:49 am, Jeff Evans <jeffrey.wayne.ev...@gmail.com> wrote: That is not how you unsubscribe. See here for instructions:

Re: S3 read/write from PySpark

2020-08-11 Thread Stephen Coy
:238) at java.base/java.lang.Thread.run(Thread.java:834) On Thu, 6 Aug 2020 at 17:19, Stephen Coy <s...@infomedia.com.au> wrote: Hi Daniel, It looks like …BasicAWSCredentialsProvider has become org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider. However, the way tha

Re: S3 read/write from PySpark

2020-08-06 Thread Stephen Coy
Hi Daniel, It looks like …BasicAWSCredentialsProvider has become org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider. However, the way that the username and password are provided appears to have changed so you will probably need to look in to that. Cheers, Steve C On 6 Aug 2020, at 11:15

Re: Tab delimited csv import and empty columns

2020-08-05 Thread Stephen Coy
Because its default is the empty string, empty strings become null by default. On Fri, Jul 31, 2020 at 3:20 AM Stephen Coy <s...@infomedia.com.au.invalid> wrote: That does not work. This is Spark 3.0 by the way. I have been looking at the Spark unit tests and there does not seem to be

Re: Tab delimited csv import and empty columns

2020-07-31 Thread Stephen Coy
") Hope it helps On Thu, 30 Jul 2020 at 08:49, Stephen Coy <s...@infomedia.com.au.invalid> wrote: Hi there, I’m trying to import a tab delimited file with: Dataset catalogData = sparkSession .read() .option("sep", "\t") .option("header", "

Tab delimited csv import and empty columns

2020-07-30 Thread Stephen Coy
Hi there, I’m trying to import a tab delimited file with: Dataset catalogData = sparkSession .read() .option("sep", "\t") .option("header", "true") .csv(args[0]) .cache(); This works great, except for the fact that any column that is empty is given the value null, when I need these
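
Two commonly suggested workarounds for this thread's problem, as an untested sketch (the file name is illustrative). The reader's nullValue option defaults to the empty string, which is why empty fields surface as null:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").appName("tsv").getOrCreate()

    val catalogData = spark.read
      .option("sep", "\t")
      .option("header", "true")
      .option("nullValue", "\u0000") // a token that never occurs in the data
      .csv("catalog.tsv")

    // Or keep the defaults and backfill the string columns afterwards:
    val filled = catalogData.na.fill("")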

Re: Kotlin Spark API

2020-07-14 Thread Stephen Boesch
{ println(it) } } } So that shows some of the niceness of kotlin: intuitive type conversion `to`/`to` and `dsOf( list)`- and also the inlining of the side effects. Overall concise and pleasant to read. On Tue, 14 Jul 2020 at 12:18, Stephen Boesch wrote: > I started with scala/spark in

Re: Kotlin Spark API

2020-07-14 Thread Stephen Boesch
I started with scala/spark in 2012 and scala has been my go-to language for six years. But I heartily applaud this direction. Kotlin is more like a simplified Scala - with the benefits that brings - than a simplified java. I particularly like the simplified / streamlined collections classes.

When does SparkContext.defaultParallelism have the correct value?

2020-07-06 Thread Stephen Coy
Hi there, I have found that if I invoke sparkContext.defaultParallelism() too early it will not return the correct value; For example, if I write this: final JavaSparkContext sparkContext = new JavaSparkContext(sparkSession.sparkContext()); final int workerCount =

Re: java.lang.ClassNotFoundException for s3a comitter

2020-07-06 Thread Stephen Coy
Hi Steve, While I understand your point regarding the mixing of Hadoop jars, this does not address the java.lang.ClassNotFoundException. Prebuilt Apache Spark 3.0 builds are only available for Hadoop 2.7 or Hadoop 3.2. Not Hadoop 3.1. The only place that I have found that missing class is in

Re: RDD-like API for entirely local workflows?

2020-07-04 Thread Stephen Boesch
Spark in local mode (which is different than standalone) is a solution for many use cases. I use it in conjunction with (and sometimes instead of) pandas/pandasql due to its much wider ETL related capabilities. On the JVM side it is an even more obvious choice - given there is no equivalent to

unsubscribe

2020-06-28 Thread stephen

Re: Hey good looking toPandas ()

2020-06-19 Thread Stephen Boesch
afaik It has been there since Spark 2.0 in 2015. Not certain about Spark 1.5/1.6 On Thu, 18 Jun 2020 at 23:56, Anwar AliKhan wrote: > I first ran the command > df.show() > > For sanity check of my dataFrame. > > I wasn't impressed with the display. > > I then ran > df.toPandas() in Jupiter

Re: java.lang.ClassNotFoundException for s3a comitter

2020-06-18 Thread Stephen Coy
Hi Murat Migdisoglu, Unfortunately you need the secret sauce to resolve this. It is necessary to check out the Apache Spark source code and build it with the right command line options. This is what I have been using: dev/make-distribution.sh --name my-spark --tgz -Pyarn -Phadoop-3.2 -Pyarn

Re: Modularising Spark/Scala program

2020-05-02 Thread Stephen Boesch
the predicates are typically sql's. On Sat, 2 May 2020 at 06:13, Stephen Boesch wrote: > Hi Mich! > I think you can combine the good/rejected into one method that > internally: > > - Create good/rejected df's given an input df and input > rules/predicates to apply to the

Re: Modularising Spark/Scala program

2020-05-02 Thread Stephen Boesch
Hi Mich! I think you can combine the good/rejected into one method that internally: - Create good/rejected df's given an input df and input rules/predicates to apply to the df. - Create a third df containing the good rows and the rejected rows with the bad columns nulled out -
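
A minimal sketch of the combined method being proposed (the signature is an assumption):

    import org.apache.spark.sql.{Column, DataFrame}

    // One input DataFrame plus one predicate in; both subsets out.
    def splitByRule(df: DataFrame, rule: Column): (DataFrame, DataFrame) =
      (df.filter(rule), df.filter(!rule))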

Re: Going it alone.

2020-04-16 Thread Stephen Boesch
The warning signs were there from the first email sent from that person. I wonder if there is any way to deal with this more proactively. On Thu, 16 Apr 2020 at 10:54, Mich Talebzadeh <mich.talebza...@gmail.com> wrote: > good for you. right move > > Dr Mich Talebzadeh > > > > LinkedIn * >

Re: IDE suitable for Spark

2020-04-07 Thread Stephen Boesch
I have been using Idea for both scala/spark and pyspark projects since 2013. It required a fair amount of fiddling that first year but has been stable since early 2015. For pyspark projects only, Pycharm naturally also works very well. On Tue, 7 Apr 2020 at 09:10, yeikel valdes wrote: > >

Re: [PySpark] How to write HFiles as an 'append' to the same directory?

2020-03-16 Thread Stephen Coy
I encountered a similar problem when trying to: ds.write().save(“s3a://some-bucket/some/path/table”); which writes the content as a bunch of parquet files in the “folder” named “table”. I am using a Flintrock cluster with the Spark 3.0 preview FWIW. Anyway, I just used the AWS SDK to remove

Re: FYI: The evolution on `CHAR` type behavior

2020-03-16 Thread Stephen Coy
Hi there, I’m kind of new around here, but I have had experience with all of the so-called “big iron” databases such as Oracle, IBM DB2 and Microsoft SQL Server as well as Postgresql. They all support the notion of “ANSI padding” for CHAR columns - which means that such columns are always

Re: script running in jupyter 6-7x faster than spark submit

2019-09-10 Thread Stephen Boesch
same code. Why does running them two different ways vary so much in the > execution time? > > > > > *Regards, Dhrubajyoti Hati. Mob No: 9886428028/9652029028* > > > On Wed, Sep 11, 2019 at 8:42 AM Stephen Boesch wrote: > >> Sounds like you have done your homework to

Re: script running in jupyter 6-7x faster than spark submit

2019-09-10 Thread Stephen Boesch
Sounds like you have done your homework to properly compare. I'm guessing the answer to the following is yes, but in any case: are they both running against the same spark cluster with the same configuration parameters, especially executor memory and number of workers? On Tue, 10 Sept 2019

Spark on YARN with private Docker repositories/registries

2019-08-16 Thread Tak-Lon (Stephen) Wu
it load from it by default? Thanks, Stephen

Re: Incremental (online) machine learning algorithms on ML

2019-08-05 Thread Stephen Boesch
There are several high bars to getting a new algorithm adopted. * It needs to be deemed by the MLLib committers/shepherds as widely useful to the community. Algorithms offered by larger companies after having demonstrated usefulness at scale for use cases likely to be encountered by many

How to execute non-timestamp-based aggregations in spark structured streaming?

2019-04-20 Thread Stephen Boesch
Consider the following *intended* sql: select row_number() over (partition by Origin order by OnTimeDepPct desc) OnTimeDepRank, * from flights This will *not* work in *structured streaming*: the culprit is "partition by Origin". The requirement is to use a timestamp-typed field such as
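
For contrast, a sketch of the shape structured streaming does accept - an aggregation anchored to event time - assuming a parsed streaming DataFrame with an eventTime column (all names are assumptions):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, max, window}

    def bestByOrigin(flights: DataFrame): DataFrame =
      flights
        .withWatermark("eventTime", "1 hour")
        .groupBy(window(col("eventTime"), "1 hour"), col("Origin"))
        .agg(max(col("OnTimeDepPct")).as("bestOnTimeDepPct"))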

Re: spark-sklearn

2019-04-08 Thread Stephen Boesch
There are several suggestions on this SOF https://stackoverflow.com/questions/38984775/spark-errorexpected-zero-arguments-for-construction-of-classdict-for-numpy-cor 1 You need to convert the final value to a python list. You implement the function as follows: def uniq_array(col_array): x =

Re: Build spark source code with scala 2.11

2019-03-12 Thread Stephen Boesch
You might have better luck downloading the 2.4.X branch. On Tue, 12 Mar 2019 at 16:39, swastik mittal wrote: > Then is the mllib of spark compatible with scala 2.12? Or can I change the > spark version from spark3.0 to 2.3 or 2.4 in local spark/master? > > > > -- > Sent from:

Re: Build spark source code with scala 2.11

2019-03-12 Thread Stephen Boesch
I think scala 2.11 support was removed with the spark3.0/master. On Tue, 12 Mar 2019 at 16:26, swastik mittal wrote: > I am trying to build my spark using build/sbt package, after changing the > scala versions to 2.11 in pom.xml because my applications jar files use > scala 2.11. But

Re: Classic logistic regression missing !!! (Generalized linear models)

2018-10-11 Thread Stephen Boesch
So the LogisticRegression with regParam and elasticNetParam set to 0 is not what you are looking for? https://spark.apache.org/docs/2.3.0/ml-classification-regression.html#logistic-regression .setRegParam(0.0) .setElasticNetParam(0.0) On Thu, 11 Oct 2018 at 15:46, pikufolgado
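
The reply's suggestion spelled out as a sketch (`training` is an assumed DataFrame with the usual label/features columns):

    import org.apache.spark.ml.classification.LogisticRegression

    // Zeroing both parameters yields plain, unregularised logistic regression.
    val lr = new LogisticRegression()
      .setRegParam(0.0)
      .setElasticNetParam(0.0)
    // val model = lr.fit(training)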

Fixing NullType for parquet files

2018-09-12 Thread Stephen Boesch
https://issues.apache.org/jira/browse/SPARK-10943?focusedCommentId=16462797#comment-16462797 Stephen Boesch (javadba) added a comment - 03/May/18 17:08

Re: [announce] BeakerX supports Scala+Spark in Jupyter

2018-06-07 Thread Stephen Boesch
Assuming that the spark 2.X kernel (e.g. toree) were chosen for a given jupyter notebook and there is a Cell 3 that contains some Spark DataFrame operations .. Then: - what is the relationship between the %%spark magic and the toree kernel? - how does the %%spark magic get applied to that

Re: Guava dependency issue

2018-05-08 Thread Stephen Boesch
(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:6 2018-05-07 10:30 GMT-07:00 Stephen Boesch <java...@gmail.com>: > I am intermittently running into guava dependency issues across multiple > spark projects. I have tried

Guava dependency issue

2018-05-07 Thread Stephen Boesch
I am intermittently running into guava dependency issues across multiple spark projects. I have tried maven shade / relocate but it does not resolve the issues. The current project is extremely simple: *no* additional dependencies beyond scala, spark, and scalatest - yet the issues remain (and

Re: [Spark 2.x Core] .collect() size limit

2018-04-28 Thread Stephen Boesch
> On Sat, 28 Apr 2018, 21:19 Stephen Boesch, <java...@gmail.com> wrote: > >> Do you have a machine with terabytes of RAM? afaik collect() requires >> RAM - so that would be your limiting factor. >> >> 2018-04-28 8:41 GMT-07:00 klrmowse <klrmo...@gmail

Re: [Spark 2.x Core] .collect() size limit

2018-04-28 Thread Stephen Boesch
Do you have a machine with terabytes of RAM? afaik collect() requires RAM - so that would be your limiting factor. 2018-04-28 8:41 GMT-07:00 klrmowse : > i am currently trying to find a workaround for the Spark application i am > working on so that it does not have to use

Re: parquet vs orc files

2018-02-21 Thread Stephen Joung
In the case of parquet, the best source for me to configure and to ensure "min/max statistics" was https://www.slideshare.net/mobile/RyanBlue3/parquet-performance-tuning-the-missing-guide --- I don't have any experience with orc. On Thu, 22 Feb 2018 at 6:59 AM, Kane Kim wrote: >

Spark 2.2.1 EMR 5.11.1 Encrypted S3 bucket overwriting parquet file

2018-02-13 Thread Stephen Robinson
the error. I believe this to be an EMR error but mentioning it here just in case anyone else has seen this or if it might be a spark bug. Thanks, Steve Stephen Robinson steve.robin...@aquilainsight.com +441312902300

Re: write parquet with statistics min max with binary field

2018-01-28 Thread Stephen Joung
For reference, this was the intended behaviour per PARQUET-686 [1]. [1] https://www.mail-archive.com/commits@parquet.apache.org/msg00491.html 2018-01-24 10:31 GMT+09:00 Stephen Joung <step...@vcnc.co.kr>: > How can I write a parquet file with min/max statistics? > > 2018-01-24

Re: write parquet with statistics min max with binary field

2018-01-23 Thread Stephen Joung
How can I write a parquet file with min/max statistics? 2018-01-24 10:30 GMT+09:00 Stephen Joung <step...@vcnc.co.kr>: > Hi, I am trying to use spark sql filter push down. and specially want to > use row group skipping with parquet file. > > And I guessed that I need parquet fi

write parquet with statistics min max with binary field

2018-01-23 Thread Stephen Joung
arquet file f1 scala> List("a", "b", "c").toDF("field1").coalesce(1).write.parquet("f1") But saved file does not have statistics (min, max) $ ls f1/*.parquet f1/part-0-445036f9-7a40-4333-8405-8451faa44319-c000.snappy.parquet $
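
A sketch reproducing the experiment and checking pushdown from the Spark side. Whether row groups are actually skipped depends on the footer min/max statistics, which the Parquet version under discussion deliberately omits for binary (string) columns; the pushed filter shows up in the plan either way:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").appName("pq").getOrCreate()
    import spark.implicits._

    List("a", "b", "c").toDF("field1").coalesce(1).write.mode("overwrite").parquet("f1")
    spark.read.parquet("f1").filter($"field1" === "b").explain()
    // Look for PushedFilters: [IsNotNull(field1), EqualTo(field1,b)] in the output.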

Has there been any explanation on the performance degradation between spark.ml and Mllib?

2018-01-21 Thread Stephen Boesch
While MLLib performed favorably vs Flink, it *also* performed favorably vs spark.ml - and by an *order of magnitude*. The following is one of the tables - it is for Logistic Regression. At that time spark.ml did not yet support SVM. From: https://bdataanalytics.biomedcentral.com/articles/10.

Re: Anyone know where to find independent contractors in New York?

2017-12-21 Thread Stephen Boesch
Hi Richard, this is not a jobs board: please only discuss spark application development issues. 2017-12-21 8:34 GMT-08:00 Richard L. Burton III : > I'm trying to locate four independent contractors who have experience with > Spark. I'm not sure where I can go to find

Re: LDA and evaluating topic number

2017-12-07 Thread Stephen Boesch
I have been testing on the 20 NewsGroups dataset - which the Spark docs themselves reference. I can confirm that perplexity increases and likelihood decreases as topics increase - and am similarly confused by these results. 2017-09-28 10:50 GMT-07:00 Cody Buntain : > Hi,

Weight column values not used in Binary Logistic Regression Summary

2017-11-18 Thread Stephen Boesch
In BinaryLogisticRegressionSummary there are @Since("1.5.0") tags on a number of comments identical to the following: * @note This ignores instance weights (setting all to 1.0) from `LogisticRegression.weightCol`. * This will change in later Spark versions. Are there any plans to address this?

Re: Spark streaming for CEP

2017-10-24 Thread Stephen Boesch
Hi Mich, the github link has a brief intro - including a link to the formal docs http://logisland.readthedocs.io/en/latest/index.html . They have an architectural overview, developer guide, tutorial, and pretty comprehensive api docs. 2017-10-24 13:31 GMT-07:00 Mich Talebzadeh

Re: Is there a difference between df.cache() vs df.rdd.cache()

2017-10-13 Thread Stephen Boesch
@Vadim Would it be true to say the `.rdd` *may* be creating a new job - depending on whether the DataFrame/DataSet had already been materialized via an action or checkpoint? If the only prior operations on the DataFrame had been transformations then the dataframe would still not have been

Re: Running spark examples in Intellij

2017-10-11 Thread Stephen Boesch
repo - The local maven repo is included by default - so should not need to do anything special there The same errors from the original post continue to occur. 2017-10-11 20:05 GMT-07:00 Stephen Boesch <java...@gmail.com>: > A clarification here: the example is being run *from

Re: Running spark examples in Intellij

2017-10-11 Thread Stephen Boesch
n install and > define your local maven repo in SBT? > > -Paul > > Sent from my iPhone > > On Oct 11, 2017, at 5:48 PM, Stephen Boesch <java...@gmail.com> wrote: > > When attempting to run any example program w/ Intellij I am running into > guava ver

Running spark examples in Intellij

2017-10-11 Thread Stephen Boesch
When attempting to run any example program w/ Intellij I am running into guava versioning issues: Exception in thread "main" java.lang.NoClassDefFoundError: com/google/common/cache/CacheLoader at org.apache.spark.SparkConf.loadFromSystemProperties(SparkConf.scala:73) at

Re: SQL specific documentation for recent Spark releases

2017-08-10 Thread Stephen Boesch
Cheers > Jules > > Sent from my iPhone > Pardon the dumb thumb typos :) > > > On Aug 10, 2017, at 1:46 PM, Stephen Boesch <java...@gmail.com> wrote: > > > > > > While the DataFrame/DataSets are useful in many circumstances they are > cumbersome for many ty

SQL specific documentation for recent Spark releases

2017-08-10 Thread Stephen Boesch
While the DataFrame/DataSets are useful in many circumstances they are cumbersome for many types of complex sql queries. Is there an up-to-date *SQL* reference - i.e. not DataFrame DSL operations - for version 2.2? An example of what is not clear: what constructs are supported within

custom joins on dataframe

2017-07-22 Thread Stephen Fletcher
Normally a family of joins (left, right outer, inner) is performed on two dataframes using columns for the comparison, i.e. left("acol") === right("acol"). The comparison operator of the "left" dataframe does something internally and produces a column that I assume is used by the join. What I want
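
The join condition is just a boolean Column, so custom comparisons can stand in for the usual equality. A sketch (the DataFrames and column names are assumptions):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{levenshtein, udf}

    def joinVariants(left: DataFrame, right: DataFrame): Unit = {
      // The standard equality join described above.
      val standard = left.join(right, left("acol") === right("acol"))

      // A built-in expression: approximate string match.
      val fuzzy = left.join(right, levenshtein(left("acol"), right("acol")) < 3)

      // Arbitrary logic via a UDF returning Boolean.
      val near = udf((a: Double, b: Double) => math.abs(a - b) < 0.01)
      val custom = left.join(right, near(left("num"), right("num")))
    }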

Re: What is the equivalent of mapPartitions in SpqrkSQL?

2017-06-25 Thread Stephen Boesch
Spark SQL did not support explicit partitioners even before tungsten, and often enough this did hurt performance. Even now Tungsten will not do the best job every time, so the question from the OP is still germane. 2017-06-25 19:18 GMT-07:00 Ryan : > Why would you like to
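
For what it's worth, Dataset does carry mapPartitions over from the RDD API; the result encoder comes from the implicits in scope. A minimal sketch:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").appName("mp").getOrCreate()
    import spark.implicits._

    val ds = Seq(1, 2, 3, 4).toDS()
    val perPartitionSums = ds.mapPartitions { rows =>
      // Typical use: amortise expensive per-partition setup (connections, parsers).
      Iterator.single(rows.sum)
    }
    perPartitionSums.show()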

Re: Using SparkContext in Executors

2017-05-28 Thread Stephen Boesch
You would need to use *native* Cassandra API's in each Executor - not org.apache.spark.sql.cassandra.CassandraSQLContext - including to create a separate Cassandra connection on each Executor. 2017-05-28 15:47 GMT-07:00 Abdulfattah Safa : > So I can't run SQL queries in

Re: Jupyter spark Scala notebooks

2017-05-17 Thread Stephen Boesch
Jupyter with toree works well for my team. Jupyter is considerably more refined vs zeppelin as far as notebook features and usability: shortcuts, editing, etc. The caveat is it is better to run a separate server instance for python/pyspark vs scala/spark 2017-05-17 19:27 GMT-07:00 Richard Moorhead

KTable like functionality in structured streaming

2017-05-16 Thread Stephen Fletcher
Are there any plans to add Kafka Streams KTable-like functionality in structured streaming for kafka sources? Allowing querying keyed messages using spark sql, maybe calling KTables in the backend

Re: Spark books

2017-05-03 Thread Stephen Fletcher
Zeming, Jacek also has a really good online spark book for spark 2, "mastering spark". I found it very helpful when trying to understand spark 2's encoders. his book is here: https://www.gitbook.com/book/jaceklaskowski/mastering-apache-spark/details On Wed, May 3, 2017 at 8:16 PM, Neelesh

Contributed to spark

2017-04-07 Thread Stephen Fletcher
I'd like to eventually contribute to spark, but I'm noticing since spark 2 the query planner is heavily used throughout the Dataset code base. Are there any sites I can go to that explain the technical details, more than just from a high-level perspective?

reducebykey

2017-04-07 Thread Stephen Fletcher
Are there plans to add reduceByKey to dataframes? Since switching over to spark 2 I find myself increasingly dissatisfied with the idea of converting dataframes to RDDs to do procedural programming on grouped data (both from an ease-of-programming stance and a performance stance). So I've been using
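
The closest typed analogue of RDD.reduceByKey on Datasets, as a sketch: groupByKey plus reduceGroups keeps the procedural style without a round trip through RDDs (data is illustrative):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").appName("rbk").getOrCreate()
    import spark.implicits._

    val sales = Seq(("a", 1), ("a", 2), ("b", 5)).toDS()
    val totals = sales
      .groupByKey(_._1)
      .reduceGroups((x, y) => (x._1, x._2 + y._2))
      .map { case (key, (_, total)) => (key, total) }
    totals.show()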

Re: attempting to map Dataset[Row]

2017-02-26 Thread Stephen Fletcher
ow]() buff += row (key,buff) } } ... On Sun, Feb 26, 2017 at 7:31 AM, Stephen Fletcher < stephen.fletc...@gmail.com> wrote: > I'm attempting to perform a map on a Dataset[Row] but getting an error on > decode when attempting to pass a custom encoder. > My code loo

attempting to map Dataset[Row]

2017-02-26 Thread Stephen Fletcher
I'm attempting to perform a map on a Dataset[Row] but getting an error on decode when attempting to pass a custom encoder. My code looks similar to the following: val source = spark.read.format("parquet").load("/emrdata/sources/very_large_ds") source.map{ row => { val key = row(0)
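
A sketch of the usual way out: map to a type Spark already knows how to encode (tuples, case classes) rather than supplying a custom Row encoder. The source here is a stand-in for the parquet load in the question:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").appName("rowmap").getOrCreate()
    import spark.implicits._

    val source = spark.range(5).toDF("id")
    val keyed = source.map(row => (row.getLong(0), row.getLong(0) * 2))
    keyed.show()
    // If the result must stay Dataset[Row], an explicit encoder such as
    // org.apache.spark.sql.catalyst.encoders.RowEncoder(schema) was the usual
    // answer in the Spark versions current when this thread was written.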

pyspark in intellij

2017-02-25 Thread Stephen Boesch
Anyone have this working - either in 1.X or 2.X? thanks

Re: Avalance of warnings trying to read Spark 1.6.X Parquet into Spark 2.X

2017-02-18 Thread Stephen Boesch
For now I have added to the log4j.properties: log4j.logger.org.apache.parquet=ERROR 2017-02-18 11:50 GMT-08:00 Stephen Boesch <java...@gmail.com>: > The following JIRA mentions that a fix made to read parquet 1.6.2 into 2.X > STILL leaves an "avalanche" of

Avalance of warnings trying to read Spark 1.6.X Parquet into Spark 2.X

2017-02-18 Thread Stephen Boesch
The following JIRA mentions that a fix made to read parquet 1.6.2 into 2.X STILL leaves an "avalanche" of warnings: https://issues.apache.org/jira/browse/SPARK-17993 Here is the text inside one of the last comments before it was merged: I have built the code from the PR and it indeed

Re: Spark/Mesos with GPU support

2016-12-30 Thread Stephen Boesch
Would it be possible to share that communication? I am interested in this thread. 2016-12-30 11:02 GMT-08:00 Ji Yan : > Thanks Michael, Tim and I have touched base and thankfully the issue has > already been resolved > > On Fri, Dec 30, 2016 at 9:20 AM, Michael Gummelt

Re: Invalid log directory running pyspark job

2016-11-23 Thread Stephen Boesch
This problem appears to be a regression on HEAD/master: when running against 2.0.2 the pyspark job completes successfully including running predictions. 2016-11-23 19:36 GMT-08:00 Stephen Boesch <java...@gmail.com>: > > For a pyspark job with 54 executors all of the task outputs h

Invalid log directory running pyspark job

2016-11-23 Thread Stephen Boesch
For a pyspark job with 54 executors all of the task outputs have a single line in both the stderr and stdout similar to: Error: invalid log directory /shared/sparkmaven/work/app-20161119222540-/0/ Note: the directory /shared/sparkmaven/work exists and is owned by the same user running the

Re: HPC with Spark? Simultaneous, parallel one to one mapping of partition to vcore

2016-11-19 Thread Stephen Boesch
While "apparently" saturating the N available workers using your proposed N partitions - the "actual" distribution of workers to tasks is controlled by the scheduler. If my past experience were of service - you can *not *trust the default Fair Scheduler to ensure the round-robin scheduling of the

Spark-packages

2016-11-06 Thread Stephen Boesch
What is the state of the spark-packages project(s)? When running a query for machine learning algorithms the results are not encouraging. https://spark-packages.org/?q=tags%3A%22Machine%20Learning%22 There are 62 packages. Only a few have actual releases - and even fewer with dates in the past

Re: Use BLAS object for matrix operation

2016-11-03 Thread Stephen Boesch
It is private. You will need to put your code in that same package or create an accessor to it living within that package: private[spark]. 2016-11-03 16:04 GMT-07:00 Yanwei Zhang : > I would like to use some matrix operations in the BLAS object defined in > ml.linalg.
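
A sketch of the second option: an accessor compiled into the same package can see and re-export the private[spark] members (the forwarded signature is illustrative):

    package org.apache.spark.ml.linalg

    object BLASAccessor {
      // Forward only what you need from the package-private BLAS object.
      def gemv(alpha: Double, a: Matrix, x: Vector, beta: Double, y: DenseVector): Unit =
        BLAS.gemv(alpha, a, x, beta, y)
    }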

Re: Aggregation Calculation

2016-11-03 Thread Stephen Boesch
You would likely want to create inline views that perform the filtering *before *performing t he cubes/rollup; in this way the cubes/rollups only operate on the pruned rows/columns. 2016-11-03 11:29 GMT-07:00 Andrés Ivaldi : > Hello, I need to perform some aggregations and a

DataFrame/Dataset join not producing correct results in Spark 2.0/Yarn

2016-10-12 Thread Stephen Hankinson
p cluster also works. Only when running on top of YARN do we see this issue. This also seems very similar to this issue: https://issues.apache.org/jira/browse/SPARK-10896 Thoughts? *Stephen Hankinson*

How to detect when a JavaSparkContext gets stopped

2016-09-05 Thread Hough, Stephen C
another request and tries to submit to the spark and gets a java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext. Is there a way I can observe when the JavaSparkContext I own is stopped? Thanks Stephen
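
One way to observe the shutdown, as a sketch: register a listener on the underlying SparkContext; onApplicationEnd fires when it stops. Polling jsc.sc.isStopped before each submission is another option on recent Spark versions:

    import org.apache.spark.api.java.JavaSparkContext
    import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationEnd}

    def watchForStop(jsc: JavaSparkContext)(onStop: () => Unit): Unit =
      jsc.sc.addSparkListener(new SparkListener {
        override def onApplicationEnd(end: SparkListenerApplicationEnd): Unit = onStop()
      })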

DataFrame equivalent to RDD.partionByKey

2016-08-09 Thread Stephen Fletcher
Is there a DataFrameReader equivalent to the RDD's partitionByKey? I'm reading data from a file data source and I want the data I'm reading in to be partitioned the same way as the data I'm processing through a spark streaming RDD in the process.
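
As far as I know there is no DataFrameReader option for this, but a common approximation (a sketch; the path, column name and partition count are assumptions) is to repartition by the key column immediately after the read so equal keys hash to the same partitions:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().master("local[*]").appName("repart").getOrCreate()

    val df = spark.read.parquet("/data/source")
      .repartition(200, col("key"))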

Re: Logging trait in Spark 2.0

2016-06-28 Thread Stephen Boesch
I also did not understand why the Logging class was made private in Spark 2.0. In a couple of projects including CaffeOnSpark the Logging class was simply copied to the new project to allow for backwards compatibility. 2016-06-28 18:10 GMT-07:00 Michael Armbrust : > I'd

Custom Optimizer

2016-06-23 Thread Stephen Boesch
a choice between the internally defined Online or batch version. Any suggestions on how we might be able to incorporate our own optimizer? Or do we need to roll all of our algorithms from top to bottom - basically side stepping ml/mllib? thanks stephen

Re: Building Spark 2.X in Intellij

2016-06-23 Thread Stephen Boesch
out.write(Opcodes.REDUCE) ^ 2016-06-22 23:49 GMT-07:00 Stephen Boesch <java...@gmail.com>: > Thanks Jeff - I remember that now from long time ago. After making that > change the next errors are: > > Error:scalac: missing or invalid dependency detected while lo

Re: Building Spark 2.X in Intellij

2016-06-23 Thread Stephen Boesch
o > spark/external/flume-sink/target/scala-2.11/src_managed/main/compiled_avro > under build path, this is the only thing you need to do manually if I > remember correctly. > > > > On Thu, Jun 23, 2016 at 2:30 PM, Stephen Boesch <java...@gmail.com> wrote: > >>

Re: Building Spark 2.X in Intellij

2016-06-23 Thread Stephen Boesch
ang <zjf...@gmail.com>: > It works well with me. You can try reimport it into intellij. > > On Thu, Jun 23, 2016 at 10:25 AM, Stephen Boesch <java...@gmail.com> > wrote: > >> >> Building inside intellij is an ever moving target. Anyone have the >> magical procedu

Building Spark 2.X in Intellij

2016-06-22 Thread Stephen Boesch
Building inside intellij is an ever moving target. Anyone have the magical procedures to get it going for 2.X? There are numerous library references that - although included in the pom.xml build - are for some reason not found when processed within Intellij.

Notebook(s) for Spark 2.0 ?

2016-06-20 Thread Stephen Boesch
Having looked closely at Jupyter, Zeppelin, and Spark-Notebook : only the latter seems to be close to having support for Spark 2.X. While I am interested in using Spark Notebook as soon as that support were available are there alternatives that work *now*? For example some unmerged -yet -working

Data Generators mllib -> ml

2016-06-20 Thread Stephen Boesch
There are around twenty data generators in mllib - none of which are presently migrated to ml. Here is an example /** * :: DeveloperApi :: * Generate sample data used for SVM. This class generates uniform random values * for the features and adds Gaussian noise with weight 0.1 to generate

Re: Python to Scala

2016-06-17 Thread Stephen Boesch
What are you expecting us to do? Yash provided a reasonable approach - based on the info you had provided in prior emails. Otherwise you can convert it from python to scala - or find someone else who feels comfortable doing it. That kind of inquiry would likely be appropriate on a job board.
