Re: Dataproc serverless for Spark

2022-11-21 Thread Stephen Boesch
Out of curiosity: are there functional limitations in Spark Standalone that are of concern? Yarn is more configurable for running non-spark workloads and for running multiple spark jobs in parallel. But for a single spark job it seems standalone launches more quickly and does not miss any

Re: Spark Scala Contract Opportunity @USA

2022-11-10 Thread Stephen Boesch
Please do not send advertisements on this channel. On Thu, 10 Nov 2022 at 13:40, sri hari kali charan Tummala < kali.tumm...@gmail.com> wrote: > Hi All, > > Is anyone looking for a spark scala contract role inside the USA? A > company called Maxonic has an open spark scala contract position

Re: Scala vs Python for ETL with Spark

2020-10-10 Thread Stephen Boesch
I agree with Wim's assessment of data engineering / ETL vs Data Science. I wrote pipelines/frameworks for large companies and scala was a much better choice. But for ad-hoc work interfacing directly with data science experiments pyspark presents less friction. On Sat, 10 Oct 2020 at 13:03, Mich

Re: Kotlin Spark API

2020-07-14 Thread Stephen Boesch
{ println(it) } } } So that shows some of the niceness of kotlin: intuitive type conversion `to`/`to` and `dsOf(list)` - and also the inlining of the side effects. Overall concise and pleasant to read. On Tue, 14 Jul 2020 at 12:18, Stephen Boesch wrote: > I started with scala/spark in

Re: Kotlin Spark API

2020-07-14 Thread Stephen Boesch
I started with scala/spark in 2012 and scala has been my go-to language for six years. But I heartily applaud this direction. Kotlin is more like a simplified Scala - with the benefits that brings - than a simplified java. I particularly like the simplified / streamlined collections classes.

Re: RDD-like API for entirely local workflows?

2020-07-04 Thread Stephen Boesch
Spark in local mode (which is different than standalone) is a solution for many use cases. I use it in conjunction with (and sometimes instead of) pandas/pandasql due to its much wider ETL related capabilities. On the JVM side it is an even more obvious choice - given there is no equivalent to

Re: Hey good looking toPandas ()

2020-06-19 Thread Stephen Boesch
afaik it has been there since Spark 2.0 in 2015. Not certain about Spark 1.5/1.6 On Thu, 18 Jun 2020 at 23:56, Anwar AliKhan wrote: > I first ran the command > df.show() > > For sanity check of my dataFrame. > > I wasn't impressed with the display. > > I then ran > df.toPandas() in Jupyter
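A minimal sketch of the call in question, assuming an active SparkSession named `spark` and a small DataFrame (toPandas collects everything to the driver):

  df = spark.range(5).toDF("id")
  pdf = df.toPandas()   # pulls all rows to the driver as a pandas DataFrame
  print(pdf)            # renders as a readable table in Jupyter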

Re: Modularising Spark/Scala program

2020-05-02 Thread Stephen Boesch
the predicates are typically sql's. Am Sa., 2. Mai 2020 um 06:13 Uhr schrieb Stephen Boesch : > Hi Mich! >I think you can combine the good/rejected into one method that > internally: > >- Create good/rejected df's given an input df and input >rules/predicates to apply to the

Re: Modularising Spark/Scala program

2020-05-02 Thread Stephen Boesch
Hi Mich! I think you can combine the good/rejected into one method that internally: - Create good/rejected df's given an input df and input rules/predicates to apply to the df. - Create a third df containing the good rows and the rejected rows with the bad columns nulled out -
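A minimal sketch of such a combined method, assuming Spark 2.x and rules expressed as SQL predicate strings (names here are illustrative):

  import org.apache.spark.sql.DataFrame

  def applyRule(df: DataFrame, predicate: String): (DataFrame, DataFrame) = {
    val good     = df.filter(predicate)            // rows passing the rule
    val rejected = df.filter(s"NOT ($predicate)")  // rows failing the rule
    (good, rejected)
  }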

Re: Going it alone.

2020-04-16 Thread Stephen Boesch
The warning signs were there from the first email sent from that person. I wonder is there any way to deal with this more proactively. Am Do., 16. Apr. 2020 um 10:54 Uhr schrieb Mich Talebzadeh < mich.talebza...@gmail.com>: > good for you. right move > > Dr Mich Talebzadeh > > > > LinkedIn * >

Re: IDE suitable for Spark

2020-04-07 Thread Stephen Boesch
I have been using Idea for both scala/spark and pyspark projects since 2013. It required a fair amount of fiddling that first year but has been stable since early 2015. For pyspark-only projects, Pycharm naturally also works very well. Am Di., 7. Apr. 2020 um 09:10 Uhr schrieb yeikel valdes : > >

Re: script running in jupyter 6-7x faster than spark submit

2019-09-10 Thread Stephen Boesch
same code. why running them two different ways vary so much in the > execution time. > > > > > *Regards,Dhrubajyoti Hati.Mob No: 9886428028/9652029028* > > > On Wed, Sep 11, 2019 at 8:42 AM Stephen Boesch wrote: > >> Sounds like you have done your homework to

Re: script running in jupyter 6-7x faster than spark submit

2019-09-10 Thread Stephen Boesch
Sounds like you have done your homework to properly compare. I'm guessing the answer to the following is yes .. but in any case: are they both running against the same spark cluster with the same configuration parameters, especially executor memory and number of workers? Am Di., 10. Sept. 2019

Re: Incremental (online) machine learning algorithms on ML

2019-08-05 Thread Stephen Boesch
There are several high bars to getting a new algorithm adopted. * It needs to be deemed by the MLLib committers/shepherds as widely useful to the community. Algorithms offered by larger companies after having demonstrated usefulness at scale for use cases likely to be encountered by many

How to execute non-timestamp-based aggregations in spark structured streaming?

2019-04-20 Thread Stephen Boesch
Consider the following *intended* sql: select row_number() over (partition by Origin order by OnTimeDepPct desc) OnTimeDepRank,* from flights This will *not* work in *structured streaming* : The culprit is: partition by Origin The requirement is to use a timestamp-typed field such as
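For contrast, a sketch of the timestamp-based aggregation that structured streaming does accept (a streaming DataFrame `flights` with an event-time column `depTs` is assumed):

  import org.apache.spark.sql.functions._

  val byOrigin = flights
    .withWatermark("depTs", "1 hour")
    .groupBy(window(col("depTs"), "15 minutes"), col("Origin"))
    .count()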

Re: spark-sklearn

2019-04-08 Thread Stephen Boesch
There are several suggestions on this SOF https://stackoverflow.com/questions/38984775/spark-errorexpected-zero-arguments-for-construction-of-classdict-for-numpy-cor 1 You need to convert the final value to a python list. You implement the function as follows: def uniq_array(col_array): x =
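That first suggestion, spelled out as a sketch (the UDF return type is assumed to be an array of doubles):

  import numpy as np
  from pyspark.sql.functions import udf
  from pyspark.sql.types import ArrayType, DoubleType

  def uniq_array(col_array):
      x = np.unique(col_array)
      return list(x)  # plain Python list, not a numpy ndarray

  uniq_array_udf = udf(uniq_array, ArrayType(DoubleType()))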

Re: Build spark source code with scala 2.11

2019-03-12 Thread Stephen Boesch
You might have better luck downloading the 2.4.X branch Am Di., 12. März 2019 um 16:39 Uhr schrieb swastik mittal : > Then are the mlib of spark compatible with scala 2.12? Or can I change the > spark version from spark3.0 to 2.3 or 2.4 in local spark/master? > > > > -- > Sent from:

Re: Build spark source code with scala 2.11

2019-03-12 Thread Stephen Boesch
I think scala 2.11 support was removed with the spark3.0/master Am Di., 12. März 2019 um 16:26 Uhr schrieb swastik mittal : > I am trying to build my spark using build/sbt package, after changing the > scala versions to 2.11 in pom.xml because my applications jar files use > scala 2.11. But

Re: Classic logistic regression missing !!! (Generalized linear models)

2018-10-11 Thread Stephen Boesch
So the LogisticRegression with regParam and elasticNetParam set to 0 is not what you are looking for? https://spark.apache.org/docs/2.3.0/ml-classification-regression.html#logistic-regression .setRegParam(0.0) .setElasticNetParam(0.0) Am Do., 11. Okt. 2018 um 15:46 Uhr schrieb pikufolgado
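For reference, a sketch of that unregularized configuration:

  import org.apache.spark.ml.classification.LogisticRegression

  val plainLR = new LogisticRegression()
    .setRegParam(0.0)         // no regularization strength
    .setElasticNetParam(0.0)  // L1/L2 mix; moot once regParam is 0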

Fixing NullType for parquet files

2018-09-12 Thread Stephen Boesch
https://issues.apache.org/jira/browse/SPARK-10943?focusedCommentId=16462797&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16462797 Stephen Boesch (javadba) added a comment - 03/May/18 17:08

Re: [announce] BeakerX supports Scala+Spark in Jupyter

2018-06-07 Thread Stephen Boesch
Assuming that the spark 2.X kernel (e.g. toree) were chosen for a given jupyter notebook and there is a Cell 3 that contains some Spark DataFrame operations .. Then: - what is the relationship between the %%spark magic and the toree kernel? - how does the %%spark magic get applied to that

Re: Guava dependency issue

2018-05-08 Thread Stephen Boesch
(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:6 2018-05-07 10:30 GMT-07:00 Stephen Boesch <java...@gmail.com>: > I am intermittently running into guava dependency issues across multiple > spark projects. I have tried

Guava dependency issue

2018-05-07 Thread Stephen Boesch
I am intermittently running into guava dependency issues across multiple spark projects. I have tried maven shade / relocate but it does not resolve the issues. The current project is extremely simple: *no* additional dependencies beyond scala, spark, and scalatest - yet the issues remain (and

Re: [Spark 2.x Core] .collect() size limit

2018-04-28 Thread Stephen Boesch
> On Sat, 28 Apr 2018, 21:19 Stephen Boesch, <java...@gmail.com> wrote: > >> Do you have a machine with terabytes of RAM? afaik collect() requires >> RAM - so that would be your limiting factor. >> >> 2018-04-28 8:41 GMT-07:00 klrmowse <klrmo...@gmail

Re: [Spark 2.x Core] .collect() size limit

2018-04-28 Thread Stephen Boesch
Do you have a machine with terabytes of RAM? afaik collect() requires RAM - so that would be your limiting factor. 2018-04-28 8:41 GMT-07:00 klrmowse : > i am currently trying to find a workaround for the Spark application i am > working on so that it does not have to use
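One way to sidestep the driver-RAM ceiling, sketched under the assumption that elements can be processed one at a time rather than held all at once:

  val rdd = sc.parallelize(1 to 1000000)
  rdd.toLocalIterator.foreach { x =>
    // handle each element on the driver without materializing the whole RDD
  }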

Has there been any explanation on the performance degradation between spark.ml and Mllib?

2018-01-21 Thread Stephen Boesch
While MLLib performed favorably vs Flink it *also* performed favorably vs spark.ml .. and by an *order of magnitude*. The following is one of the tables - it is for Logistic Regression. At that time spark.ml did not yet support SVM. From: https://bdataanalytics.biomedcentral.com/articles/10.

Re: Anyone know where to find independent contractors in New York?

2017-12-21 Thread Stephen Boesch
Hi Richard, this is not a jobs board: please only discuss spark application development issues. 2017-12-21 8:34 GMT-08:00 Richard L. Burton III : > I'm trying to locate four independent contractors who have experience with > Spark. I'm not sure where I can go to find

Re: LDA and evaluating topic number

2017-12-07 Thread Stephen Boesch
I have been testing on the 20 NewsGroups dataset - which the Spark docs themselves reference. I can confirm that perplexity increases and likelihood decreases as topics increase - and am similarly confused by these results. 2017-09-28 10:50 GMT-07:00 Cody Buntain : > Hi,

Weight column values not used in Binary Logistic Regression Summary

2017-11-18 Thread Stephen Boesch
In BinaryLogisticRegressionSummary there are @Since("1.5.0") tags on a number of comments identical to the following: * @note This ignores instance weights (setting all to 1.0) from `LogisticRegression.weightCol`. * This will change in later Spark versions. Are there any plans to address this?

Re: Spark streaming for CEP

2017-10-24 Thread Stephen Boesch
Hi Mich, the github link has a brief intro - including a link to the formal docs http://logisland.readthedocs.io/en/latest/index.html . They have an architectural overview, developer guide, tutorial, and pretty comprehensive api docs. 2017-10-24 13:31 GMT-07:00 Mich Talebzadeh

Re: Is there a difference between df.cache() vs df.rdd.cache()

2017-10-13 Thread Stephen Boesch
@Vadim Would it be true to say the `.rdd` *may* be creating a new job - depending on whether the DataFrame/DataSet had already been materialized via an action or checkpoint? If the only prior operations on the DataFrame had been transformations then the dataframe would still not have been
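A sketch of why the two caches are distinct: `.rdd` defines a new lineage (DataFrame -> RDD[Row]), so caching one does not populate the other:

  val df = spark.range(10).toDF("id")
  df.cache()      // caches the DataFrame's result
  df.rdd.cache()  // a separate lineage, cached independently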

Re: Running spark examples in Intellij

2017-10-11 Thread Stephen Boesch
repo - The local maven repo is included by default - so should not need to do anything special there The same errors from the original post continue to occur. 2017-10-11 20:05 GMT-07:00 Stephen Boesch <java...@gmail.com>: > A clarification here: the example is being run *from

Re: Running spark examples in Intellij

2017-10-11 Thread Stephen Boesch
n install and > define your local maven repo in SBT? > > -Paul > > Sent from my iPhone > > On Oct 11, 2017, at 5:48 PM, Stephen Boesch <java...@gmail.com> wrote: > > When attempting to run any example program w/ Intellij I am running into > guava ver

Running spark examples in Intellij

2017-10-11 Thread Stephen Boesch
When attempting to run any example program w/ Intellij I am running into guava versioning issues: Exception in thread "main" java.lang.NoClassDefFoundError: com/google/common/cache/CacheLoader at org.apache.spark.SparkConf.loadFromSystemProperties(SparkConf.scala:73) at

Re: SQL specific documentation for recent Spark releases

2017-08-10 Thread Stephen Boesch
Cheers > Jules > > Sent from my iPhone > Pardon the dumb thumb typos :) > > > On Aug 10, 2017, at 1:46 PM, Stephen Boesch <java...@gmail.com> wrote: > > > > > > While the DataFrame/DataSets are useful in many circumstances they are > cumbersome for many ty

SQL specific documentation for recent Spark releases

2017-08-10 Thread Stephen Boesch
While the DataFrame/DataSets are useful in many circumstances they are cumbersome for many types of complex sql queries. Is there an up-to-date *SQL* reference - i.e. not DataFrame DSL operations - for version 2.2? An example of what is not clear: what constructs are supported within

Re: What is the equivalent of mapPartitions in SpqrkSQL?

2017-06-25 Thread Stephen Boesch
Spark SQL did not support explicit partitioners even before tungsten, and often enough this did hurt performance. Even now Tungsten will not do the best job every time: so the question from the OP is still germane. 2017-06-25 19:18 GMT-07:00 Ryan : > Why would you like to

Re: Using SparkContext in Executors

2017-05-28 Thread Stephen Boesch
You would need to use *native* Cassandra API's in each Executor - not org.apache.spark.sql.cassandra.CassandraSQLContext - including to create a separate Cassandra connection on each Executor. 2017-05-28 15:47 GMT-07:00 Abdulfattah Safa : > So I can't run SQL queries in
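A sketch of the usual per-partition pattern (the native driver calls themselves are elided, since they depend on the client library chosen):

  rdd.foreachPartition { rows =>
    // open one native Cassandra connection here, on the executor
    rows.foreach { row =>
      // read/write through the native driver, not a SQLContext
    }
    // close the connection before the partition completes
  }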

Re: Jupyter spark Scala notebooks

2017-05-17 Thread Stephen Boesch
Jupyter with toree works well for my team. Jupyter is considerably more refined vs zeppelin as far as notebook features and usability: shortcuts, editing, etc. The caveat is it is better to run a separate server instance for python/pyspark vs scala/spark 2017-05-17 19:27 GMT-07:00 Richard Moorhead

pyspark in intellij

2017-02-25 Thread Stephen Boesch
Anyone have this working - either in 1.X or 2.X? thanks

Re: Avalance of warnings trying to read Spark 1.6.X Parquet into Spark 2.X

2017-02-18 Thread Stephen Boesch
For now I have added to the log4j.properties: log4j.logger.org.apache.parquet=ERROR 2017-02-18 11:50 GMT-08:00 Stephen Boesch <java...@gmail.com>: > The following JIRA mentions that a fix made to read parquet 1.6.2 into 2.X > STILL leaves an "avalanche" of

Avalance of warnings trying to read Spark 1.6.X Parquet into Spark 2.X

2017-02-18 Thread Stephen Boesch
The following JIRA mentions that a fix made to read parquet 1.6.2 into 2.X STILL leaves an "avalanche" of warnings: https://issues.apache.org/jira/browse/SPARK-17993 Here is the text inside one of the last comments before it was merged: I have built the code from the PR and it indeed

Re: Spark/Mesos with GPU support

2016-12-30 Thread Stephen Boesch
Would it be possible to share that communication? I am interested in this thread. 2016-12-30 11:02 GMT-08:00 Ji Yan : > Thanks Michael, Tim and I have touched base and thankfully the issue has > already been resolved > > On Fri, Dec 30, 2016 at 9:20 AM, Michael Gummelt

Re: Invalid log directory running pyspark job

2016-11-23 Thread Stephen Boesch
This problem appears to be a regression on HEAD/master: when running against 2.0.2 the pyspark job completes successfully including running predictions. 2016-11-23 19:36 GMT-08:00 Stephen Boesch <java...@gmail.com>: > > For a pyspark job with 54 executors all of the task outputs h

Invalid log directory running pyspark job

2016-11-23 Thread Stephen Boesch
For a pyspark job with 54 executors all of the task outputs have a single line in both the stderr and stdout similar to: Error: invalid log directory /shared/sparkmaven/work/app-20161119222540-/0/ Note: the directory /shared/sparkmaven/work exists and is owned by the same user running the

Re: HPC with Spark? Simultaneous, parallel one to one mapping of partition to vcore

2016-11-19 Thread Stephen Boesch
While "apparently" saturating the N available workers using your proposed N partitions - the "actual" distribution of workers to tasks is controlled by the scheduler. If my past experience were of service - you can *not *trust the default Fair Scheduler to ensure the round-robin scheduling of the

Spark-packages

2016-11-06 Thread Stephen Boesch
What is the state of the spark-packages project(s)? When running a query for machine learning algorithms the results are not encouraging. https://spark-packages.org/?q=tags%3A%22Machine%20Learning%22 There are 62 packages. Only a few have actual releases - and even fewer with dates in the past

Re: Use BLAS object for matrix operation

2016-11-03 Thread Stephen Boesch
It is private[spark]. You will need to put your code in that same package, or create an accessor to it living within that package. 2016-11-03 16:04 GMT-07:00 Yanwei Zhang : > I would like to use some matrix operations in the BLAS object defined in > ml.linalg.
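A minimal sketch of the accessor approach; the BLAS.dot signature is assumed from the Spark source of that era:

  // must live in this package to see private[spark] members
  package org.apache.spark.ml.linalg

  object BLASAccessor {
    def dot(x: Vector, y: Vector): Double = BLAS.dot(x, y)
  }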

Re: Aggregation Calculation

2016-11-03 Thread Stephen Boesch
You would likely want to create inline views that perform the filtering *before *performing t he cubes/rollup; in this way the cubes/rollups only operate on the pruned rows/columns. 2016-11-03 11:29 GMT-07:00 Andrés Ivaldi : > Hello, I need to perform some aggregations and a
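A sketch of that shape, with hypothetical table and column names:

  SELECT region, product, SUM(amount) AS total
  FROM (SELECT region, product, amount
        FROM sales
        WHERE sale_date >= '2016-01-01') pruned  -- filter first
  GROUP BY region, product WITH ROLLUP           -- roll up only the pruned rows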

Re: Logging trait in Spark 2.0

2016-06-28 Thread Stephen Boesch
I also did not understand why the Logging class was made private in Spark 2.0. In a couple of projects including CaffeOnSpark the Logging class was simply copied to the new project to allow for backwards compatibility. 2016-06-28 18:10 GMT-07:00 Michael Armbrust : > I'd

Custom Optimizer

2016-06-23 Thread Stephen Boesch
My team has a custom optimization routine that we would have wanted to plug in as a replacement for the default LBFGS / OWLQN for use by some of the ml/mllib algorithms. However it seems the choice of optimizer is hard-coded in every algorithm except LDA: and even in that one it is only a

Re: Building Spark 2.X in Intellij

2016-06-23 Thread Stephen Boesch
out.write(Opcodes.REDUCE) ^ 2016-06-22 23:49 GMT-07:00 Stephen Boesch <java...@gmail.com>: > Thanks Jeff - I remember that now from long time ago. After making that > change the next errors are: > > Error:scalac: missing or invalid dependency detected while lo

Re: Building Spark 2.X in Intellij

2016-06-23 Thread Stephen Boesch
o > spark/external/flume-sink/target/scala-2.11/src_managed/main/compiled_avro > under build path, this is the only thing you need to do manually if I > remember correctly. > > > > On Thu, Jun 23, 2016 at 2:30 PM, Stephen Boesch <java...@gmail.com> wrote: > >>

Re: Building Spark 2.X in Intellij

2016-06-23 Thread Stephen Boesch
ang <zjf...@gmail.com>: > It works well with me. You can try reimport it into intellij. > > On Thu, Jun 23, 2016 at 10:25 AM, Stephen Boesch <java...@gmail.com> > wrote: > >> >> Building inside intellij is an ever moving target. Anyone have the >> magical procedu

Building Spark 2.X in Intellij

2016-06-22 Thread Stephen Boesch
Building inside intellij is an ever moving target. Anyone have the magical procedures to get it going for 2.X? There are numerous library references that - although included in the pom.xml build - are for some reason not found when processed within Intellij.

Notebook(s) for Spark 2.0 ?

2016-06-20 Thread Stephen Boesch
Having looked closely at Jupyter, Zeppelin, and Spark-Notebook : only the latter seems to be close to having support for Spark 2.X. While I am interested in using Spark Notebook as soon as that support were available are there alternatives that work *now*? For example some unmerged -yet -working

Data Generators mllib -> ml

2016-06-20 Thread Stephen Boesch
There are around twenty data generators in mllib -none of which are presently migrated to ml. Here is an example /** * :: DeveloperApi :: * Generate sample data used for SVM. This class generates uniform random values * for the features and adds Gaussian noise with weight 0.1 to generate

Re: Python to Scala

2016-06-17 Thread Stephen Boesch
What are you expecting us to do? Yash provided a reasonable approach - based on the info you had provided in prior emails. Otherwise you can convert it from python to scala yourself - or find someone else who feels comfortable doing it. That kind of inquiry would likely be appropriate on a job board.

Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

2016-06-09 Thread Stephen Boesch
How many workers (/cpu cores) are assigned to this job? 2016-06-09 13:01 GMT-07:00 SRK : > Hi, > > How to insert data into 2000 partitions(directories) of ORC/parquet at a > time using Spark SQL? It seems to be not performant when I try to insert > 2000 directories of

Re: HIVE Query 25x faster than SPARK Query

2016-06-09 Thread Stephen Boesch
Out of curiosity: are the tables partitioned on a.pk and b.fk? Hive might be using copartitioning in that case: it is one of hive's strengths. 2016-06-09 7:28 GMT-07:00 Gourav Sengupta : > Hi Mich, > > does not Hive use map-reduce? I thought it to be so. And since I am > running

Re: Multinomial regression with spark.ml version of LogisticRegression

2016-05-29 Thread Stephen Boesch
e/SPARK-7159 > On May 28, 2016 9:31 PM, "Stephen Boesch" <java...@gmail.com> wrote: > >> Thanks Phuong But the point of my post is how to achieve without using >> the deprecated the mllib pacakge. The mllib package already has >> multinomial regression buil

Re: Multinomial regression with spark.ml version of LogisticRegression

2016-05-28 Thread Stephen Boesch
ogisticGradient is in the mllib package, not ml > package. I just want to say that we can build a multinomial logistic > regression model from the current version of Spark. > > Regards, > > Phuong > > > > On Sun, May 29, 2016 at 12:04 AM, Stephen Boes

Re: Multinomial regression with spark.ml version of LogisticRegression

2016-05-28 Thread Stephen Boesch
gradient and loss > for a multinomial logistic regression. That is, you can train a > multinomial logistic regression model with LogisticGradient and a > class to solve optimization like LBFGS to get a weight vector of the > size (numClassrd-1)*numFeatures. > > > Phuong > > >

Re: Multinomial regression with spark.ml version of LogisticRegression

2016-05-28 Thread Stephen Boesch
Followup: just encountered the "OneVsRest" classifier in ml.classification: I will look into using it with the binary LogisticRegression as the provided classifier. 2016-05-28 9:06 GMT-07:00 Stephen Boesch <java...@gmail.com>: > > Presently only the mllib version has t

Multinomial regression with spark.ml version of LogisticRegression

2016-05-28 Thread Stephen Boesch
Presently only the mllib version has the one-vs-all approach for multinomial support. The ml version with ElasticNet support only allows binary regression. With feature parity of ml vs mllib having been stated as an objective for 2.0.0 - is there a projected availability of the multinomial

Re: How to use the spark submit script / capability

2016-05-15 Thread Stephen Boesch
e PR, check Spark's API > documentation. > > > On Sun, May 15, 2016 at 9:33 AM, Stephen Boesch <java...@gmail.com> wrote: > > > > There is a committed PR from Marcelo Vanzin addressing that capability: > > > > https://github.com/apache/spark/pull/3916/files &

How to use the spark submit script / capability

2016-05-15 Thread Stephen Boesch
There is a committed PR from Marcelo Vanzin addressing that capability: https://github.com/apache/spark/pull/3916/files Is there any documentation on how to use this? The PR itself has two comments asking for the docs that were not answered.

Re: spark task scheduling delay

2016-01-20 Thread Stephen Boesch
Which Resource Manager are you using? 2016-01-20 21:38 GMT-08:00 Renu Yadav : > Any suggestions? > > On Wed, Jan 20, 2016 at 6:50 PM, Renu Yadav wrote: > >> Hi , >> >> I am facing spark task scheduling delay issue in spark 1.4. >> >> suppose I have 1600

Re: Recommendations using Spark

2016-01-07 Thread Stephen Boesch
Alternating least squares takes an RDD of (user, product, rating) tuples and the resulting model provides predict(user, product) and recommendProducts methods among others.
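A minimal sketch with made-up ratings, assuming a SparkContext `sc`:

  import org.apache.spark.mllib.recommendation.{ALS, Rating}

  val ratings = sc.parallelize(Seq(
    Rating(1, 10, 4.0), Rating(1, 20, 1.0), Rating(2, 10, 5.0)))
  val model = ALS.train(ratings, 8, 10, 0.01)  // rank, iterations, lambda
  val score = model.predict(2, 20)             // score one (user, product) pair
  val top3  = model.recommendProducts(2, 3)    // top 3 products for user 2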

Re: Spark SQL 1.5.2 missing JDBC driver for PostgreSQL?

2015-12-22 Thread Stephen Boesch
The postgres jdbc driver needs to be added to the classpath of your spark workers. You can do a search for how to do that (multiple ways). 2015-12-22 17:22 GMT-08:00 b2k70 : > I see in the Spark SQL documentation that a temporary table can be created > directly onto a
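One common way to do that, sketched (the jar path and version are illustrative):

  spark-submit \
    --driver-class-path /path/to/postgresql-9.4.1207.jar \
    --jars /path/to/postgresql-9.4.1207.jar \
    your-app.jar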

Re: Spark SQL 1.5.2 missing JDBC driver for PostgreSQL?

2015-12-22 Thread Stephen Boesch
y other things that I have to do that you can think of? > > Thanks, > Ben > > > On Dec 22, 2015, at 6:25 PM, Stephen Boesch <java...@gmail.com> wrote: > > The postgres jdbc driver needs to be added to the classpath of your spark > workers. You can do a search for how

Re: Scala VS Java VS Python

2015-12-16 Thread Stephen Boesch
There are solid reasons to have built spark on the jvm vs python. The question for Daniel appears at this point to be scala vs java8. For that there are many comparisons already available; but in the case of working with spark there is the additional benefit for the scala side that the core

Re: Avoid Shuffling on Partitioned Data

2015-12-04 Thread Stephen Boesch
@Yu Fengdong: Your approach - specifically the groupBy - results in a shuffle, does it not? 2015-12-04 2:02 GMT-08:00 Fengdong Yu : > There are many ways, one simple is: > > such as: you want to know how many rows for each month: > > >

Re: Spark 1.6 Build

2015-11-24 Thread Stephen Boesch
r.t. building locally, please specify -Pscala-2.11 > > Cheers > > On Tue, Nov 24, 2015 at 9:58 AM, Stephen Boesch <java...@gmail.com> wrote: > >> HI Madabhattula >> Scala 2.11 requires building from source. Prebuilt binaries are >> available only for sca

Re: Spark 1.6 Build

2015-11-24 Thread Stephen Boesch
Hi Madabhattula, Scala 2.11 requires building from source. Prebuilt binaries are available only for scala 2.10. From the src folder: dev/change-scala-version.sh 2.11 Then build as you would normally either from mvn or sbt. The above info *is* included in the spark docs but a little hard
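The steps above, spelled out from the source root (the -Pscala-2.11 profile follows the reply quoted in this thread):

  dev/change-scala-version.sh 2.11
  mvn -Pscala-2.11 -DskipTests clean package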

Re: Spark-SQL idiomatic way of adding a new partition or writing to Partitioned Persistent Table

2015-11-22 Thread Stephen Boesch
>> and then use the Hive's dynamic partitioned insert syntax What does this entail? Same sql but you need to do set hive.exec.dynamic.partition = true; in the hive/sql context (along with several other related dynamic partition settings.) Is there anything else/special
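A sketch of the settings in question ahead of a hypothetical dynamic-partitioned insert (table and columns illustrative):

  SET hive.exec.dynamic.partition = true;
  SET hive.exec.dynamic.partition.mode = nonstrict;
  INSERT INTO TABLE events PARTITION (dt)
  SELECT user_id, action, dt FROM staging_events;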

Do windowing functions require hive support?

2015-11-18 Thread Stephen Boesch
The following works against a hive table from spark sql hc.sql("select id,r from (select id, name, rank() over (order by name) as r from tt2) v where v.r >= 1 and v.r <= 12") But when using a standard sql context against a temporary table the following occurs: Exception in thread "main"

Re: Do windowing functions require hive support?

2015-11-18 Thread Stephen Boesch
Checked out 1.6.0-SNAPSHOT 60 minutes ago 2015-11-18 19:19 GMT-08:00 Jack Yang <j...@uow.edu.au>: > Which version of spark are you using? > > > > *From:* Stephen Boesch [mailto:java...@gmail.com] > *Sent:* Thursday, 19 November 2015 2:12 PM > *To:* user > *Su

Re: Do windowing functions require hive support?

2015-11-18 Thread Stephen Boesch
But to focus the attention properly: I had already tried out 1.5.2. 2015-11-18 19:46 GMT-08:00 Stephen Boesch <java...@gmail.com>: > Checked out 1.6.0-SNAPSHOT 60 minutes ago > > 2015-11-18 19:19 GMT-08:00 Jack Yang <j...@uow.edu.au>: > >> Which version of spark ar

Re: Do windowing functions require hive support?

2015-11-18 Thread Stephen Boesch
Why is the same query (and actually i tried several variations) working against a hivecontext and not against the sql context? 2015-11-18 19:57 GMT-08:00 Michael Armbrust <mich...@databricks.com>: > Yes they do. > > On Wed, Nov 18, 2015 at 7:49 PM, Stephen Boesch <java..

Re: Examples module not building in intellij

2015-10-04 Thread Stephen Boesch
ust the last thing that happens to > fail. > > On Sun, Oct 4, 2015 at 7:06 AM, Stephen Boesch <java...@gmail.com> wrote: > > > > For a week or two the trunk has not been building for the examples module > > within intellij. The other modules - including core, sql,

Examples module not building in intellij

2015-10-04 Thread Stephen Boesch
For a week or two the trunk has not been building for the examples module within intellij. The other modules - including core, sql, mllib, etc *are * working. A portion of the error message is "Unable to get dependency information: Unable to read the metadata file for artifact

Re: Breakpoints not hit with Scalatest + intelliJ

2015-09-18 Thread Stephen Boesch
Hi Michel, please try local[1] and report back whether the breakpoints are hit. 2015-09-18 7:37 GMT-07:00 Michel Lemay : > Hi, > > I'm adding unit tests to some utility functions that are using > SparkContext but I'm unable to debug code and hit breakpoints when running > under
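The suggestion, sketched: a single-threaded local master makes breakpoint behavior deterministic:

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf().setMaster("local[1]").setAppName("debug-test")
  val sc = new SparkContext(conf)  // one task thread, so breakpoints are hit reliably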

Re: How to restrict java unit tests from the maven command line

2015-09-10 Thread Stephen Boesch
Yes, adding that flag does the trick. thanks. 2015-09-10 13:47 GMT-07:00 Sean Owen <so...@cloudera.com>: > -Dtest=none ? > > > https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-RunningIndividualTests > > On Thu, Sep 10, 2015 at

How to restrict java unit tests from the maven command line

2015-09-10 Thread Stephen Boesch
I have invoked mvn test with the -DwildcardSuites option to specify a single BinarizerSuite scalatest suite. The command line is mvn -pl mllib -Pyarn -Phadoop-2.6 -Dhadoop2.7.1 -Dscala-2.11 -Dmaven.javadoc.skip=true -DwildcardSuites=org.apache.spark.ml.feature.BinarizerSuite test The scala
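Per the reply in this thread, adding -Dtest=none makes surefire skip the java tests so only the named scala suite runs; the combined command would look like:

  mvn -pl mllib -Pyarn -Phadoop-2.6 -Dtest=none \
      -DwildcardSuites=org.apache.spark.ml.feature.BinarizerSuite test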

Re: Spark on scala 2.11 build fails due to incorrect jline dependency in REPL

2015-08-17 Thread Stephen Boesch
. FYI On Sun, Aug 16, 2015 at 11:12 AM, Stephen Boesch java...@gmail.com wrote: I am building spark with the following options - most notably the **scala-2.11**: . dev/switch-to-scala-2.11.sh mvn -Phive -Pyarn -Phadoop-2.6 -Dhadoop2.6.2 -Pscala-2.11 -DskipTests -Dmaven.javadoc.skip

Spark on scala 2.11 build fails due to incorrect jline dependency in REPL

2015-08-16 Thread Stephen Boesch
I am building spark with the following options - most notably the **scala-2.11**: . dev/switch-to-scala-2.11.sh mvn -Phive -Pyarn -Phadoop-2.6 -Dhadoop2.6.2 -Pscala-2.11 -DskipTests -Dmaven.javadoc.skip=true clean package The build goes pretty far but fails in one of the minor modules

Re: Error: Exception in thread main java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration

2015-08-14 Thread Stephen Boesch
NoClassDefFoundError differs from ClassNotFoundException: it indicates an error while initializing that class, even though the class is found on the classpath. Please provide the full stack trace. 2015-08-14 4:59 GMT-07:00 stelsavva stel...@avocarrot.com: Hello, I am just starting out with

Spark-submit not finding main class and the error reflects different path to jar file than specified

2015-08-06 Thread Stephen Boesch
Given the following command line to spark-submit: bin/spark-submit --verbose --master local[2]--class org.yardstick.spark.SparkCoreRDDBenchmark /shared/ysgood/target/yardstick-spark-uber-0.0.1.jar Here is the output: NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes

Which directory contains third party libraries for Spark

2015-07-27 Thread Stephen Boesch
when using spark-submit: which directory contains third party libraries that will be loaded on each of the slaves? I would like to scp one or more libraries to each of the slaves instead of shipping the contents in the application uber-jar. Note: I did try adding to $SPARK_HOME/lib_managed/jars.

Re: spark benchmarking

2015-07-08 Thread Stephen Boesch
One option is the databricks/spark-perf project https://github.com/databricks/spark-perf 2015-07-08 11:23 GMT-07:00 MrAsanjar . afsan...@gmail.com: Hi all, What is the most common used tool/product to benchmark spark job?

Catalyst Errors when building spark from trunk

2015-07-07 Thread Stephen Boesch
The following errors are occurring upon building using mvn options clean package Are there some requirements/restrictions on profiles/settings for catalyst to build properly? [error] /shared/sparkup2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala:138: value

Re: What does Spark is not just MapReduce mean? Isn't every Spark job a form of MapReduce?

2015-06-28 Thread Stephen Boesch
Vanilla map/reduce does not expose it: but hive on top of map/reduce has superior partitioning (and bucketing) support to Spark. 2015-06-28 13:44 GMT-07:00 Koert Kuipers ko...@tresata.com: spark is partitioner aware, so it can exploit a situation where 2 datasets are partitioned the same way

Re: Velox Model Server

2015-06-21 Thread Stephen Boesch
Oryx 2 has a scala client https://github.com/OryxProject/oryx/blob/master/framework/oryx-api/src/main/scala/com/cloudera/oryx/api/ 2015-06-20 11:39 GMT-07:00 Debasish Das debasish.da...@gmail.com: After getting used to Scala, writing Java is too much work :-) I am looking for scala based

Spark 1.3.1 bundle does not build - unresolved dependency

2015-06-01 Thread Stephen Boesch
I downloaded the 1.3.1 distro tarball $ll ../spark-1.3.1.tar.gz -rw-r--r--@ 1 steve staff 8500861 Apr 23 09:58 ../spark-1.3.1.tar.gz However the build on it is failing with an unresolved dependency: *configuration not public* $ build/sbt assembly -Dhadoop.version=2.5.2 -Pyarn -Phadoop-2.4

Re: Intermittent difficulties for Worker to contact Master on same machine in standalone

2015-05-27 Thread Stephen Boesch
(the same btw applies for the Node where you run the driver app – all other nodes must be able to resolve its name) *From:* Stephen Boesch [mailto:java...@gmail.com] *Sent:* Wednesday, May 20, 2015 10:07 AM *To:* user *Subject:* Intermittent difficulties for Worker to contact Master on same

Re: Intermittent difficulties for Worker to contact Master on same machine in standalone

2015-05-27 Thread Stephen Boesch
TestRunner: power-iteration-clustering 8 512.0 MB 2015/05/27 12:44:03 steve FINISHED 6 s app-20150527123822- TestRunner: power-iteration-clustering 8 512.0 MB 2015/05/27 12:38:22 steve FINISHED 6 s 2015-05-27 11:42 GMT-07:00 Stephen Boesch java...@gmail.com: Thanks Yana, My current

Intermittent difficulties for Worker to contact Master on same machine in standalone

2015-05-20 Thread Stephen Boesch
What conditions would cause the following delays / failure for a standalone machine/cluster to have the Worker contact the Master? 15/05/20 02:02:53 INFO WorkerWebUI: Started WorkerWebUI at http://10.0.0.3:8081 15/05/20 02:02:53 INFO Worker: Connecting to master

Re: Code error

2015-05-19 Thread Stephen Boesch
Hi Ricardo, providing the error output would help. But in any case you need to do a collect() on the rdd returned from computeCost. 2015-05-19 11:59 GMT-07:00 Ricardo Goncalves da Silva ricardog.si...@telefonica.com: Hi, Can anybody see what’s wrong in this piece of code:

Re: Building Spark

2015-05-13 Thread Stephen Boesch
Hi Akhil, Building with sbt tends to need around 3.5GB whereas maven requirements are much lower, around 1.7GB. So try using maven. For reference I have the following settings and both do compile. sbt would not work with lower values. $echo $SBT_OPTS -Xmx3012m -XX:MaxPermSize=512m
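The sbt settings quoted above, plus a maven analogue assumed from the ~1.7GB figure:

  export SBT_OPTS="-Xmx3012m -XX:MaxPermSize=512m"
  export MAVEN_OPTS="-Xmx1700m -XX:MaxPermSize=512m"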
