Re: Is this list alive? I need help

2024-02-23 Thread Stephen Boesch
The list is alive! (a play on the old IMAX film *The Dream is Alive*). But does it breathe? Not sure - I have not done Solr in over a decade. On Fri, 23 Feb 2024 at 10:05, Beale, Jim (US-KOP) wrote: > I have a Solrcloud installation of three servers on three r5.xlarge EC2 > with a shared disk drive u

Re: Dataproc serverless for Spark

2022-11-21 Thread Stephen Boesch
Out of curiosity: are there functional limitations in Spark Standalone that are of concern? Yarn is more configurable for running non-Spark workloads and for running multiple Spark jobs in parallel. But for a single Spark job it seems Standalone launches more quickly and does not miss any features

Re: Spark Scala Contract Opportunity @USA

2022-11-10 Thread Stephen Boesch
Please do not send advertisements on this channel. On Thu, 10 Nov 2022 at 13:40, sri hari kali charan Tummala < kali.tumm...@gmail.com> wrote: > Hi All, > > Is anyone looking for a spark scala contract role inside the USA? A > company called Maxonic has an open spark scala contract position (100%

Re: Solr russification

2021-08-10 Thread Stephen Boesch
Automated translations are rough at best. On Tue, 10 Aug 2021 at 06:16, Dave wrote: > Can't Chrome translate the page or is there too much JavaScript? > > > On Aug 10, 2021, at 8:44 AM, Eric Pugh > wrote: > > > > Nothing built in today. The Solr GUI is written in AngularJS, and > while it refe

Re: mission statement : unified

2020-10-25 Thread Stephen Boesch
While the core of Spark is and has been quite solid and a go-to infrastructure, the *streaming* part of the story was still quite weak at least through mid last year. I went into depth on both structured and the older DStream. The structured in particular was difficult to use: both in terms o

Re: Scala vs Python for ETL with Spark

2020-10-10 Thread Stephen Boesch
I agree with Wim's assessment of data engineering / ETL vs Data Science. I wrote pipelines/frameworks for large companies and scala was a much better choice. But for ad-hoc work interfacing directly with data science experiments pyspark presents less friction. On Sat, 10 Oct 2020 at 13:03, Mich Ta

Re: Removing Hive-on-Spark

2020-07-27 Thread Stephen Boesch
Why would it be this way instead of the other way around? On Mon, 27 Jul 2020 at 12:27, David wrote: > Hello Hive Users. > > I am interested in gathering some feedback on the adoption of > Hive-on-Spark. > > Does anyone care to volunteer their usage information and would you be > open to removin

Re: Kotlin Spark API

2020-07-14 Thread Stephen Boesch
{ println(it) } } } So that shows some of the niceness of kotlin: intuitive type conversion `to`/`to` and `dsOf(list)` - and also the inlining of the side effects. Overall concise and pleasant to read. On Tue, 14 Jul 2020 at 12:18, Stephen Boesch wrote: > I started with scala/spark in

Re: Kotlin Spark API

2020-07-14 Thread Stephen Boesch
I started with scala/spark in 2012 and scala has been my go-to language for six years. But I heartily applaud this direction. Kotlin is more like a simplified Scala - with the benefits that brings - than a simplified java. I particularly like the simplified / streamlined collections classes. Reall

Re: RDD-like API for entirely local workflows?

2020-07-04 Thread Stephen Boesch
Spark in local mode (which is different than standalone) is a solution for many use cases. I use it in conjunction with (and sometimes instead of) pandas/pandasql due to its much wider ETL related capabilities. On the JVM side it is an even more obvious choice - given there is no equivalent to pand

Re: [VOTE] Decommissioning SPIP

2020-07-01 Thread Stephen Boesch
+1 Thx for seeing this through On Wed, 1 Jul 2020 at 20:03, Imran Rashid wrote: > +1 > > I think this is going to be a really important feature for Spark and I'm > glad to see Holden focusing on it. > > On Wed, Jul 1, 2020 at 8:38 PM Mridul Muralidharan > wrote: > >> +1 >> >> Thanks, >> Mridul

Re: Initial Decom PR for Spark 3?

2020-06-22 Thread Stephen Boesch
> draft for comment by the end of Spark summit. I'll be using the same design >> document for the design component, so if anyone has input on the design >> document feel free to start leaving comments there now. >> >> On Sat, Jun 20, 2020 at 4:23 PM Stephen Boesch wro

Re: Initial Decom PR for Spark 3?

2020-06-20 Thread Stephen Boesch
Hi - given there is a design doc (contrary to that comment), is this going to move forward? On Thu, 18 Jun 2020 at 18:05, Hyukjin Kwon wrote: > Looks it had to be with SPIP and a proper design doc to discuss. > > 2020년 2월 9일 (일) 오전 1:23, Erik Erlandson 님이 작성: > >> I'd be willing to pull this in, u

Re: Hey good looking toPandas ()

2020-06-19 Thread Stephen Boesch
AFAIK it has been there since Spark 2.0 in 2016. Not certain about Spark 1.5/1.6. On Thu, 18 Jun 2020 at 23:56, Anwar AliKhan wrote: > I first ran the command > df.show() > > For sanity check of my dataFrame. > > I wasn't impressed with the display. > > I then ran > df.toPandas() in Jupiter N

Re: Initial Decom PR for Spark 3?

2020-06-18 Thread Stephen Boesch
Second paragraph of the PR lists the design doc. > There is a design document at https://docs.google.com/document/d/1xVO1b6KAwdUhjEJBolVPl9C6sLj7oOveErwDSYdT-pE/edit?usp=sharing On Thu, 18 Jun 2020 at 18:05, Hyukjin Kwon wrote: > Looks it had to be with SPIP and a proper design doc to discuss.

Re: life after Hortonworks

2020-05-11 Thread Stephen Boesch
I am reading between the lines that ambari is no longer a strategic platform. Would someone please provide a link/reference to a Cloudera press release or blog describing this and maybe related decisions/roadmaps? thx! Am Mo., 11. Mai 2020 um 10:05 Uhr schrieb Aaron Bossert < aa...@punchcyber.com

Re: Modularising Spark/Scala program

2020-05-02 Thread Stephen Boesch
predicates are typically SQL ones. Am Sa., 2. Mai 2020 um 06:13 Uhr schrieb Stephen Boesch : > Hi Mich! >I think you can combine the good/rejected into one method that > internally: > >- Create good/rejected df's given an input df and input >rules/predicates to appl

Re: Modularising Spark/Scala program

2020-05-02 Thread Stephen Boesch
Hi Mich! I think you can combine the good/rejected into one method that internally: - Create good/rejected df's given an input df and input rules/predicates to apply to the df. - Create a third df containing the good rows and the rejected rows with the bad columns nulled out - Ap
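A minimal sketch of the combined method described above, using plain Python dicts and lists to stand in for DataFrames and rows (all names here are illustrative, not from the thread):

```python
# Hypothetical sketch of the good/rejected split: a row is "good" only if
# every predicate accepts it. In Spark this would be two filter() calls
# (or one pass adding a validity column) over the same input DataFrame.

def split_rows(rows, predicates):
    """Partition rows into (good, rejected) given a list of predicates."""
    good, rejected = [], []
    for row in rows:
        (good if all(p(row) for p in predicates) else rejected).append(row)
    return good, rejected

rows = [{"id": 1, "qty": 5}, {"id": 2, "qty": -3}, {"id": 3, "qty": 0}]
predicates = [lambda r: r["qty"] >= 0, lambda r: r["id"] > 0]
good, rejected = split_rows(rows, predicates)
```

Passing the predicates in as arguments is what makes the method reusable across pipelines, which is the modularisation being asked about.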

Re: Going it alone.

2020-04-16 Thread Stephen Boesch
The warning signs were there from the first email sent from that person. I wonder if there is any way to deal with this more proactively. Am Do., 16. Apr. 2020 um 10:54 Uhr schrieb Mich Talebzadeh < mich.talebza...@gmail.com>: > good for you. right move > > Dr Mich Talebzadeh > > > > LinkedIn * > h

Re: IDE suitable for Spark

2020-04-07 Thread Stephen Boesch
I have been using IntelliJ IDEA for both scala/spark and pyspark projects since 2013. It required a fair amount of fiddling that first year but has been stable since early 2015. For pyspark-only projects PyCharm naturally also works very well. Am Di., 7. Apr. 2020 um 09:10 Uhr schrieb yeikel valdes : > > Ze

Re: Calling Scala function from angular submit button

2019-10-04 Thread Stephen Boesch
;re saying but I'm not sure how to go about it. > Any tips or good tutorials on it to point me in the right direction. > Thanks for the response. > > On Fri, Oct 4, 2019 at 5:58 PM Stephen Boesch wrote: > >> You'll need to start a listener/server on the scala end and

Re: Calling Scala function from angular submit button

2019-10-04 Thread Stephen Boesch
You'll need to start a listener/server on the scala end and communicate via a websocket connection from angular. Am Fr., 4. Okt. 2019 um 13:00 Uhr schrieb Joshua Ochsankehl < joshua.ochsank...@gmail.com>: > Is it possible to pass a value to a spark/scala function from > an angular submit button?

Re: script running in jupyter 6-7x faster than spark submit

2019-09-10 Thread Stephen Boesch
exact same code. why running them two different ways vary so much in the > execution time. > > > > > *Regards,Dhrubajyoti Hati.Mob No: 9886428028/9652029028* > > > On Wed, Sep 11, 2019 at 8:42 AM Stephen Boesch wrote: > >> Sounds like you have done your homewo

Re: script running in jupyter 6-7x faster than spark submit

2019-09-10 Thread Stephen Boesch
Sounds like you have done your homework to properly compare. I'm guessing the answer to the following is yes, but in any case: are they both running against the same Spark cluster with the same configuration parameters, especially executor memory and number of workers? Am Di., 10. Sept. 2019

Re: Incremental (online) machine learning algorithms on ML

2019-08-05 Thread Stephen Boesch
There are several high bars to getting a new algorithm adopted. * It needs to be deemed by the MLLib committers/shepherds as widely useful to the community. Algorithms offered by larger companies after having demonstrated usefulness at scale for use cases likely to be encountered by many othe

How to execute non-timestamp-based aggregations in spark structured streaming?

2019-04-20 Thread Stephen Boesch
Consider the following *intended* sql: select row_number() over (partition by Origin order by OnTimeDepPct desc) OnTimeDepRank, * from flights This will *not* work in *structured streaming*: the culprit is "partition by Origin". The requirement is to use a timestamp-typed field such as par
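To make the intent concrete, here is a plain-Python emulation of what that row_number() query computes - the field names come from the message, while the data and function name are made up for illustration:

```python
# Emulate: row_number() over (partition by Origin order by OnTimeDepPct desc)
# This is exactly the non-timestamp-based aggregation that structured
# streaming rejects; in batch Spark SQL it works as written.
from itertools import groupby
from operator import itemgetter

def rank_by_partition(rows, partition_key, order_key):
    """Assign a 1-based rank within each partition, ordered descending."""
    out = []
    rows = sorted(rows, key=itemgetter(partition_key))
    for _, group in groupby(rows, key=itemgetter(partition_key)):
        ordered = sorted(group, key=itemgetter(order_key), reverse=True)
        for rank, row in enumerate(ordered, start=1):
            out.append({**row, "OnTimeDepRank": rank})
    return out

flights = [
    {"Origin": "SEA", "OnTimeDepPct": 0.91},
    {"Origin": "SEA", "OnTimeDepPct": 0.97},
    {"Origin": "SFO", "OnTimeDepPct": 0.85},
]
ranked = rank_by_partition(flights, "Origin", "OnTimeDepPct")
```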

Re: Kafka tuning - consultant work

2019-04-15 Thread Stephen Boesch
Please refrain from using this list as a job board. thank you. Am Mo., 15. Apr. 2019 um 07:00 Uhr schrieb Manoj Murumkar < manoj.murum...@gmail.com>: > Damian, > > Let me know when we can talk. I have done extensive work on Kafka and run > a boutique consulting firm specializes in this work. Let

Re: spark-sklearn

2019-04-08 Thread Stephen Boesch
There are several suggestions on this SOF https://stackoverflow.com/questions/38984775/spark-errorexpected-zero-arguments-for-construction-of-classdict-for-numpy-cor 1. You need to convert the final value to a python list. You implement the function as follows: def uniq_array(col_array): x =
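The essence of the SOF fix is to return native Python types from the UDF, since Spark's pickler cannot reconstruct numpy objects. A minimal sketch (the linked answer uses numpy's `np.unique(...).tolist()`; this version avoids the numpy dependency but has the same shape - convert before returning):

```python
# Hypothetical reconstruction of the fix: the UDF must return a plain
# Python list, not an ndarray. sorted(set(...)) stands in for
# np.unique(...).tolist() from the linked answer.

def uniq_array(col_array):
    # Returning an ndarray here would raise the "expected zero arguments
    # for construction of ClassDict" error when Spark pickles the result.
    return sorted(set(col_array))

result = uniq_array([3, 1, 3, 2, 1])
```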

Re: Build spark source code with scala 2.11

2019-03-12 Thread Stephen Boesch
You might have better luck downloading the 2.4.X branch Am Di., 12. März 2019 um 16:39 Uhr schrieb swastik mittal : > Then are the mlib of spark compatible with scala 2.12? Or can I change the > spark version from spark3.0 to 2.3 or 2.4 in local spark/master? > > > > -- > Sent from: http://apache

Re: Build spark source code with scala 2.11

2019-03-12 Thread Stephen Boesch
I think scala 2.11 support was removed with the spark3.0/master Am Di., 12. März 2019 um 16:26 Uhr schrieb swastik mittal : > I am trying to build my spark using build/sbt package, after changing the > scala versions to 2.11 in pom.xml because my applications jar files use > scala 2.11. But build

Re: [MLlib] PCA Aggregator

2018-10-19 Thread Stephen Boesch
Erik - is there a current location for approved/recommended third-party additions? The spark-packages site has been stale for years, it seems. Am Fr., 19. Okt. 2018 um 07:06 Uhr schrieb Erik Erlandson < eerla...@redhat.com>: > Hi Matt! > > There are a couple ways to do this. If you want to submit it for

Re: Classic logistic regression missing !!! (Generalized linear models)

2018-10-11 Thread Stephen Boesch
So the LogisticRegression with regParam and elasticNetParam set to 0 is not what you are looking for? https://spark.apache.org/docs/2.3.0/ml-classification-regression.html#logistic-regression .setRegParam(0.0) .setElasticNetParam(0.0) Am Do., 11. Okt. 2018 um 15:46 Uhr schrieb pikufolgado <

Fixing NullType for parquet files

2018-09-12 Thread Stephen Boesch
Permalink: <https://issues.apache.org/jira/browse/SPARK-10943?focusedCommentId=16462797&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16462797> Stephen Boesch <https://issues.apache.org/jira/secure/ViewProfile.jspa?name=javadba> ad

Re: [announce] BeakerX supports Scala+Spark in Jupyter

2018-06-07 Thread Stephen Boesch
Assuming that the spark 2.X kernel (e.g. toree) were chosen for a given jupyter notebook and there is a Cell 3 that contains some Spark DataFrame operations .. Then : - what is the relationship between the %%spark magic and the toree kernel? - how does the %%spark magic get applied to that

Re: Guava dependency issue

2018-05-08 Thread Stephen Boesch
(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:6 2018-05-07 10:30 GMT-07:00 Stephen Boesch : > I am intermittently running into guava dependency issues across mutiple > spark projects. I have tried maven shade / relocate but it do

Guava dependency issue

2018-05-07 Thread Stephen Boesch
I am intermittently running into guava dependency issues across multiple spark projects. I have tried maven shade / relocate but it does not resolve the issues. The current project is extremely simple: *no* additional dependencies beyond scala, spark, and scalatest - yet the issues remain (and yes
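For reference, the shade/relocate approach mentioned here usually takes the following shape - this is an assumed pom.xml fragment for illustration, not the configuration from the thread. It rewrites Guava's `com.google.common` packages inside the application jar so they cannot collide with the Guava version Spark ships:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <relocations>
      <relocation>
        <!-- Move the app's Guava classes to a private package -->
        <pattern>com.google.common</pattern>
        <shadedPattern>myapp.shaded.com.google.common</shadedPattern>
      </relocation>
    </relocations>
  </configuration>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
    </execution>
  </executions>
</plugin>
```

Note relocation only helps for the application's own classpath; it does not change the Guava version Spark itself loads.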

[jira] [Commented] (SPARK-10943) NullType Column cannot be written to Parquet

2018-05-03 Thread Stephen Boesch (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462797#comment-16462797 ] Stephen Boesch commented on SPARK-10943: Given the comment by Daniel Davis

Re: [Spark 2.x Core] .collect() size limit

2018-04-28 Thread Stephen Boesch
to be verified. And maybe *all* collects do require sufficient memory - would you like to check the source code to see if there were disk-backed collects actually happening for some cases? 2018-04-28 9:48 GMT-07:00 Deepak Goel : > There is something as *virtual memory* > > On Sat, 28

Re: [Spark 2.x Core] .collect() size limit

2018-04-28 Thread Stephen Boesch
Do you have a machine with terabytes of RAM? AFAIK collect() requires RAM, so that would be your limiting factor. 2018-04-28 8:41 GMT-07:00 klrmowse : > i am currently trying to find a workaround for the Spark application i am > working on so that it does not have to use .collect() > > but, fo

Re: Spark.ml roadmap 2.3.0 and beyond

2018-03-20 Thread Stephen Boesch
, make sure to check what > you're currently listed as shepherding!) The links for searching can be > useful too. > > On Thu, Dec 7, 2017 at 3:55 PM, Stephen Boesch wrote: > >> Thanks Joseph. We can wait for post 2.3.0. >> >> 2017-12-07 15:36 GMT-08:00 Joseph

Has there been any explanation on the performance degradation between spark.ml and Mllib?

2018-01-21 Thread Stephen Boesch
While MLLib performed favorably vs Flink, it *also* performed favorably vs spark.ml - and by an *order of magnitude*. The following is one of the tables; it is for Logistic Regression. At that time spark.ML did not yet support SVM. From: https://bdataanalytics.biomedcentral.com/articles/10. 118

Re: Anyone know where to find independent contractors in New York?

2017-12-21 Thread Stephen Boesch
Hi Richard, this is not a jobs board: please only discuss spark application development issues. 2017-12-21 8:34 GMT-08:00 Richard L. Burton III : > I'm trying to locate four independent contractors who have experience with > Spark. I'm not sure where I can go to find experienced Spark consultants

Re: GenerateExec, CodegenSupport and supportCodegen flag off?!

2017-12-10 Thread Stephen Boesch
A relevant observation: there was a closed/executed jira last year to remove the option to disable the codegen flag (and unsafe flag as well): https://issues.apache.org/jira/browse/SPARK-11644 2017-12-10 13:16 GMT-08:00 Jacek Laskowski : > Hi, > > I'm wondering why a physical operator like Gener

Re: Spark.ml roadmap 2.3.0 and beyond

2017-12-07 Thread Stephen Boesch
JIRA as well as the few mailing list threads about directions. > > For myself, I'm mainly focusing on fixing some issues with persistence for > custom algorithms in PySpark (done), adding the image schema (done), and > using ML Pipelines in Structured Streaming (WIP). > > Josep

Re: LDA and evaluating topic number

2017-12-07 Thread Stephen Boesch
I have been testing on the 20 NewsGroups dataset - which the Spark docs themselves reference. I can confirm that perplexity increases and likelihood decreases as topics increase - and am similarly confused by these results. 2017-09-28 10:50 GMT-07:00 Cody Buntain : > Hi, all! > > Is there an exa

Re: Spark.ml roadmap 2.3.0 and beyond

2017-11-29 Thread Stephen Boesch
e spark.ml were headed? 2017-11-29 6:39 GMT-08:00 Stephen Boesch : > Any further information/ thoughts? > > > > 2017-11-22 15:07 GMT-08:00 Stephen Boesch : > >> The roadmaps for prior releases e.g. 1.6 2.0 2.1 2.2 were available: >> >> 2.2.0 https://issues.apache

Re: Spark.ml roadmap 2.3.0 and beyond

2017-11-29 Thread Stephen Boesch
Any further information/ thoughts? 2017-11-22 15:07 GMT-08:00 Stephen Boesch : > The roadmaps for prior releases e.g. 1.6 2.0 2.1 2.2 were available: > > 2.2.0 https://issues.apache.org/jira/browse/SPARK-18813 > > 2.1.0 https://issues.apache.org/jira/browse/SPARK-15581 > ..

Spark.ml roadmap 2.3.0 and beyond

2017-11-22 Thread Stephen Boesch
The roadmaps for prior releases e.g. 1.6, 2.0, 2.1, 2.2 were available: 2.2.0 https://issues.apache.org/jira/browse/SPARK-18813 2.1.0 https://issues.apache.org/jira/browse/SPARK-15581 .. It seems those roadmaps were not available per se for 2.3.0 and later? Is there a different mechanism for that

Weight column values not used in Binary Logistic Regression Summary

2017-11-18 Thread Stephen Boesch
In BinaryLogisticRegressionSummary there are @Since("1.5.0") tags on a number of comments identical to the following: * @note This ignores instance weights (setting all to 1.0) from `LogisticRegression.weightCol`. * This will change in later Spark versions. Are there any plans to address this? O

Re: Spark streaming for CEP

2017-10-24 Thread Stephen Boesch
Hi Mich, the github link has a brief intro - including a link to the formal docs http://logisland.readthedocs.io/en/latest/index.html . They have an architectural overview, developer guide, tutorial, and pretty comprehensive api docs. 2017-10-24 13:31 GMT-07:00 Mich Talebzadeh : > thanks Thomas

Re: Add a machine learning algorithm to sparkml

2017-10-20 Thread Stephen Boesch
A couple of less obvious facets of getting over the (significant!) hurdle to have an algorithm accepted into mllib (/spark.ml): - the review time can be *very* long - a few to many months is a typical case even for relatively fast-tracked algorithms - you will likely be asked to provide

Re: Is there a difference between df.cache() vs df.rdd.cache()

2017-10-13 Thread Stephen Boesch
@Vadim Would it be true to say the `.rdd` *may* be creating a new job - depending on whether the DataFrame/DataSet had already been materialized via an action or checkpoint? If the only prior operations on the DataFrame had been transformations then the dataframe would still not have been calcu

Re: Running spark examples in Intellij

2017-10-11 Thread Stephen Boesch
maven repo - The local maven repo is included by default - so should not need to do anything special there The same errors from the original post continue to occur. 2017-10-11 20:05 GMT-07:00 Stephen Boesch : > A clarification here: the example is being run *from the Spark codebase*. >

Re: Running spark examples in Intellij

2017-10-11 Thread Stephen Boesch
maven repo in SBT? > > -Paul > > Sent from my iPhone > > On Oct 11, 2017, at 5:48 PM, Stephen Boesch wrote: > > When attempting to run any example program w/ Intellij I am running into > guava versioning issues: > > Exception in thread "main" java.lan

Running spark examples in Intellij

2017-10-11 Thread Stephen Boesch
When attempting to run any example program w/ Intellij I am running into guava versioning issues: Exception in thread "main" java.lang.NoClassDefFoundError: com/google/common/cache/CacheLoader at org.apache.spark.SparkConf.loadFromSystemProperties(SparkConf.scala:73) at org.apache.spark.SparkConf.

[tesseract-ocr] Newbie: wondering why a fairly crisp document has such low accuracy

2017-08-12 Thread Stephen Boesch
I printed out the "Welcome" page on my HP laserjet printer and scanned it in as .png. The quality is quite good. So I had been anticipating maybe 85%+ accuracy from the tesseract OCR. I did not even bother to tally carefully - but by eyeballing it seems about 50%. I had used all defaul

Re: SQL specific documentation for recent Spark releases

2017-08-10 Thread Stephen Boesch
ent from my iPhone > Pardon the dumb thumb typos :) > > > On Aug 10, 2017, at 1:46 PM, Stephen Boesch wrote: > > > > > > While the DataFrame/DataSets are useful in many circumstances they are > cumbersome for many types of complex sql queries. > > > >

SQL specific documentation for recent Spark releases

2017-08-10 Thread Stephen Boesch
While the DataFrame/DataSets are useful in many circumstances they are cumbersome for many types of complex sql queries. Is there an up to date *SQL* reference - i.e. not DataFrame DSL operations - for version 2.2? An example of what is not clear: what constructs are supported within select

Re: What is the equivalent of mapPartitions in SpqrkSQL?

2017-06-25 Thread Stephen Boesch
Spark SQL did not support explicit partitioners even before Tungsten, and often enough this did hurt performance. Even now Tungsten will not do the best job every time: so the question from the OP is still germane. 2017-06-25 19:18 GMT-07:00 Ryan : > Why would you like to do so? I think there's
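For readers unfamiliar with the operation being asked about, here is a plain-Python sketch of mapPartitions semantics - all names are illustrative, and plain lists stand in for RDD partitions. The user function receives an iterator over one partition's rows and returns an iterator, which is what lets per-partition setup costs (connections, model loading) be amortized:

```python
# mapPartitions, emulated: func runs once per partition and maps an
# iterator of rows to an iterator of results.

def map_partitions(partitions, func):
    """Apply func once per partition; func: iterator -> iterator."""
    return [list(func(iter(part))) for part in partitions]

def per_partition(rows):
    conn = object()  # stand-in for expensive per-partition setup
    for row in rows:
        yield row * 2  # process each row using the shared `conn`

out = map_partitions([[1, 2], [3, 4, 5]], per_partition)
```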

Re: Using SparkContext in Executors

2017-05-28 Thread Stephen Boesch
You would need to use *native* Cassandra API's in each Executor - not org.apache.spark.sql.cassandra.CassandraSQLContext - including to create a separate Cassandra connection on each Executor. 2017-05-28 15:47 GMT-07:00 Abdulfattah Safa : > So I can't run SQL queries in Executors ? > > On Sun, M

Re: Jupyter spark Scala notebooks

2017-05-17 Thread Stephen Boesch
Jupyter with toree works well for my team. Jupyter is considerably more refined than Zeppelin as far as notebook features and usability: shortcuts, editing, etc. The caveat is that it is better to run a separate server instance for python/pyspark vs scala/spark 2017-05-17 19:27 GMT-07:00 Richard Moorhead : >

pyspark in intellij

2017-02-25 Thread Stephen Boesch
Anyone have this working - either in 1.X or 2.X? thanks

Re: Avalance of warnings trying to read Spark 1.6.X Parquet into Spark 2.X

2017-02-18 Thread Stephen Boesch
For now I have added to the log4j.properties: log4j.logger.org.apache.parquet=ERROR 2017-02-18 11:50 GMT-08:00 Stephen Boesch : > The following JIRA mentions that a fix made to read parquet 1.6.2 into 2.X > STILL leaves an "avalanche" of warnings: > > > https://issu

Avalance of warnings trying to read Spark 1.6.X Parquet into Spark 2.X

2017-02-18 Thread Stephen Boesch
The following JIRA mentions that a fix made to read parquet 1.6.2 into 2.X STILL leaves an "avalanche" of warnings: https://issues.apache.org/jira/browse/SPARK-17993 Here is the text inside one of the last comments before it was merged: I have built the code from the PR and it indeed succeed

Re: Anybody hiring mess experience engineers?

2017-02-04 Thread Stephen Boesch
Please take job inquiries/offers off of the main channel. thanks. 2017-02-04 12:19 GMT-08:00 Vaibhav Khanduja : > Thanks Brock. > > Since I am based in Santa Clara, CA. I was wondering if anything is > located local. Skills you need tough definitely match with me – Spark, HPC > etc. > > From: Br

Re: MLlib mission and goals

2017-01-24 Thread Stephen Boesch
re: spark-packages.org and "Would these really be better in the core project?" That was not at all the intent of my input: instead, to ask "how and where to structure/place deployment-quality code that is yet *not* part of the distribution?" The spark-packages site has no curation whatsoever: no

Re: MLlib mission and goals

2017-01-23 Thread Stephen Boesch
Along the lines of #1: the spark packages seemed to have had a good start about two years ago: but now there are not more than a handful in general use - e.g. databricks CSV. When the available packages are browsed the majority are incomplete, empty, unmaintained, or unclear. Any ideas on how to

Re: Latest keyboard shortcuts

2017-01-18 Thread Stephen Boesch
bump. Without keyboard shortcuts a notebook is nearly unusable: certainly they must exist - but where is a document for them? thanks. 2017-01-17 9:58 GMT-08:00 Stephen Boesch : > There was an old jira for keyboard shortcuts. But there did not appear to > be an associated document >

Latest keyboard shortcuts

2017-01-17 Thread Stephen Boesch
There was an old jira for keyboard shortcuts. But there did not appear to be an associated document https://issues.apache.org/jira/browse/ZEPPELIN-391 Is there a comprehensive cheat-sheet for the shortcuts? Especially to compare to the excellent jupyter keyboard shortcuts; e.g. dd to delete a ce

Re: Spark/Mesos with GPU support

2016-12-30 Thread Stephen Boesch
Would it be possible to share that communication? I am interested in this thread. 2016-12-30 11:02 GMT-08:00 Ji Yan : > Thanks Michael, Tim and I have touched base and thankfully the issue has > already been resolved > > On Fri, Dec 30, 2016 at 9:20 AM, Michael Gummelt > wrote: > >> I've cc'd T

Re: Invalid log directory running pyspark job

2016-11-23 Thread Stephen Boesch
This problem appears to be a regression on HEAD/master: when running against 2.0.2 the pyspark job completes successfully including running predictions. 2016-11-23 19:36 GMT-08:00 Stephen Boesch : > > For a pyspark job with 54 executors all of the task outputs have a single > line in

Invalid log directory running pyspark job

2016-11-23 Thread Stephen Boesch
For a pyspark job with 54 executors all of the task outputs have a single line in both the stderr and stdout similar to: Error: invalid log directory /shared/sparkmaven/work/app-20161119222540-/0/ Note: the directory /shared/sparkmaven/work exists and is owned by the same user running the jo

Re: Use experience and performance data of offheap from Alibaba online cluster

2016-11-20 Thread Stephen Boesch
path > offheap in future 1.x release? Let me create one JIRA about this and let's > discuss in the JIRA system. And to be very clear, it's a big YES to share > our patches with all rather than only numbers, just which way is better > (smile). > > And answers for @Stephe

Re: Use experience and performance data of offheap from Alibaba online cluster

2016-11-20 Thread Stephen Boesch
that hedge read is very useful for reducing > latency). So I think the peak throughput is true. > > There are more than 600 million people in China that use internet. So if > they decide to do something to your system at the same time, it looks like > a DDOS to your system... > > T

Re: Use experience and performance data of offheap from Alibaba online cluster

2016-11-19 Thread Stephen Boesch
Repeating my earlier question: 20*Meg* queries per second?? Just checked and *google* does 40*K* queries per second. Now maybe the "queries" are a decomposition of far fewer end-user queries that cause a fanout of backend queries. *But still .. * So maybe please check your numbers again. 2016-1

Re: HPC with Spark? Simultaneous, parallel one to one mapping of partition to vcore

2016-11-19 Thread Stephen Boesch
While "apparently" saturating the N available workers using your proposed N partitions, the "actual" distribution of workers to tasks is controlled by the scheduler. If my past experience is any guide, you can *not* trust the default Fair Scheduler to ensure the round-robin scheduling of the

Re: Use experience and performance data of offheap from Alibaba online cluster

2016-11-19 Thread Stephen Boesch
Am I correct in deducing there were on the order of 1.5-2.0 *trillion* queries in a 24 hour span? 2016-11-18 23:35 GMT-08:00 Anoop John : > Because of some compatibility issues, we decide that this will be done > in 2.0 only.. Ya as Andy said, it would be great to share the 1.x > backported patc

Spark-packages

2016-11-06 Thread Stephen Boesch
What is the state of the spark-packages project(s)? When running a query for machine learning algorithms the results are not encouraging. https://spark-packages.org/?q=tags%3A%22Machine%20Learning%22 There are 62 packages. Only a few have actual releases - and even fewer with dates in the past

Re: Use BLAS object for matrix operation

2016-11-03 Thread Stephen Boesch
It is declared private[spark]. You will need to put your code in that same package, or create an accessor to it living within that package. 2016-11-03 16:04 GMT-07:00 Yanwei Zhang : > I would like to use some matrix operations in the BLAS object defined in > ml.linalg. But for some reason, spark s

Re: Aggregation Calculation

2016-11-03 Thread Stephen Boesch
You would likely want to create inline views that perform the filtering *before *performing t he cubes/rollup; in this way the cubes/rollups only operate on the pruned rows/columns. 2016-11-03 11:29 GMT-07:00 Andrés Ivaldi : > Hello, I need to perform some aggregations and a kind of Cube/RollUp >

Re: Organizing Spark ML example packages

2016-09-12 Thread Stephen Boesch
Yes: will you have cycles to do it? 2016-09-12 9:09 GMT-07:00 Nick Pentreath : > Never actually got around to doing this - do folks still think it > worthwhile? > > On Thu, 21 Apr 2016 at 00:10 Joseph Bradley wrote: > >> Sounds good to me. I'd request we be strict during this process about >> r

[jira] [Commented] (YARN-3249) Add a "kill application" button to Resource Manager's Web UI

2016-08-22 Thread Stephen Boesch (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15431738#comment-15431738 ] Stephen Boesch commented on YARN-3249: -- I have the same question as Xiaoyong Zhu:

[jira] [Commented] (SPARK-2243) Support multiple SparkContexts in the same JVM

2016-08-14 Thread Stephen Boesch (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15420467#comment-15420467 ] Stephen Boesch commented on SPARK-2243: --- Given this were not going to be f

Re: what's the pronunciation of "MESOS"?

2016-08-09 Thread Stephen Boesch
@Jared / Yu Wei: Mesos is essentially a Spanish word, so MAY-sos would travel well. 2016-08-09 11:35 GMT-07:00 Ken Sipe : > Apparently it depends on if you are British or not :) > http://dictionary.cambridge.org/us/pronunciation/english/the-mesosphere > > apparently the absence of “phere" chang

Re: Logging trait in Spark 2.0

2016-06-28 Thread Stephen Boesch
I also did not understand why the Logging class was made private in Spark 2.0. In a couple of projects including CaffeOnSpark the Logging class was simply copied to the new project to allow for backwards compatibility. 2016-06-28 18:10 GMT-07:00 Michael Armbrust : > I'd suggest using the slf4j A

Custom Optimizer

2016-06-23 Thread Stephen Boesch
My team has a custom optimization routine that we would like to plug in as a replacement for the default LBFGS / OWLQN for use by some of the ml/mllib algorithms. However, it seems the choice of optimizer is hard-coded in every algorithm except LDA: and even in that one it is only a choice

Re: Building Spark 2.X in Intellij

2016-06-23 Thread Stephen Boesch
out.write(Opcodes.REDUCE) ^ 2016-06-22 23:49 GMT-07:00 Stephen Boesch : > Thanks Jeff - I remember that now from long time ago. After making that > change the next errors are: > > Error:scalac: missing or invalid dependency detected while loading class > file 'RDD

Re: Building Spark 2.X in Intellij

2016-06-22 Thread Stephen Boesch
2016-06-22 23:39 GMT-07:00 Jeff Zhang : > You need to > spark/external/flume-sink/target/scala-2.11/src_managed/main/compiled_avro > under build path, this is the only thing you need to do manually if I > remember correctly. > > > > On Thu, Jun 23, 2016 at 2:30 PM, St

Re: Building Spark 2.X in Intellij

2016-06-22 Thread Stephen Boesch
> It works well with me. You can try reimport it into intellij. > > On Thu, Jun 23, 2016 at 10:25 AM, Stephen Boesch > wrote: > >> >> Building inside intellij is an ever moving target. Anyone have the >> magical procedures to get it going for 2.X? >> >>

Building Spark 2.X in Intellij

2016-06-22 Thread Stephen Boesch
Building inside IntelliJ is an ever-moving target. Anyone have the magical procedures to get it going for 2.X? There are numerous library references that - although included in the pom.xml build - are for some reason not found when processed within IntelliJ.

Notebook(s) for Spark 2.0 ?

2016-06-20 Thread Stephen Boesch
Having looked closely at Jupyter, Zeppelin, and Spark-Notebook: only the latter seems to be close to having support for Spark 2.X. While I am interested in using Spark Notebook as soon as that support is available, are there alternatives that work *now*? For example some unmerged-yet-working

Data Generators mllib -> ml

2016-06-20 Thread Stephen Boesch
There are around twenty data generators in mllib, none of which is presently migrated to ml. Here is an example: /** * :: DeveloperApi :: * Generate sample data used for SVM. This class generates uniform random values * for the features and adds Gaussian noise with weight 0.1 to generate label
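The pattern such generators follow (as the scaladoc above describes it: uniform random features, Gaussian noise with weight 0.1 on the label) can be sketched in plain Scala. This is an illustrative reconstruction, not the actual mllib source; the function name and signature are hypothetical.

```scala
// Sketch of the generator pattern: uniform features in [-1, 1), label from
// a linear margin perturbed by Gaussian noise scaled by 0.1.
import scala.util.Random

def generateSvmPoint(rnd: Random, weights: Array[Double]): (Double, Array[Double]) = {
  val x = Array.fill(weights.length)(rnd.nextDouble() * 2 - 1) // uniform features
  val margin =
    weights.zip(x).map { case (w, v) => w * v }.sum + 0.1 * rnd.nextGaussian()
  val label = if (margin > 0) 1.0 else 0.0                     // noisy linear label
  (label, x)
}
```

A migration to ml would mostly mean emitting `DataFrame` rows with `Vector` feature columns instead of mllib's RDD-based `LabeledPoint`s.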

Re: Python to Scala

2016-06-17 Thread Stephen Boesch
What are you expecting us to do? Yash provided a reasonable approach - based on the info you had provided in prior emails. Otherwise you can convert it from Python to Spark - or find someone else who feels comfortable to do it. That kind of inquiry would likely be appropriate on a job board. 2

Re: [DISCUSS] Java 8 as a minimum requirement

2016-06-16 Thread Stephen Boesch
@Jeff Klukas What is the concern about scala 2.11 vs 2.12? 2.11 runs on both java7 and java8 2016-06-16 14:12 GMT-07:00 Jeff Klukas : > Would the move to Java 8 be for all modules? I'd have some concern about > removing Java 7 compatibility for kafka-clients and for kafka streams > (though less

Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

2016-06-09 Thread Stephen Boesch
How many workers (/cpu cores) are assigned to this job? 2016-06-09 13:01 GMT-07:00 SRK : > Hi, > > How to insert data into 2000 partitions(directories) of ORC/parquet at a > time using Spark SQL? It seems to be not performant when I try to insert > 2000 directories of Parquet/ORC using Spark SQL
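For the many-partition write asked about above, one common technique (an assumption here, not something the thread itself prescribes) is to repartition on the partition column before the write, so each task writes to a handful of directories instead of all 2000. The `df`, column name, and paths below are hypothetical.

```scala
// Sketch: cluster rows by the partition key first, then do a partitioned
// Parquet write. Input path, column, and output path are made up.
import org.apache.spark.sql.functions.col

val df = spark.read.parquet("/staging/events")   // assumed SparkSession `spark`

df.repartition(col("partKey"))                   // co-locate rows per partition value
  .write
  .partitionBy("partKey")                        // one directory per partKey value
  .mode("overwrite")
  .parquet("/warehouse/events")
```

Without the `repartition`, every task can hold an open file per partition value, which is where writes to thousands of directories tend to fall over.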

Re: HIVE Query 25x faster than SPARK Query

2016-06-09 Thread Stephen Boesch
ooc are the tables partitioned on a.pk and b.fk? Hive might be using copartitioning in that case: it is one of hive's strengths. 2016-06-09 7:28 GMT-07:00 Gourav Sengupta : > Hi Mich, > > does not Hive use map-reduce? I thought it to be so. And since I am > running the queries in EMR 4.6 therefo
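The copartitioning idea raised above has a rough Spark analogue in bucketed tables, where both sides are pre-hashed on the join key so the join can avoid a full shuffle. This is a sketch under assumptions: the `a`/`b` DataFrames, key names `pk`/`fk`, and bucket count are all hypothetical.

```scala
// Sketch: bucket both join sides on the key (Spark 2.0+ DataFrameWriter API),
// mirroring Hive's copartitioning. Names and bucket count are hypothetical.
import org.apache.spark.sql.functions.col

a.write.bucketBy(64, "pk").sortBy("pk").saveAsTable("a_bucketed")
b.write.bucketBy(64, "fk").sortBy("fk").saveAsTable("b_bucketed")

val joined = spark.table("a_bucketed")
  .join(spark.table("b_bucketed"), col("pk") === col("fk"))
```

Matching bucket counts on both tables is what lets the planner skip the exchange, which is plausibly the advantage Hive was exploiting in the 25x case.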
