Re: Dataproc serverless for Spark

2022-11-21 Thread Stephen Boesch
Out of curiosity: are there functional limitations in Spark Standalone
that are of concern? YARN is more configurable for running non-Spark
workloads and for running multiple Spark jobs in parallel. But for a single
Spark job it seems standalone launches more quickly and does not miss any
features. Are there specific limitations you are aware of or have run into?
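
For a single job the difference is essentially just the master URL. A minimal
Scala illustration (the standalone master host name here is hypothetical):

import org.apache.spark.sql.SparkSession

// Standalone cluster manager: point at the standalone master.
val onStandalone = SparkSession.builder()
  .appName("single-job")
  .master("spark://standalone-master:7077")
  .getOrCreate()

// YARN: the same job, only the master changes.
// val onYarn = SparkSession.builder().appName("single-job").master("yarn").getOrCreate()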

stephen b

On Mon, 21 Nov 2022 at 09:01, Mich Talebzadeh 
wrote:

> Hi,
>
> I have not tested this myself but Google have brought up *Dataproc Serverless
> for Spark*. In a nutshell, Dataproc Serverless lets you run Spark batch
> workloads without requiring you to provision and manage your own cluster.
> Specify workload parameters, and then submit the workload to the Dataproc
> Serverless service. The service will run the workload on a managed compute
> infrastructure, autoscaling resources as needed. Dataproc Serverless
> charges apply only to the time when the workload is executing. Google
> Dataproc is similar to Amazon EMR.
>
> So in short you don't need to provision your own Dataproc cluster etc. One
> thing I noticed from the release doc is that the
> resource management is *Spark based* as opposed to standard Dataproc,
> which is YARN based. It is available for Spark 3.2. My assumption is
> that by Spark based it means that Spark is running in standalone mode. Has
> there been much improvement in release 3.2 for standalone mode?
>
> Thanks
>
>
>
>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


Re: Spark Scala Contract Opportunity @USA

2022-11-10 Thread Stephen Boesch
Please do not send advertisements on this channel.

On Thu, 10 Nov 2022 at 13:40, sri hari kali charan Tummala <
kali.tumm...@gmail.com> wrote:

> Hi All,
>
> Is anyone looking for a spark scala contract role inside the USA? A
> company called Maxonic has an open spark scala contract position (100%
> remote) inside the USA if anyone is interested, please send your CV to
> kali.tumm...@gmail.com.
>
> Thanks & Regards
> Sri Tummala
>
>


Re: Scala vs Python for ETL with Spark

2020-10-10 Thread Stephen Boesch
I agree with Wim's assessment of data engineering / ETL vs Data Science.
I wrote pipelines/frameworks for large companies and scala was a much
better choice. But for ad-hoc work interfacing directly with data science
experiments pyspark presents less friction.

On Sat, 10 Oct 2020 at 13:03, Mich Talebzadeh 
wrote:

> Many thanks everyone for their valuable contribution.
>
> We all started with Spark a few years ago where Scala was the talk of the
> town. I agree with the note that as long as Spark stayed niche and elite,
> then someone with Scala knowledge was attracting premiums. In fairness in
> 2014-2015, there was not much talk of Data Science input (I may be wrong).
> But the world has moved on so to speak. Python itself has been around
> a long time (long being relative here). Most people either knew UNIX Shell,
> C, Python or Perl or a combination of all these. I recall we had a director
> a few years ago who asked our Hadoop admin for root password to log in to
> the edge node. Later he became head of machine learning somewhere else and
> he loved C and Python. So Python was a gift in disguise. I think Python
> appeals to those who are very familiar with CLI and shell programming (Not
> GUI fan). As some members alluded to there are more people around with
> Python knowledge. Most managers choose Python as the unifying development
> tool because they feel comfortable with it. Frankly I have not seen a
> manager who feels at home with Scala. So in summary it is a bit
> disappointing to abandon Scala and switch to Python just for the sake of it.
>
> Disclaimer: These are opinions and not facts so to speak :)
>
> Cheers,
>
>
> Mich
>
>
>
>
>
>
> On Fri, 9 Oct 2020 at 21:56, Mich Talebzadeh 
> wrote:
>
>> I have come across occasions when the teams use Python with Spark for
>> ETL, for example processing data from S3 buckets into Snowflake with Spark.
>>
>> The only reason I think they are choosing Python as opposed to Scala is
>> because they are more familiar with Python. The fact that Spark is written
>> in Scala is itself an indication of why I think Scala has an edge.
>>
>> I have not done one to one comparison of Spark with Scala vs Spark with
>> Python. I understand for data science purposes most libraries like
>> TensorFlow etc. are written in Python but I am at loss to understand the
>> validity of using Python with Spark for ETL purposes.
>>
>> These are my understanding but they are not facts so I would like to get
>> some informed views on this if I can?
>>
>> Many thanks,
>>
>> Mich
>>
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>


Re: Kotlin Spark API

2020-07-14 Thread Stephen Boesch
I just looked at the examples.
https://github.com/JetBrains/kotlin-spark-api/tree/master/examples/src/main/kotlin/org/jetbrains/spark/api/examples
These look v nice!  V concise yet flexible.  I like the ability to do
inline *side-effects.  *E.g. caching or printing or showDs()

package org.jetbrains.spark.api.examples
import org.apache.spark.sql.Row
import org.jetbrains.spark.api.*

fun main() {
withSpark {
val sd = dsOf(1, 2, 3)
sd.createOrReplaceTempView("ds")
spark.sql("select * from ds")
.withCached {
println("asList: ${toList()}")
println("asArray: ${toArray().contentToString()}")
this
}
.to()
.withCached {
println("typed collect: " + (collect() as
Array).contentToString())
println("type collectAsList: " + collectAsList())
}

dsOf(1, 2, 3)
.map { c(it, it + 1, it + 2) }
.to()
.select("_1")
.collectAsList()
.forEach { println(it) }
}
}


So that shows some of the niceness of Kotlin: the intuitive type conversion
via `to` and construction via `dsOf(...)` - and also the inlining of the side
effects. Overall concise and pleasant to read.


On Tue, 14 Jul 2020 at 12:18, Stephen Boesch  wrote:

> I started with scala/spark in 2012 and scala has been my go-to language
> for six years. But I heartily applaud this direction. Kotlin is more like a
> simplified Scala - with the benefits that brings - than a simplified java.
> I particularly like the simplified / streamlined collections classes.
>
> Really looking forward to this development.
>
> On Tue, 14 Jul 2020 at 10:42, Maria Khalusova  wrote:
>
>> Hi folks,
>>
>> We would love your feedback on the new Kotlin Spark API that we are
>> working on: https://github.com/JetBrains/kotlin-spark-api.
>>
>> Why Kotlin Spark API? Kotlin developers can already use Kotlin with the
>> existing Apache Spark Java API, however they cannot take full advantage of
>> Kotlin language features. With Kotlin Spark API, you can use Kotlin data
>> classes and lambda expressions.
>>
>> The API also adds some helpful extension functions. For example, you can
>> use `withCached` to perform arbitrary transformations on a Dataset and not
>> worry about the Dataset unpersisting at the end.
>>
>> If you like Kotlin and would like to try the API, we've prepared a Quick
>> Start Guide to help you set up all the needed dependencies in no time using
>> either Maven or Gradle:
>> https://github.com/JetBrains/kotlin-spark-api/blob/master/docs/quick-start-guide.md
>>
>> In the repo, you’ll also find a few code examples to get an idea of what
>> the API looks like:
>> https://github.com/JetBrains/kotlin-spark-api/tree/master/examples/src/main/kotlin/org/jetbrains/spark/api/examples
>>
>> We’d love to see your feedback in the project’s GitHub issues:
>> https://github.com/JetBrains/kotlin-spark-api/issues.
>>
>>
>> Thanks!
>>
>>
>>


Re: Kotlin Spark API

2020-07-14 Thread Stephen Boesch
I started with scala/spark in 2012 and scala has been my go-to language for
six years. But I heartily applaud this direction. Kotlin is more like a
simplified Scala - with the benefits that brings - than a simplified java.
I particularly like the simplified / streamlined collections classes.

Really looking forward to this development.

On Tue, 14 Jul 2020 at 10:42, Maria Khalusova  wrote:

> Hi folks,
>
> We would love your feedback on the new Kotlin Spark API that we are
> working on: https://github.com/JetBrains/kotlin-spark-api.
>
> Why Kotlin Spark API? Kotlin developers can already use Kotlin with the
> existing Apache Spark Java API, however they cannot take full advantage of
> Kotlin language features. With Kotlin Spark API, you can use Kotlin data
> classes and lambda expressions.
>
> The API also adds some helpful extension functions. For example, you can
> use `withCached` to perform arbitrary transformations on a Dataset and not
> worry about the Dataset unpersisting at the end.
>
> If you like Kotlin and would like to try the API, we've prepared a Quick
> Start Guide to help you set up all the needed dependencies in no time using
> either Maven or Gradle:
> https://github.com/JetBrains/kotlin-spark-api/blob/master/docs/quick-start-guide.md
>
> In the repo, you’ll also find a few code examples to get an idea of what
> the API looks like:
> https://github.com/JetBrains/kotlin-spark-api/tree/master/examples/src/main/kotlin/org/jetbrains/spark/api/examples
>
> We’d love to see your feedback in the project’s GitHub issues:
> https://github.com/JetBrains/kotlin-spark-api/issues.
>
>
> Thanks!
>
>
>


Re: RDD-like API for entirely local workflows?

2020-07-04 Thread Stephen Boesch
Spark in local mode (which is different from standalone) is a solution for
many use cases. I use it in conjunction with (and sometimes instead of)
pandas/pandasql due to its much wider ETL-related capabilities. On the JVM
side it is an even more obvious choice - given there is no equivalent to
pandas and it has even better performance.

It is also a strong candidate due to the expressiveness of the SQL dialect,
including support for analytical/windowing functions. There is a latency
hit: on the order of a couple of seconds to start the SparkContext - but
pandas is not a high-performance tool in any case.

I see that OpenRefine is implemented in Java, so Spark local should be
a very good complement to it.
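
A minimal sketch of the local-mode setup described above (the CSV path is
hypothetical):

import org.apache.spark.sql.SparkSession

// local[*] runs the driver and executors inside one JVM with no cluster
// manager at all - unlike standalone mode, which still runs master/worker daemons.
val spark = SparkSession.builder()
  .appName("local-etl")
  .master("local[*]")
  .getOrCreate()

val df = spark.read.option("header", "true").csv("/tmp/input.csv")
df.createOrReplaceTempView("t")
spark.sql("select count(*) as n from t").show()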


On Sat, 4 Jul 2020 at 08:17, Antonin Delpeuch (lists) <
li...@antonin.delpeuch.eu> wrote:

> Hi,
>
> I am working on revamping the architecture of OpenRefine, an ETL tool,
> to execute workflows on datasets which do not fit in RAM.
>
> Spark's RDD API is a great fit for the tool's operations, and provides
> everything we need: partitioning and lazy evaluation.
>
> However, OpenRefine is a lightweight tool that runs locally, on the
> users' machine, and we want to preserve this use case. Running Spark in
> standalone mode works, but I have read at a couple of places that the
> standalone mode is only intended for development and testing. This is
> confirmed by my experience with it so far:
> - the overhead added by task serialization and scheduling is significant
> even in standalone mode. This makes sense for testing, since you want to
> test serialization as well, but to run Spark in production locally, we
> would need to bypass serialization, which is not possible as far as I know;
> - some bugs that manifest themselves only in local mode are not getting
> a lot of attention (https://issues.apache.org/jira/browse/SPARK-5300) so
> it seems dangerous to base a production system on standalone Spark.
>
> So, we cannot use Spark as default runner in the tool. Do you know any
> alternative which would be designed for local use? A library which would
> provide something similar to the RDD API, but for parallelization with
> threads in the same JVM, not machines in a cluster?
>
> If there is no such thing, it should not be too hard to write our
> homegrown implementation, which would basically be Java streams with
> partitioning. I have looked at Apache Beam's direct runner, but it is
> also designed for testing so does not fit our bill for the same reasons.
>
> We plan to offer a Spark-based runner in any case - but I do not think
> it can be used as the default runner.
>
> Cheers,
> Antonin
>
>
>
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Hey good looking toPandas ()

2020-06-19 Thread Stephen Boesch
AFAIK it has been there since Spark 2.0 in 2016. Not certain about Spark
1.5/1.6.

On Thu, 18 Jun 2020 at 23:56, Anwar AliKhan 
wrote:

> I first ran the  command
> df.show()
>
> For sanity check of my dataFrame.
>
> I wasn't impressed with the display.
>
> I then ran
> df.toPandas() in Jupyter Notebook.
>
> Now the display is really good looking .
>
> Is toPandas() a new function which became available in Spark 3.0 ?
>
>
>
>
>
>


Re: Modularising Spark/Scala program

2020-05-02 Thread Stephen Boesch
I neglected to include the rationale: the assumption is this will be a
repeatedly needed process, thus a reusable method would be helpful. The
predicate/input rules that are supported will need to be flexible enough to
cover the range of input data domains and use cases. For my workflows
the predicates are typically SQL expressions.
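
To illustrate what I mean by SQL predicates, a minimal sketch (column names
and the rule are made up for the example):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{expr, not}

val spark = SparkSession.builder().appName("rules").master("local[*]").getOrCreate()
import spark.implicits._

// The rule arrives as a plain SQL predicate string and is applied as a filter.
val df = Seq(("a", 10), ("b", -1)).toDF("id", "amount")
val rule = expr("amount >= 0")
val good = df.filter(rule)
val rejected = df.filter(not(rule))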

On Sat, 2 May 2020 at 06:13, Stephen Boesch  wrote:

> Hi Mich!
>I think you can combine the good/rejected into one method that
> internally:
>
>- Create good/rejected df's given an input df and input
>rules/predicates to apply to the df.
>- Create a third df containing the good rows and the rejected rows
>with the bad columns nulled out
>- Append/insert the two dfs into their respective hive good/exception
>tables
>- return value can be a tuple of the (goodDf,exceptionsDf,combinedDf)
>or maybe just the (combinedDf,exceptionsDf)
>
>
> On Sat, 2 May 2020 at 06:00, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>>
>> Hi,
>>
>> I have a Spark Scala program created and compiled with Maven. It works
>> fine. It basically does the following:
>>
>>
>>1. Reads an xml file from HDFS location
>>2. Creates a DF on top of what it reads
>>3. Creates a new DF with some columns renamed etc
>>4. Creates a new DF for rejected rows (incorrect value for a column)
>>5. Puts rejected data into Hive exception table
>>6. Puts valid rows into Hive main table
>>7. Nullifies the invalid rows by setting the invalid column to NULL
>>and puts the rows into the main Hive table
>>
>> These are currently performed in one method. Ideally I want to break this
>> down as follows:
>>
>>
>>1. A method to read the XML file and creates DF and a new DF on top
>>of previous DF
>>2. A method to create a DF on top of rejected rows using t
>>3. A method to put invalid rows into the exception table using tmp
>>table
>>4. A method to put the correct rows into the main table again using
>>tmp table
>>
>> I was wondering if this is correct approach?
>>
>> Thanks,
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>


Re: Modularising Spark/Scala program

2020-05-02 Thread Stephen Boesch
Hi Mich!
   I think you can combine the good/rejected into one method that
internally:

   - Create good/rejected df's given an input df and input rules/predicates
   to apply to the df.
   - Create a third df containing the good rows and the rejected rows with
   the bad columns nulled out
   - Append/insert the two dfs into their respective hive good/exception
   tables
   - return value can be a tuple of the (goodDf,exceptionsDf,combinedDf)
   or maybe just the (combinedDf,exceptionsDf)
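
A minimal Scala sketch of the combined method outlined above (the Hive
appends are left out; `rule` and `badCol` are illustrative names):

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{lit, not}

def splitAndCombine(input: DataFrame, rule: Column, badCol: String)
    : (DataFrame, DataFrame, DataFrame) = {
  val good     = input.filter(rule)
  val rejected = input.filter(not(rule))
  // Null out the offending column on the rejected rows, then union with the good rows.
  val combined = good.unionByName(
    rejected.withColumn(badCol, lit(null).cast(input.schema(badCol).dataType)))
  (good, rejected, combined)
}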


On Sat, 2 May 2020 at 06:00, Mich Talebzadeh <
mich.talebza...@gmail.com> wrote:

>
> Hi,
>
> I have a Spark Scala program created and compiled with Maven. It works
> fine. It basically does the following:
>
>
>1. Reads an xml file from HDFS location
>2. Creates a DF on top of what it reads
>3. Creates a new DF with some columns renamed etc
>4. Creates a new DF for rejected rows (incorrect value for a column)
>5. Puts rejected data into Hive exception table
>6. Puts valid rows into Hive main table
>7. Nullifies the invalid rows by setting the invalid column to NULL
>and puts the rows into the main Hive table
>
> These are currently performed in one method. Ideally I want to break this
> down as follows:
>
>
>1. A method to read the XML file and creates DF and a new DF on top of
>previous DF
>2. A method to create a DF on top of rejected rows using t
>3. A method to put invalid rows into the exception table using tmp
>table
>4. A method to put the correct rows into the main table again using
>tmp table
>
> I was wondering if this is correct approach?
>
> Thanks,
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


Re: Going it alone.

2020-04-16 Thread Stephen Boesch
The warning signs were there from the first email sent by that person. I
wonder if there is any way to deal with this more proactively.

On Thu, 16 Apr 2020 at 10:54, Mich Talebzadeh <
mich.talebza...@gmail.com> wrote:

> good for you. right move
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Thu, 16 Apr 2020 at 18:33, Sean Owen  wrote:
>
>> Absolutely unacceptable even if this were the only one. I'm contacting
>> INFRA right now.
>>
>> On Thu, Apr 16, 2020 at 11:57 AM Holden Karau 
>> wrote:
>>
>>> I want to be clear I believe the language in janethrope1s email is
>>> unacceptable for the mailing list and possibly a violation of the Apache
>>> code of conduct. I’m glad we don’t see messages like this often.
>>>
>>> I know this is a stressful time for many of us, but let’s try and do our
>>> best to not take it out on others.
>>>
>>> On Wed, Apr 15, 2020 at 11:46 PM Subash Prabakar <
>>> subashpraba...@gmail.com> wrote:
>>>
 Looks like he had a very bad appraisal this year.. Fun fact : the
 coming year would be too :)

 On Thu, 16 Apr 2020 at 12:07, Qi Kang  wrote:

> Well man, check your attitude, you’re way over the line
>
>
> On Apr 16, 2020, at 13:26, jane thorpe 
> wrote:
>
> F*U*C*K O*F*F
> C*U*N*T*S
>
>
> --
> On Thursday, 16 April 2020 Kelvin Qin  wrote:
>
> No wonder I said why I can't understand what the mail expresses, it
> feels like a joke……
>
>
>
>
>
> On 2020-04-16 02:28:49, seemanto.ba...@nomura.com.INVALID wrote:
>
> Have we been tricked by a bot ?
>
>
> *From:* Matt Smith 
> *Sent:* Wednesday, April 15, 2020 2:23 PM
> *To:* jane thorpe
> *Cc:* dh.lo...@gmail.com; user@spark.apache.org; janethor...@aol.com;
> em...@yeikel.com
> *Subject:* Re: Going it alone.
>
>
> *CAUTION EXTERNAL EMAIL:* DO NOT CLICK ON LINKS OR OPEN ATTACHMENTS
> THAT ARE UNEXPECTED OR SENT FROM UNKNOWN SENDERS. IF IN DOUBT REPORT TO
> SPAM SUBMISSIONS.
>
> This is so entertaining.
>
>
> 1. Ask for help
>
> 2. Compare those you need help from to a lower order primate.
>
> 3. Claim you provided information you did not
>
> 4. Explain that providing any information would be "too revealing"
>
> 5. ???
>
>
> Can't wait to hear what comes next, but please keep it up.  This is a
> bright spot in my day.
>
>
>
> On Tue, Apr 14, 2020 at 4:47 PM jane thorpe <
> janethor...@aol.com.invalid> wrote:
>
> I did write a long email in response to you.
> But then I deleted it because I felt it would be too revealing.
>
>
>
>
> --
>
> On Tuesday, 14 April 2020 David Hesson  wrote:
>
> I want to know  if Spark is headed in my direction.
>
> You are implying  Spark could be.
>
>
>
> What direction are you headed in, exactly? I don't feel as if anything
> were implied when you were asked for use cases or what problem you are
> solving. You were asked to identify some use cases, of which you don't
> appear to have any.
>
>
> On Tue, Apr 14, 2020 at 4:49 PM jane thorpe <
> janethor...@aol.com.invalid> wrote:
>
> That's what  I want to know,  Use Cases.
> I am looking for  direction as I described and I want to know  if
> Spark is headed in my direction.
>
> You are implying  Spark could be.
>
> So tell me about the USE CASES and I'll do the rest.
> --
>
> On Tuesday, 14 April 2020 yeikel valdes  wrote:
>
> It depends on your use case. What are you trying to solve?
>
>
>
>  On Tue, 14 Apr 2020 15:36:50 -0400 *janethor...@aol.com.INVALID
>  *wrote 
>
> Hi,
>
> I consider myself to be quite good in Software Development especially
> using frameworks.
>
> I like to get my hands  dirty. I have spent the last few months
> understanding modern frameworks and architectures.
>
> I am looking to invest my energy in a product where I don't have to
> relying on the monkeys which occupy this space  we call software
> development.
>
> I have found one that meets my requirements.
>
> Would Apache Spark be a good Tool for me or  do I 

Re: IDE suitable for Spark

2020-04-07 Thread Stephen Boesch
I have been using IntelliJ IDEA for both Scala/Spark and PySpark projects since
2013. It required a fair amount of fiddling that first year but has been
stable since early 2015. For PySpark-only projects, PyCharm naturally also
works very well.

On Tue, 7 Apr 2020 at 09:10, yeikel valdes  wrote:

>
> Zeppelin is not an IDE but a notebook.  It is helpful to experiment but it
> is missing a lot of the features that we expect from an IDE.
>
> Thanks for sharing though.
>
>  On Tue, 07 Apr 2020 04:45:33 -0400 * zahidr1...@gmail.com
>  * wrote 
>
> When I first logged on I asked if there was a suitable IDE for Spark.
> I did get a couple of responses.
> *Thanks.*
>
> I did actually find one which is a suitable IDE for Spark:
> *Apache Zeppelin.*
>
> One of many reasons it is suitable for Apache Spark is the
> *up and running stage*, which involves typing *bin/zeppelin-daemon.sh
> start*, going to the browser and typing *http://localhost:8080*.
> That's it!
>
> Then, to *hit the ground running*,
> there are also ready-to-go Apache Spark examples
> showing off the type of functionality one will be using in real-life
> production.
>
> Zeppelin comes with embedded Apache Spark and Scala as the default
> interpreter, among 20+ interpreters.
> I have gone on to discover there are a number of other advantages for a
> real-time production environment with Zeppelin, offered up by other Apache
> products.
>
> Backbutton.co.uk
> ¯\_(ツ)_/¯
> ♡۶Java♡۶RMI ♡۶
> Make Use Method {MUM}
> makeuse.org
> 
>
>
>


Re: script running in jupyter 6-7x faster than spark submit

2019-09-10 Thread Stephen Boesch
Ok. Can't think of why that would happen.

On Tue, 10 Sep 2019 at 20:26, Dhrubajyoti Hati <
dhruba.w...@gmail.com> wrote:

> As mentioned in the very first mail:
> * same cluster it is submitted.
> * from same machine they are submitted and also from same user
> * each of them has 128 executors and 2 cores per executor with 8Gigs of
> memory each and both of them are getting that while running
>
> to clarify more let me quote what I mentioned above. *These data is taken
> from Spark-UI when the jobs are almost finished in both.*
> "What i found is the  the quantile values for median for one ran with
> jupyter was 1.3 mins and one ran with spark-submit was ~8.5 mins." which
> means per task time taken is much higher in spark-submit script than
> jupyter script. This is where I am really puzzled because they are the
> exact same code. why running them two different ways vary so much in the
> execution time.
>
>
>
>
> *Regards,Dhrubajyoti Hati.Mob No: 9886428028/9652029028*
>
>
> On Wed, Sep 11, 2019 at 8:42 AM Stephen Boesch  wrote:
>
>> Sounds like you have done your homework to properly compare .   I'm
>> guessing the answer to the following is yes .. but in any case:  are they
>> both running against the same spark cluster with the same configuration
>> parameters especially executor memory and number of workers?
>>
>> On Tue, 10 Sep 2019 at 20:05, Dhrubajyoti Hati <
>> dhruba.w...@gmail.com> wrote:
>>
>>> No, i checked for that, hence written "brand new" jupyter notebook. Also
>>> the time taken by both are 30 mins and ~3hrs as i am reading a 500  gigs
>>> compressed base64 encoded text data from a hive table and decompressing and
>>> decoding in one of the udfs. Also the time compared is from Spark UI not
>>> how long the job actually takes after submission. Its just the running time
>>> i am comparing/mentioning.
>>>
>>> As mentioned earlier, all the spark conf params even match in two
>>> scripts and that's why i am puzzled what going on.
>>>
>>> On Wed, 11 Sep, 2019, 12:44 AM Patrick McCarthy, <
>>> pmccar...@dstillery.com> wrote:
>>>
>>>> It's not obvious from what you pasted, but perhaps the jupyter notebook
>>>> already is connected to a running spark context, while spark-submit needs
>>>> to get a new spot in the (YARN?) queue.
>>>>
>>>> I would check the cluster job IDs for both to ensure you're getting new
>>>> cluster tasks for each.
>>>>
>>>> On Tue, Sep 10, 2019 at 2:33 PM Dhrubajyoti Hati 
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am facing a weird behaviour while running a python script. Here is
>>>>> what the code looks like mostly:
>>>>>
>>>>> def fn1(ip):
>>>>>some code...
>>>>> ...
>>>>>
>>>>> def fn2(row):
>>>>> ...
>>>>> some operations
>>>>> ...
>>>>> return row1
>>>>>
>>>>>
>>>>> udf_fn1 = udf(fn1)
>>>>> cdf = spark.read.table("") //hive table is of size > 500 Gigs with
>>>>> ~4500 partitions
>>>>> ddf = cdf.withColumn("coly", udf_fn1(cdf.colz)) \
>>>>> .drop("colz") \
>>>>> .withColumnRenamed("colz", "coly")
>>>>>
>>>>> edf = ddf \
>>>>> .filter(ddf.colp == 'some_value') \
>>>>> .rdd.map(lambda row: fn2(row)) \
>>>>> .toDF()
>>>>>
>>>>> print edf.count() // simple way for the performance test in both
>>>>> platforms
>>>>>
>>>>> Now when I run the same code in a brand new jupyter notebook it runs
>>>>> 6x faster than when I run this python script using spark-submit. The
>>>>> configurations are printed and  compared from both the platforms and they
>>>>> are exact same. I even tried to run this script in a single cell of 
>>>>> jupyter
>>>>> notebook and still have the same performance. I need to understand if I am
>>>>> missing something in the spark-submit which is causing the issue.  I tried
>>>>> to minimise the script to reproduce the same error without much code.
>>>>>
>>>>> Both are run in client mode on a yarn based spark cluster. The
>>>>> machines from which both are executed are also the same and from same 
>>>>> user.
>>>>>
>>>>> What i found is the  the quantile values for median for one ran with
>>>>> jupyter was 1.3 mins and one ran with spark-submit was ~8.5 mins.  I am 
>>>>> not
>>>>> able to figure out why this is happening.
>>>>>
>>>>> Any one faced this kind of issue before or know how to resolve this?
>>>>>
>>>>> *Regards,*
>>>>> *Dhrub*
>>>>>
>>>>
>>>>
>>>> --
>>>>
>>>>
>>>> *Patrick McCarthy  *
>>>>
>>>> Senior Data Scientist, Machine Learning Engineering
>>>>
>>>> Dstillery
>>>>
>>>> 470 Park Ave South, 17th Floor, NYC 10016
>>>>
>>>


Re: script running in jupyter 6-7x faster than spark submit

2019-09-10 Thread Stephen Boesch
Sounds like you have done your homework to properly compare. I'm
guessing the answer to the following is yes, but in any case: are they
both running against the same Spark cluster with the same configuration
parameters, especially executor memory and number of workers?

On Tue, 10 Sep 2019 at 20:05, Dhrubajyoti Hati <
dhruba.w...@gmail.com> wrote:

> No, i checked for that, hence written "brand new" jupyter notebook. Also
> the time taken by both are 30 mins and ~3hrs as i am reading a 500  gigs
> compressed base64 encoded text data from a hive table and decompressing and
> decoding in one of the udfs. Also the time compared is from Spark UI not
> how long the job actually takes after submission. Its just the running time
> i am comparing/mentioning.
>
> As mentioned earlier, all the spark conf params even match in two scripts
> and that's why i am puzzled what going on.
>
> On Wed, 11 Sep, 2019, 12:44 AM Patrick McCarthy, 
> wrote:
>
>> It's not obvious from what you pasted, but perhaps the jupyter notebook
>> already is connected to a running spark context, while spark-submit needs
>> to get a new spot in the (YARN?) queue.
>>
>> I would check the cluster job IDs for both to ensure you're getting new
>> cluster tasks for each.
>>
>> On Tue, Sep 10, 2019 at 2:33 PM Dhrubajyoti Hati 
>> wrote:
>>
>>> Hi,
>>>
>>> I am facing a weird behaviour while running a python script. Here is
>>> what the code looks like mostly:
>>>
>>> def fn1(ip):
>>>some code...
>>> ...
>>>
>>> def fn2(row):
>>> ...
>>> some operations
>>> ...
>>> return row1
>>>
>>>
>>> udf_fn1 = udf(fn1)
>>> cdf = spark.read.table("") //hive table is of size > 500 Gigs with
>>> ~4500 partitions
>>> ddf = cdf.withColumn("coly", udf_fn1(cdf.colz)) \
>>> .drop("colz") \
>>> .withColumnRenamed("colz", "coly")
>>>
>>> edf = ddf \
>>> .filter(ddf.colp == 'some_value') \
>>> .rdd.map(lambda row: fn2(row)) \
>>> .toDF()
>>>
>>> print edf.count() // simple way for the performance test in both
>>> platforms
>>>
>>> Now when I run the same code in a brand new jupyter notebook it runs 6x
>>> faster than when I run this python script using spark-submit. The
>>> configurations are printed and  compared from both the platforms and they
>>> are exact same. I even tried to run this script in a single cell of jupyter
>>> notebook and still have the same performance. I need to understand if I am
>>> missing something in the spark-submit which is causing the issue.  I tried
>>> to minimise the script to reproduce the same error without much code.
>>>
>>> Both are run in client mode on a yarn based spark cluster. The machines
>>> from which both are executed are also the same and from same user.
>>>
>>> What i found is the  the quantile values for median for one ran with
>>> jupyter was 1.3 mins and one ran with spark-submit was ~8.5 mins.  I am not
>>> able to figure out why this is happening.
>>>
>>> Any one faced this kind of issue before or know how to resolve this?
>>>
>>> *Regards,*
>>> *Dhrub*
>>>
>>
>>
>> --
>>
>>
>> *Patrick McCarthy  *
>>
>> Senior Data Scientist, Machine Learning Engineering
>>
>> Dstillery
>>
>> 470 Park Ave South, 17th Floor, NYC 10016
>>
>


Re: Incremental (online) machine learning algorithms on ML

2019-08-05 Thread Stephen Boesch
There are several high bars to getting a new algorithm adopted.

*  It needs to be deemed by the MLlib committers/shepherds as widely useful
to the community. Algorithms offered by larger companies, after having
demonstrated usefulness at scale for use cases likely to be encountered
by many other companies, stand a better chance.
* There is quite limited bandwidth for consideration of new algorithms:
there has been a dearth of new ones accepted since early 2015. So
prioritization is a challenge.
* The code must demonstrate high quality standards, especially wrt
testability, maintainability, computational performance, and scalability.
* The chosen algorithms and options should be well documented and include
comparisons/tradeoffs with the state of the art described in relevant papers.
These questions will typically be asked during design/code reviews - i.e.
did you consider the approach shown *here*?
* There is also luck and timing involved. The review process might start in
a given month A, but reviewers become busy or higher priorities intervene -
and then when will the reviewing continue?
* At the point that the above are complete, there are still intricacies with
integrating with a particular Spark release.

On Mon, 5 Aug 2019 at 05:58, chagas  wrote:

> Hi,
>
> After searching the machine learning library for streaming algorithms, I
> found two that fit the criteria: Streaming Linear Regression
> (
> https://spark.apache.org/docs/latest/mllib-linear-methods.html#streaming-linear-regression)
>
> and Streaming K-Means
> (
> https://spark.apache.org/docs/latest/mllib-clustering.html#streaming-k-means
> ).
>
> However, both use the RDD-based API MLlib instead of the DataFrame-based
> API ML; are there any plans for bringing them both to ML?
>
> Also, is there any technical reason why there are so few incremental
> algorithms on the machine learning library? There's only 1 algorithm for
> regression and clustering each, with nothing for classification,
> dimensionality reduction or feature extraction.
>
> If there is a reason, how were those two algorithms implemented? If
> there isn't, what is the general consensus on adding new online machine
> learning algorithms?
>
> Regards,
> Lucas Chagas
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


How to execute non-timestamp-based aggregations in spark structured streaming?

2019-04-20 Thread Stephen Boesch
Consider the following *intended* sql:

select row_number()
  over (partition by Origin order by OnTimeDepPct desc) OnTimeDepRank,*
  from flights

This will *not* work in *structured streaming* : The culprit is:

 partition by Origin

The requirement is to use a timestamp-typed field such as

 partition by flightTime

Tathagata Das (core committer for *spark streaming*) replied on that in a
Nabble thread:

 The traditional SQL windows with `over` is not supported in streaming.
Only time-based windows, that is, `window("timestamp", "10 minutes")` is
supported in streaming

*What then* for my query above - which *must* be based on the *Origin* field?
What is the closest equivalent to that query? Or what would be a workaround
or different approach to achieve the same results?
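
For context, a minimal sketch of the closest shape that *is* supported - a
time-based window keyed by Origin with an aggregate per group (this does not
reproduce row_number() semantics, it only illustrates the supported pattern):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, max, window}

def bestDeparturesPerOrigin(flights: DataFrame): DataFrame =
  flights
    .withWatermark("flightTime", "1 hour")
    .groupBy(window(col("flightTime"), "10 minutes"), col("Origin"))
    .agg(max("OnTimeDepPct").as("bestOnTimeDepPct"))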


Re: spark-sklearn

2019-04-08 Thread Stephen Boesch
There are several suggestions on this SOF
https://stackoverflow.com/questions/38984775/spark-errorexpected-zero-arguments-for-construction-of-classdict-for-numpy-cor


You need to convert the final value to a python list. You implement the
function as follows:

import numpy as np

def uniq_array(col_array):
    x = np.unique(col_array)
    return list(x)

This is because Spark doesn't understand the numpy array format. In order
to feed a python object that Spark DataFrames understand as an ArrayType,
you need to convert the output to a python list before returning it.




The source of the problem is that the object returned from the UDF doesn't
conform to the declared type. np.unique not only returns numpy.ndarray but
also converts numerics to the corresponding NumPy types, which are not
compatible with the DataFrame API. You can try something like this:

udf(lambda x: list(set(x)), ArrayType(IntegerType()))

or this (to keep order)

udf(lambda xs: list(OrderedDict((x, None) for x in xs)),
ArrayType(IntegerType()))

instead.

If you really want np.unique you have to convert the output:

udf(lambda x: np.unique(x).tolist(), ArrayType(IntegerType()))

On Mon, 8 Apr 2019 at 11:43, Sudhir Babu Pothineni <
sbpothin...@gmail.com> wrote:

>
>
>
> Trying to run tests in spark-sklearn, anybody check the below exception
>
> pip freeze:
>
> nose==1.3.7
> numpy==1.16.1
> pandas==0.19.2
> python-dateutil==2.7.5
> pytz==2018.9
> scikit-learn==0.19.2
> scipy==1.2.0
> six==1.12.0
> spark-sklearn==0.3.0
>
> Spark version:
> spark-2.2.3-bin-hadoop2.6/bin/pyspark
>
>
> running into following exception:
>
> ==
> ERROR: test_scipy_sparse (spark_sklearn.converter_test.CSRVectorUDTTests)
> --
> Traceback (most recent call last):
>   File
> "/home/spothineni/Downloads/spark-sklearn-release-0.3.0/python/spark_sklearn/converter_test.py",
> line 83, in test_scipy_sparse
> self.assertEqual(df.count(), 1)
>   File
> "/home/spothineni/Downloads/spark-2.4.1-bin-hadoop2.6/python/pyspark/sql/dataframe.py",
> line 522, in count
> return int(self._jdf.count())
>   File
> "/home/spothineni/Downloads/spark-2.4.1-bin-hadoop2.6/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",
> line 1257, in __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File
> "/home/spothineni/Downloads/spark-2.4.1-bin-hadoop2.6/python/pyspark/sql/utils.py",
> line 63, in deco
> return f(*a, **kw)
>   File
> "/home/spothineni/Downloads/spark-2.4.1-bin-hadoop2.6/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py",
> line 328, in get_return_value
> format(target_id, ".", name), value)
> Py4JJavaError: An error occurred while calling o652.count.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task
> 11 in stage 0.0 failed 1 times, most recent failure: Lost task 11.0 in
> stage 0.0 (TID 11, localhost, executor driver):
> net.razorvine.pickle.PickleException: expected zero arguments for
> construction of ClassDict (for numpy.dtype)
> at
> net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
> at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707)
> at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175)
> at net.razorvine.pickle.Unpickler.load(Unpickler.java:99)
> at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112)
> at
> org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:188)
> at
> org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:187)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
> at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(Unknown
> Source)
> at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
> Source)
> at
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
> at
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
> at org.apache.spark.scheduler.Task.run(Task.scala:121)
> at
> 

Re: Build spark source code with scala 2.11

2019-03-12 Thread Stephen Boesch
You might have better luck downloading the 2.4.X branch

On Tue, 12 Mar 2019 at 16:39, swastik mittal  wrote:

> Then is the MLlib of Spark compatible with Scala 2.12? Or can I change the
> spark version from spark3.0 to 2.3 or 2.4 in local spark/master?
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Build spark source code with scala 2.11

2019-03-12 Thread Stephen Boesch
I think scala 2.11 support was removed with the spark3.0/master

On Tue, 12 Mar 2019 at 16:26, swastik mittal  wrote:

> I am trying to build my spark using build/sbt package, after changing the
> scala versions to 2.11 in pom.xml because my applications jar files use
> scala 2.11. But building the spark code gives an error in sql  saying "A
> method with a varargs annotation produces a forwarder method with the same
> signature (exprs:
> Array[org.apache.spark.sql.Column])org.apache.spark.sql.Column as an
> existing method." in UserDefinedFunction.scala. I even tried building with
> using Dscala parameter to change the version of scala but it gives the same
> error. How do I change the spark and scala version and build the spark
> source code correctly? Any help is appreciated.
>
> Thanks
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Classic logistic regression missing !!! (Generalized linear models)

2018-10-11 Thread Stephen Boesch
So the LogisticRegression with regParam and elasticNetParam set to 0 is not
what you are looking for?

https://spark.apache.org/docs/2.3.0/ml-classification-regression.html#logistic-regression

import org.apache.spark.ml.classification.LogisticRegression
val plainLR = new LogisticRegression()
  .setRegParam(0.0)        // no regularization penalty
  .setElasticNetParam(0.0)


On Thu, 11 Oct 2018 at 15:46, pikufolgado <
pikufolg...@gmail.com> wrote:

> Hi,
>
> I would like to carry out a classic logistic regression analysis. In other
> words, without using penalised regression ("glmnet" in R). I have read the
> documentation and am not able to find this kind of models.
>
> Is it possible to estimate this? In R the name of the function is "glm".
>
> Best regards
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Fixing NullType for parquet files

2018-09-12 Thread Stephen Boesch
When this JIRA was opened in 2015 the parquet format did not support null types.
I commented on this JIRA in May asking - given parquet now does include that
support - can this bug be reopened? There was no response. What is the
correct way to request consideration of re-opening this issue?

https://issues.apache.org/jira/browse/SPARK-10943

Michael Armbrust added a comment - 15/Oct/15 18:00

Yeah, parquet doesn't have a concept of null type. I'd probably suggest
they cast null to a type, CAST(NULL AS INT), if they really want to do this,
but really you should just omit the column probably.

Daniel Davis added a comment - 03/May/18 10:14

According to parquet data types
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md, now a
Null type should be supported. So perhaps this issue should be reconsidered?

Stephen Boesch added a comment - 03/May/18 17:08

Given the comment by Daniel Davis can this issue be reopened?
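
For reference, a minimal sketch of the cast workaround Michael Armbrust
describes above (column name and output path are hypothetical):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder().appName("nulltype").master("local[*]").getOrCreate()

// lit(null) on its own produces a NullType column, which the parquet writer rejects;
// casting gives the column a concrete type - the CAST(NULL AS INT) suggestion.
val df = spark.range(3).withColumn("empty_col", lit(null).cast("int"))
df.write.mode("overwrite").parquet("/tmp/nulltype_demo")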


Re: [announce] BeakerX supports Scala+Spark in Jupyter

2018-06-07 Thread Stephen Boesch
Assuming that the Spark 2.x kernel (e.g. Toree) were chosen for a given
Jupyter notebook and there is a Cell 3 that contains some Spark DataFrame
operations, then:


   - what is the relationship between the %%spark magic and the Toree kernel?
   - how does the %%spark magic get applied to that other Cell 3?

thanks!

2018-06-07 16:33 GMT-07:00 s...@draves.org :

> We are pleased to announce release 0.19.0 of BeakerX ,
> a collection of extensions and kernels for Jupyter and Jupyter Lab.
>
> BeakerX now features Scala+Spark integration including GUI configuration,
> status, progress, interrupt, and interactive tables.
>
> We are very interested in your feedback about what remains to be done.
> You may reach by github and gitter, as documented in the readme:
> https://github.com/twosigma/beakerx
>
> Thanks, -Scott
>
>
> --
> BeakerX.com
> ScottDraves.com 
> @Scott_Draves 
>
>


Re: Guava dependency issue

2018-05-08 Thread Stephen Boesch
I downgraded to spark 2.0.1 and it fixed that *particular* runtime
exception: but then a similar one appears when saving to parquet:

An  SOF question on this was created a month ago and today further details plus
an open bounty were added to it:

https://stackoverflow.com/questions/49713485/spark-error-with-google-guava-library-java-lang-nosuchmethoderror-com-google-c

The new but similar exception is shown below:

The hack of downgrading to 2.0.1 does help - i.e. execution proceeds *further* -
but then when writing out to *parquet* the above error does happen.

8/05/07 11:26:11 ERROR Executor: Exception in task 0.0 in stage 2741.0
(TID 2618)
java.lang.NoSuchMethodError:
com.google.common.cache.CacheBuilder.build(Lcom/google/common/cache/CacheLoader;)Lcom/google/common/cache/LoadingCache;
at org.apache.hadoop.io.compress.CodecPool.createCache(CodecPool.java:62)
at org.apache.hadoop.io.compress.CodecPool.(CodecPool.java:74)
at 
org.apache.parquet.hadoop.CodecFactory$BytesCompressor.(CodecFactory.java:92)
at 
org.apache.parquet.hadoop.CodecFactory.getCompressor(CodecFactory.java:169)
at 
org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:303)
at 
org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.(ParquetFileFormat.scala:562)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:139)
at 
org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:131)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:247)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:6



2018-05-07 10:30 GMT-07:00 Stephen Boesch <java...@gmail.com>:

> I am intermittently running into guava dependency issues across multiple
> spark projects.  I have tried maven shade / relocate but it does not
> resolve the issues.
>
> The current project is extremely simple: *no* additional dependencies
> beyond scala, spark, and scalatest - yet the issues remain (and yes mvn
> clean was re-applied).
>
> Is there a reliable approach to handling the versioning for guava within
> spark dependency projects?
>
>
> [INFO] 
> 
> [INFO] Building ccapps_final 1.0-SNAPSHOT
> [INFO] 
> 
> [INFO]
> [INFO] --- exec-maven-plugin:1.6.0:java (default-cli) @ ccapps_final ---
> 18/05/07 10:24:00 WARN NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
> [WARNING]
> java.lang.NoSuchMethodError: com.google.common.cache.CacheBuilder.
> refreshAfterWrite(JLjava/util/concurrent/TimeUnit;)Lcom/
> google/common/cache/CacheBuilder;
> at org.apache.hadoop.security.Groups.(Groups.java:96)
> at org.apache.hadoop.security.Groups.(Groups.java:73)
> at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(
> Groups.java:293)
> at org.apache.hadoop.security.UserGroupInformation.initialize(
> UserGroupInformation.java:283)
> at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(
> UserGroupInformation.java:260)
> at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(
> UserGroupInformation.java:789)
> at org.apache.hadoop.security.UserGroupInformation.getLoginUser(
> UserGroupInformation.java:774)
> at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(
> UserGroupInformation.java:647)
> at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.
> apply(Utils.scala:2424)
> at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.
> apply(Utils.scala:2424)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2424)
> at org.apache.spark.SparkContext.(SparkContext.scala:295)
> at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2516)
> at org.apache.spark.sql.SparkSession$Builder$$anonfun$
&g

Guava dependency issue

2018-05-07 Thread Stephen Boesch
I am intermittently running into guava dependency issues across multiple
spark projects.  I have tried maven shade / relocate but it does not
resolve the issues.

The current project is extremely simple: *no* additional dependencies
beyond scala, spark, and scalatest - yet the issues remain (and yes mvn
clean was re-applied).

Is there a reliable approach to handling the versioning for guava within
spark dependency projects?


[INFO]

[INFO] Building ccapps_final 1.0-SNAPSHOT
[INFO]

[INFO]
[INFO] --- exec-maven-plugin:1.6.0:java (default-cli) @ ccapps_final ---
18/05/07 10:24:00 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
[WARNING]
java.lang.NoSuchMethodError:
com.google.common.cache.CacheBuilder.refreshAfterWrite(JLjava/util/concurrent/TimeUnit;)Lcom/google/common/cache/CacheBuilder;
at org.apache.hadoop.security.Groups.(Groups.java:96)
at org.apache.hadoop.security.Groups.(Groups.java:73)
at
org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:293)
at
org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:283)
at
org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:260)
at
org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:789)
at
org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:774)
at
org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:647)
at
org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2424)
at
org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2424)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2424)
at org.apache.spark.SparkContext.(SparkContext.scala:295)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2516)
at
org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:918)
at
org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:910)
at scala.Option.getOrElse(Option.scala:121)
at
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:910)
at luisvictorsteve.SlackSentiment$.getSpark(SlackSentiment.scala:10)
at luisvictorsteve.SlackSentiment$.main(SlackSentiment.scala:16)
at luisvictorsteve.SlackSentiment.main(SlackSentiment.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:282)
at java.lang.Thread.run(Thread.java:748)
[INFO]



Re: [Spark 2.x Core] .collect() size limit

2018-04-28 Thread Stephen Boesch
While it is certainly possible to use virtual memory, I have seen in a number
of places warnings that collect() results must be able to fit in memory. I'm
not sure if that applies to *all* spark calculations: but at the very least
each of the specific collect()'s that are performed would need to be
verified.

And maybe *all* collects do require sufficient memory - would you like to
check the source code to see if there are disk-backed collects actually
happening for some cases?
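
As an aside (not a claim about disk-backed collects in the source): when the
result does not have to sit on the driver all at once, toLocalIterator is a
lower-memory alternative worth checking - a minimal sketch:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("collect-demo").master("local[*]").getOrCreate()
val rdd = spark.sparkContext.parallelize(1 to 1000000)

// collect() materializes the whole RDD in driver memory at once:
// val everything = rdd.collect()

// toLocalIterator streams one partition at a time to the driver, so peak driver
// memory is roughly one partition's worth rather than the whole dataset.
val total = rdd.toLocalIterator.foldLeft(0L)(_ + _)
println(total)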

2018-04-28 9:48 GMT-07:00 Deepak Goel <deic...@gmail.com>:

> There is something as *virtual memory*
>
> On Sat, 28 Apr 2018, 21:19 Stephen Boesch, <java...@gmail.com> wrote:
>
>> Do you have a machine with  terabytes of RAM?  afaik collect() requires
>> RAM - so that would be your limiting factor.
>>
>> 2018-04-28 8:41 GMT-07:00 klrmowse <klrmo...@gmail.com>:
>>
>>> i am currently trying to find a workaround for the Spark application i am
>>> working on so that it does not have to use .collect()
>>>
>>> but, for now, it is going to have to use .collect()
>>>
>>> what is the size limit (memory for the driver) of RDD file that
>>> .collect()
>>> can work with?
>>>
>>> i've been scouring google-search - S.O., blogs, etc, and everyone is
>>> cautioning about .collect(), but does not specify how huge is huge...
>>> are we
>>> talking about a few gigabytes? terabytes?? petabytes???
>>>
>>>
>>>
>>> thank you
>>>
>>>
>>>
>>> --
>>> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>
>>


Re: [Spark 2.x Core] .collect() size limit

2018-04-28 Thread Stephen Boesch
Do you have a machine with  terabytes of RAM?  afaik collect() requires RAM
- so that would be your limiting factor.

2018-04-28 8:41 GMT-07:00 klrmowse :

> i am currently trying to find a workaround for the Spark application i am
> working on so that it does not have to use .collect()
>
> but, for now, it is going to have to use .collect()
>
> what is the size limit (memory for the driver) of RDD file that .collect()
> can work with?
>
> i've been scouring google-search - S.O., blogs, etc, and everyone is
> cautioning about .collect(), but does not specify how huge is huge... are
> we
> talking about a few gigabytes? terabytes?? petabytes???
>
>
>
> thank you
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Has there been any explanation on the performance degradation between spark.ml and Mllib?

2018-01-21 Thread Stephen Boesch
While MLlib performed favorably vs Flink, it *also* performed favorably vs
spark.ml - and by an *order of magnitude*. The following is one of the
tables - it is for Logistic Regression. At that time spark.ml did not yet
support SVM.

From: https://bdataanalytics.biomedcentral.com/articles/10.1186/s41044-016-0020-2

Table 3: LR learning time in seconds

Dataset        Spark MLlib    Spark ML    Flink
ECBDL14-10               3          26      181
ECBDL14-30               5          63      815
ECBDL14-50               6         173     1314
ECBDL14-75               8         260     1878
ECBDL14-100             12         415     2566

The DataFrame-based API (spark.ml) is even slower vs the RDD-based one (mllib)
than had been anticipated - yet the latter has been shut down for several
versions of Spark already. What is the thought process behind that
decision? *Performance matters!* Is there visibility into a meaningful
narrowing of that gap?


Re: Anyone know where to find independent contractors in New York?

2017-12-21 Thread Stephen Boesch
Hi Richard, this is not a jobs board: please only discuss spark application
development issues.

2017-12-21 8:34 GMT-08:00 Richard L. Burton III :

> I'm trying to locate four independent contractors who have experience with
> Spark. I'm not sure where I can go to find experienced Spark consultants.
>
> Please, no recruiters.
> --
> -Richard L. Burton III
>
>


Re: LDA and evaluating topic number

2017-12-07 Thread Stephen Boesch
I have been testing on the 20 NewsGroups dataset - which the Spark docs
themselves reference.  I can confirm that perplexity increases and
likelihood decreases as topics increase - and am similarly confused by
these results.
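
For anyone following along, a minimal sketch of the evaluation loop under
discussion (Scala, spark.ml; `docs` is an assumed DataFrame with a "features"
column of term-count vectors):

import org.apache.spark.ml.clustering.LDA

val ks = Seq(5, 10, 20, 40)
val scores = ks.map { k =>
  val model = new LDA().setK(k).setMaxIter(50).fit(docs)
  // both metrics are evaluated on the same DataFrame (ideally a held-out split)
  (k, model.logLikelihood(docs), model.logPerplexity(docs))
}
scores.foreach { case (k, ll, pp) =>
  println(f"k=$k%3d  logLikelihood=$ll%.1f  logPerplexity=$pp%.3f")
}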

2017-09-28 10:50 GMT-07:00 Cody Buntain :

> Hi, all!
>
> Is there an example somewhere on using LDA’s logPerplexity()/logLikelihood()
> functions to evaluate topic counts? The existing MLLib LDA examples show
> calling them, but I can’t find any documentation about how to interpret the
> outputs. Graphing the outputs for logs of perplexity and likelihood aren’t
> consistent with what I expected (perplexity increases and likelihood
> decreases as topics increase, which seem odd to me).
>
> An example of what I’m doing is here: http://www.cs.umd.edu/~
> cbuntain/FindTopicK-pyspark-regex.html
>
> Thanks very much in advance! If I can figure this out, I can post example
> code online, so others can see how this process is done.
>
> -Best regards,
> Cody
> _
> Cody Buntain, PhD
> Postdoc, @UMD_CS
> Intelligence Community Postdoctoral Fellow
> cbunt...@cs.umd.edu
> www.cs.umd.edu/~cbuntain
>
>


Weight column values not used in Binary Logistic Regression Summary

2017-11-18 Thread Stephen Boesch
In BinaryLogisticRegressionSummary there are @Since("1.5.0") tags on a
number of comments identical to the following:

* @note This ignores instance weights (setting all to 1.0) from
`LogisticRegression.weightCol`.
* This will change in later Spark versions.


Are there any plans to address this? Our team is using instance weights
with sklearn LogisticRegression - and this limitation will complicate a
potential migration.


https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L1543
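
For context, a rough sketch of the pattern in question (Scala, spark.ml 2.x;
the `train` DataFrame and its "weight" column are assumptions for illustration):

import org.apache.spark.ml.classification.{BinaryLogisticRegressionSummary, LogisticRegression}

val lr = new LogisticRegression()
  .setWeightCol("weight")   // the weights ARE used when fitting the coefficients
  .setMaxIter(100)
val model = lr.fit(train)

// ...but, per the @note above, the summary metrics ignore them (every weight treated as 1.0)
val binarySummary = model.summary.asInstanceOf[BinaryLogisticRegressionSummary]
println(binarySummary.areaUnderROC)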


Re: Spark streaming for CEP

2017-10-24 Thread Stephen Boesch
Hi Mich, the github link has a brief intro - including a link to the formal
docs http://logisland.readthedocs.io/en/latest/index.html .   They have an
architectural overview, developer guide, tutorial, and pretty comprehensive
api docs.

2017-10-24 13:31 GMT-07:00 Mich Talebzadeh :

> thanks Thomas.
>
> do you have a summary write-up for this tool please?
>
>
> regards,
>
>
>
>
> Thomas
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 24 October 2017 at 13:53, Thomas Bailet 
> wrote:
>
>> Hi
>>
>> we (@ hurence) have released on open source middleware based on
>> SparkStreaming over Kafka to do CEP and log mining, called *logisland*  (
>> https://github.com/Hurence/logisland/) it has been deployed into
>> production for 2 years now and does a great job. You should have a look.
>>
>>
>> bye
>>
>> Thomas Bailet
>>
>> CTO : hurence
>>
>> Le 18/10/17 à 22:05, Mich Talebzadeh a écrit :
>>
>> As you may be aware the granularity that Spark streaming has is
>> micro-batching and that is limited to 0.5 second. So if you have continuous
>> ingestion of data then Spark streaming may not be granular enough for CEP.
>> You may consider other products.
>>
>> Worth looking at this old thread on mine "Spark support for Complex Event
>> Processing (CEP)
>>
>> https://mail-archives.apache.org/mod_mbox/spark-user/201604.
>> mbox/%3CCAJ3fcbB8eaf0JV84bA7XGUK5GajC1yGT3ZgTNCi8arJg56=LbQ@
>> mail.gmail.com%3E
>>
>> HTH
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 18 October 2017 at 20:52, anna stax  wrote:
>>
>>> Hello all,
>>>
>>> Has anyone used spark streaming for CEP (Complex Event processing).  Any
>>> CEP libraries that works well with spark. I have a use case for CEP and
>>> trying to see if spark streaming is a good fit.
>>>
>>> Currently we have a data pipeline using Kafka, Spark streaming and
>>> Cassandra for data ingestion and near real time dashboard.
>>>
>>> Please share your experience.
>>> Thanks much.
>>> -Anna
>>>
>>>
>>>
>>
>>
>


Re: Is there a difference between df.cache() vs df.rdd.cache()

2017-10-13 Thread Stephen Boesch
@Vadim   Would it be true to say the `.rdd` *may* be creating a new job -
depending on whether the DataFrame/DataSet had already been materialized
via an action or checkpoint?   If the only prior operations on the
DataFrame had been transformations then the dataframe would still not have
been calculated.  In that case would it also be true that a subsequent
action/checkpoint on the DataFrame (not the rdd) would then generate a
separate job?
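
To make the distinction concrete, a small sketch (Scala; `spark` is an assumed
SparkSession):

val df = spark.range(0, 1000000L).toDF("id").filter("id % 2 = 0")

df.cache()        // marks the DataFrame's logical plan for caching
df.count()        // action: materializes the cache; later DataFrame actions reuse it

val rdd = df.rdd  // produces a *new* RDD (InternalRow -> Row conversion), per the reply below
rdd.cache()       // caches only that converted RDD; DataFrame queries will not use it
rdd.count()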

2017-10-13 14:50 GMT-07:00 Vadim Semenov :

> When you do `Dataset.rdd` you actually create a new job
>
> here you can see what it does internally:
> https://github.com/apache/spark/blob/master/sql/core/src/
> main/scala/org/apache/spark/sql/Dataset.scala#L2816-L2828
>
>
>
> On Fri, Oct 13, 2017 at 5:24 PM, Supun Nakandala <
> supun.nakand...@gmail.com> wrote:
>
>> Hi Weichen,
>>
>> Thank you for the reply.
>>
>> My understanding was Dataframe API is using the old RDD implementation
>> under the covers though it presents a different API. And calling
>> df.rdd will simply give access to the underlying RDD. Is this assumption
>> wrong? I would appreciate if you can shed more insights on this issue or
>> point me to documentation where I can learn them.
>>
>> Thank you in advance.
>>
>> On Fri, Oct 13, 2017 at 3:19 AM, Weichen Xu 
>> wrote:
>>
>>> You should use `df.cache()`
>>> `df.rdd.cache()` won't work, because `df.rdd` generate a new RDD from
>>> the original `df`. and then cache the new RDD.
>>>
>>> On Fri, Oct 13, 2017 at 3:35 PM, Supun Nakandala <
>>> supun.nakand...@gmail.com> wrote:
>>>
 Hi all,

 I have been experimenting with cache/persist/unpersist methods with
 respect to both Dataframes and RDD APIs. However, I am experiencing
 different behaviors with the Dataframe API compared to the RDD API, such as Dataframes not
 getting cached when count() is called.

 Is there a difference between how these operations act wrt to Dataframe
 and RDD APIs?

 Thank You.
 -Supun

>>>
>>>
>>
>


Re: Running spark examples in Intellij

2017-10-11 Thread Stephen Boesch
Thinking more carefully on your comment:

   - There may be some ambiguity as to whether the repo-provided libraries
   are actually being used here - as you indicate - instead of the in-project
   classes. That would have to do with how the classpath inside IJ was
   constructed.   When I click through any of the spark classes in the IDE
   they go directly to the in-project versions and not the repo jars: but
   that may not be definitive
   - In any case I had already performed the maven install and just
   verified that the jars do have the correct timestamps in the local maven
   repo
   - The local maven repo is included by default - so there should be no need
   to do anything special there

The same errors from the original post continue to occur.


2017-10-11 20:05 GMT-07:00 Stephen Boesch <java...@gmail.com>:

> A clarification here: the example is being run *from the Spark codebase*.
> Therefore the mvn install step would not be required as the classes are
> available directly within the project.
>
> The reason for needing the `mvn package` to be invoked is to pick up the
> changes of having updated the spark dependency scopes from *provided *to
> *compile*.
>
> As mentioned the spark unit tests are working (and within Intellij and
> without `mvn install`): only the examples are not.
>
> 2017-10-11 19:43 GMT-07:00 Paul <corley.paul1...@gmail.com>:
>
>> You say you did the maven package but did you do a maven install and
>> define your local maven repo in SBT?
>>
>> -Paul
>>
>> Sent from my iPhone
>>
>> On Oct 11, 2017, at 5:48 PM, Stephen Boesch <java...@gmail.com> wrote:
>>
>> When attempting to run any example program w/ Intellij I am running into
>> guava versioning issues:
>>
>> Exception in thread "main" java.lang.NoClassDefFoundError:
>> com/google/common/cache/CacheLoader
>> at org.apache.spark.SparkConf.loadFromSystemProperties(SparkCon
>> f.scala:73)
>> at org.apache.spark.SparkConf.(SparkConf.scala:68)
>> at org.apache.spark.SparkConf.(SparkConf.scala:55)
>> at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(S
>> parkSession.scala:919)
>> at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(S
>> parkSession.scala:918)
>> at scala.Option.getOrElse(Option.scala:121)
>> at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkS
>> ession.scala:918)
>> at org.apache.spark.examples.ml.KMeansExample$.main(KMeansExamp
>> le.scala:40)
>> at org.apache.spark.examples.ml.KMeansExample.main(KMeansExample.scala)
>> Caused by: java.lang.ClassNotFoundException:
>> com.google.common.cache.CacheLoader
>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>> ... 9 more
>>
>> The *scope*s for the spark dependencies were already changed from
>> *provided* to *compile* .  Both `sbt assembly` and `mvn package` had
>> already been run (successfully) from command line - and the (mvn) project
>> completely rebuilt inside intellij.
>>
>> The spark testcases run fine: this is a problem only in the examples
>> module.  Anyone running these successfully in IJ?  I have tried for
>> 2.1.0-SNAPSHOT and 2.3.0-SNAPSHOT - with the same outcome.
>>
>>
>>
>


Re: Running spark examples in Intellij

2017-10-11 Thread Stephen Boesch
A clarification here: the example is being run *from the Spark codebase*.
Therefore the mvn install step would not be required as the classes are
available directly within the project.

The reason for needing the `mvn package` to be invoked is to pick up the
changes of having updated the spark dependency scopes from *provided *to
*compile*.

As mentioned the spark unit tests are working (and within Intellij and
without `mvn install`): only the examples are not.

2017-10-11 19:43 GMT-07:00 Paul <corley.paul1...@gmail.com>:

> You say you did the maven package but did you do a maven install and
> define your local maven repo in SBT?
>
> -Paul
>
> Sent from my iPhone
>
> On Oct 11, 2017, at 5:48 PM, Stephen Boesch <java...@gmail.com> wrote:
>
> When attempting to run any example program w/ Intellij I am running into
> guava versioning issues:
>
> Exception in thread "main" java.lang.NoClassDefFoundError:
> com/google/common/cache/CacheLoader
> at org.apache.spark.SparkConf.loadFromSystemProperties(SparkConf.scala:73)
> at org.apache.spark.SparkConf.(SparkConf.scala:68)
> at org.apache.spark.SparkConf.(SparkConf.scala:55)
> at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(
> SparkSession.scala:919)
> at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(
> SparkSession.scala:918)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.sql.SparkSession$Builder.getOrCreate(
> SparkSession.scala:918)
> at org.apache.spark.examples.ml.KMeansExample$.main(KMeansExamp
> le.scala:40)
> at org.apache.spark.examples.ml.KMeansExample.main(KMeansExample.scala)
> Caused by: java.lang.ClassNotFoundException:
> com.google.common.cache.CacheLoader
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> ... 9 more
>
> The *scope*s for the spark dependencies were already changed from
> *provided* to *compile* .  Both `sbt assembly` and `mvn package` had
> already been run (successfully) from command line - and the (mvn) project
> completely rebuilt inside intellij.
>
> The spark testcases run fine: this is a problem only in the examples
> module.  Anyone running these successfully in IJ?  I have tried for
> 2.1.0-SNAPSHOT and 2.3.0-SNAPSHOT - with the same outcome.
>
>
>


Running spark examples in Intellij

2017-10-11 Thread Stephen Boesch
When attempting to run any example program w/ Intellij I am running into
guava versioning issues:

Exception in thread "main" java.lang.NoClassDefFoundError:
com/google/common/cache/CacheLoader
at org.apache.spark.SparkConf.loadFromSystemProperties(SparkConf.scala:73)
at org.apache.spark.SparkConf.(SparkConf.scala:68)
at org.apache.spark.SparkConf.(SparkConf.scala:55)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$
7.apply(SparkSession.scala:919)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$
7.apply(SparkSession.scala:918)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.
scala:918)
at org.apache.spark.examples.ml.KMeansExample$.main(KMeansExample.scala:40)
at org.apache.spark.examples.ml.KMeansExample.main(KMeansExample.scala)
Caused by: java.lang.ClassNotFoundException: com.google.common.cache.
CacheLoader
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 9 more

The *scope*s for the spark dependencies were already changed from
*provided* to *compile* .  Both `sbt assembly` and `mvn package` had
already been run (successfully) from command line - and the (mvn) project
completely rebuilt inside intellij.

The spark testcases run fine: this is a problem only in the examples
module.  Anyone running these successfully in IJ?  I have tried for
2.1.0-SNAPSHOT and 2.3.0-SNAPSHOT - with the same outcome.


Re: SQL specific documentation for recent Spark releases

2017-08-10 Thread Stephen Boesch
The correct link is
https://docs.databricks.com/spark/latest/spark-sql/index.html .

This link does have the core syntax such as the BNF for the DDL and DML and
SELECT.  It does *not* have a reference for the date / string / numeric
functions: is there any such reference at this point?  It is not sufficient
to peruse the DSL's list of functions, since the usage - and sometimes the
names - differ when the functions are used in SQL.

thanks
stephenb

2017-08-10 14:49 GMT-07:00 Jules Damji <dmat...@comcast.net>:

> I refer to docs.databricks.com/Spark/latest/Spark-sql/index.html.
>
> Cheers
> Jules
>
> Sent from my iPhone
> Pardon the dumb thumb typos :)
>
> > On Aug 10, 2017, at 1:46 PM, Stephen Boesch <java...@gmail.com> wrote:
> >
> >
> > While the DataFrame/DataSets are useful in many circumstances they are
> cumbersome for many types of complex sql queries.
> >
> > Is there an up to date *SQL* reference - i.e. not DataFrame DSL
> operations - for version 2.2?
> >
> > An example of what is not clear:  what constructs are supported within
> >
> > select count( predicate) from some_table
> >
> > when using spark sql.
> >
> > But in general the reference guide and programming guide for SQL seems
> to be difficult to locate - seemingly in favor of the DataFrame/DataSets.
>
>


SQL specific documentation for recent Spark releases

2017-08-10 Thread Stephen Boesch
While the DataFrame/DataSets are useful in many circumstances, they are
cumbersome for many types of complex SQL queries.

Is there an up to date *SQL* reference - i.e. not DataFrame DSL operations
- for version 2.2?

An example of what is not clear:  what constructs are supported within

select count( predicate) from some_table

when using spark sql.

But in general the reference guide and programming guide for SQL seem to
be difficult to locate - seemingly de-emphasized in favor of the DataFrame/DataSets.


Re: What is the equivalent of mapPartitions in SparkSQL?

2017-06-25 Thread Stephen Boesch
Spark SQL did not support explicit partitioners even before Tungsten, and
often enough this did hurt performance.  Even now Tungsten will not do the
best job every time, so the question from the OP is still germane.
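
As an aside, the typed Dataset API (though not pure SQL) does expose
mapPartitions; a minimal sketch (Scala; `spark` and `ds: Dataset[Row]` are
assumed):

import spark.implicits._   // supplies the Encoder[Int] needed below

// count the rows in each partition - arbitrary per-partition logic could go here instead
val perPartitionCounts = ds.mapPartitions(rows => Iterator(rows.size))
perPartitionCounts.show()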

2017-06-25 19:18 GMT-07:00 Ryan :

> Why would you like to do so? I think there's no need for us to explicitly
> ask for a forEachPartition in spark sql because tungsten is smart enough to
> figure out whether a sql operation could be applied on each partition or
> there has to be a shuffle.
>
> On Sun, Jun 25, 2017 at 11:32 PM, jeff saremi 
> wrote:
>
>> You can do a map() using a select and functions/UDFs. But how do you
>> process a partition using SQL?
>>
>>
>>
>


Re: Using SparkContext in Executors

2017-05-28 Thread Stephen Boesch
You would need to use *native* Cassandra APIs in each Executor -
not org.apache.spark.sql.cassandra.CassandraSQLContext -
including creating a separate Cassandra connection on each Executor.
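
A rough sketch of that pattern, under several assumptions (DataStax Java
driver 3.x on the executor classpath, an illustrative contact point and
`ks.dumps` table, and `parse()` being the poster's own function):

import com.datastax.driver.core.Cluster

dumpFilesRDD.foreachPartition { paths =>
  // one native connection per partition, created on the executor - no SparkContext involved
  val cluster = Cluster.builder().addContactPoint("cassandra-host").build()
  val session = cluster.connect()
  val insert  = session.prepare("INSERT INTO ks.dumps (path, parsed) VALUES (?, ?)")
  paths.foreach(path => session.execute(insert.bind(path, parse(path))))
  session.close()
  cluster.close()
}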

2017-05-28 15:47 GMT-07:00 Abdulfattah Safa :

> So I can't run SQL queries in Executors ?
>
> On Sun, May 28, 2017 at 11:00 PM Mark Hamstra 
> wrote:
>
>> You can't do that. SparkContext and SparkSession can exist only on the
>> Driver.
>>
>> On Sun, May 28, 2017 at 6:56 AM, Abdulfattah Safa 
>> wrote:
>>
>>> How can I use SparkContext (to create Spark Session or Cassandra
>>> Sessions) in executors?
>>> If I pass it as parameter to the foreach or foreachpartition, then it
>>> will have a null value.
>>> Shall I create a new SparkContext in each executor?
>>>
>>> Here is what I'm trying to do:
>>> Read a dump directory with millions of dump files as follows:
>>>
>>> dumpFiles = Directory.listFiles(dumpDirectory)
>>> dumpFilesRDD = sparkContext.parallize(dumpFiles, numOfSlices)
>>> dumpFilesRDD.foreachPartition(dumpFilePath->parse(dumpFilePath))
>>> .
>>> .
>>> .
>>>
>>> In parse(), each dump file is parsed and inserted into database using
>>> SparlSQL. In order to do that, SparkContext is needed in the function parse
>>> to use the sql() method.
>>>
>>
>>


Re: Jupyter spark Scala notebooks

2017-05-17 Thread Stephen Boesch
Jupyter with Toree works well for my team.  Jupyter is considerably more
refined than Zeppelin as far as notebook features and usability: shortcuts,
editing, etc.  The caveat is that it is better to run a separate server
instance for python/pyspark vs scala/spark

2017-05-17 19:27 GMT-07:00 Richard Moorhead :

> Take a look at Apache Zeppelin; it has both python and scala interpreters.
> https://zeppelin.apache.org/
> Apache Zeppelin 
> zeppelin.apache.org
> Apache Zeppelin. A web-based notebook that enables interactive data
> analytics. You can make beautiful data-driven, interactive and
> collaborative documents with SQL ...
>
>
>
>
> . . . . . . . . . . . . . . . . . . . . . . . . . . .
>
> Richard Moorhead
> Software Engineer
> richard.moorh...@c2fo.com 
>
> C2FO: The World's Market for Working Capital®
>
>
> 
>     
> 
> 
> 
>
> The information contained in this message and any attachment may be
> privileged, confidential, and protected from disclosure. If you are not the
> intended recipient, or an employee, or agent responsible for delivering
> this message to the intended recipient, you are hereby notified that any
> dissemination, distribution, or copying of this communication is strictly
> prohibited. If you have received this communication in error, please notify
> us immediately by replying to the message and deleting from your computer.
>
> --
> *From:* upendra 1991 
> *Sent:* Wednesday, May 17, 2017 9:22:14 PM
> *To:* user@spark.apache.org
> *Subject:* Jupyter spark Scala notebooks
>
> What's the best way to use jupyter with Scala spark. I tried Apache toree
> and created a kernel but did not get it working. I believe there is a
> better way.
>
> Please suggest any best practices.
>
> Sent from Yahoo Mail on Android
> 
>


pyspark in intellij

2017-02-25 Thread Stephen Boesch
Anyone have this working - either in 1.X or 2.X?

thanks


Re: Avalanche of warnings trying to read Spark 1.6.X Parquet into Spark 2.X

2017-02-18 Thread Stephen Boesch
For now I have added to the log4j.properties:

log4j.logger.org.apache.parquet=ERROR

2017-02-18 11:50 GMT-08:00 Stephen Boesch <java...@gmail.com>:

> The following JIRA mentions that a fix made to read parquet 1.6.2 into 2.X  
> STILL leaves an "avalanche" of warnings:
>
>
> https://issues.apache.org/jira/browse/SPARK-17993
>
> Here is the text inside one of the last comments before it was merged:
>
>   I have built the code from the PR and it indeed succeeds reading the data.
>   I have tried doing df.count() and now I'm swarmed with warnings like this 
> (they are just keep getting printed endlessly in the terminal):
>   16/08/11 12:18:51 WARN CorruptStatistics: Ignoring statistics because 
> created_by could not be parsed (see PARQUET-251): parquet-mr version 1.6.0
>   org.apache.parquet.VersionParser$VersionParseException: Could not parse 
> created_by: parquet-mr version 1.6.0 using format: (.+) version ((.*) 
> )?\(build ?(.*)\)
> at org.apache.parquet.VersionParser.parse(VersionParser.java:112)
> at 
> org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptStatistics.java:60)
> at 
> org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:263)
> at 
> org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:567)
> at 
> org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:544)
> at 
> org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:431)
> at 
> org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:386)
> at 
> org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:107)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:109)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:369)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:343)
> at
>
> I am running 2.1.0 release and there are multitudes of these warnings. Is 
> there any way - short of changing logging level to ERROR - to suppress these?
>
>
>


Avalanche of warnings trying to read Spark 1.6.X Parquet into Spark 2.X

2017-02-18 Thread Stephen Boesch
The following JIRA mentions that a fix made to read parquet 1.6.2 into
2.X  STILL leaves an "avalanche" of warnings:


https://issues.apache.org/jira/browse/SPARK-17993

Here is the text inside one of the last comments before it was merged:

  I have built the code from the PR and it indeed succeeds reading the data.
  I have tried doing df.count() and now I'm swarmed with warnings like
this (they are just keep getting printed endlessly in the terminal):
  16/08/11 12:18:51 WARN CorruptStatistics: Ignoring statistics
because created_by could not be parsed (see PARQUET-251): parquet-mr
version 1.6.0
  org.apache.parquet.VersionParser$VersionParseException: Could not
parse created_by: parquet-mr version 1.6.0 using format: (.+) version
((.*) )?\(build ?(.*)\)
at org.apache.parquet.VersionParser.parse(VersionParser.java:112)
at 
org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptStatistics.java:60)
at 
org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:263)
at 
org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:567)
at 
org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:544)
at 
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:431)
at 
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:386)
at 
org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:107)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:109)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:369)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:343)
at

I am running 2.1.0 release and there are multitudes of these warnings.
Is there any way - short of changing logging level to ERROR - to
suppress these?


Re: Spark/Mesos with GPU support

2016-12-30 Thread Stephen Boesch
Would it be possible to share that communication?  I am interested in this
thread.

2016-12-30 11:02 GMT-08:00 Ji Yan :

> Thanks Michael, Tim and I have touched base and thankfully the issue has
> already been resolved
>
> On Fri, Dec 30, 2016 at 9:20 AM, Michael Gummelt 
> wrote:
>
>> I've cc'd Tim and Kevin, who worked on GPU support.
>>
>> On Wed, Dec 28, 2016 at 11:22 AM, Ji Yan  wrote:
>>
>>> Dear Spark Users,
>>>
>>> Has anyone had successful experience running Spark on Mesos with GPU
>>> support? We have a Mesos cluster that can see and offer nvidia GPU
>>> resources. With Spark, it seems that the GPU support with Mesos (
>>> https://github.com/apache/spark/pull/14644) has only recently been
>>> merged into Spark Master which is not found in 2.0.2 release yet. We have a
>>> custom built Spark from 2.1-rc5 which is confirmed to have the above
>>> change. However when we try to run any code from Spark on this Mesos setup,
>>> the spark program hangs and keeps saying
>>>
>>> “WARN TaskSchedulerImpl: Initial job has not accepted any resources;
>>> check your cluster UI to ensure that workers are registered and have
>>> sufficient resources”
>>>
>>> We are pretty sure that the cluster has enough resources as there is
>>> nothing running on it. If we disable the GPU support in configuration and
>>> restart mesos and retry the same program, it would work.
>>>
>>> Any comment/advice on this greatly appreciated
>>>
>>> Thanks,
>>> Ji
>>>
>>>
>>> The information in this email is confidential and may be legally
>>> privileged. It is intended solely for the addressee. Access to this email
>>> by anyone else is unauthorized. If you are not the intended recipient, any
>>> disclosure, copying, distribution or any action taken or omitted to be
>>> taken in reliance on it, is prohibited and may be unlawful.
>>>
>>
>>
>>
>> --
>> Michael Gummelt
>> Software Engineer
>> Mesosphere
>>
>
>
> The information in this email is confidential and may be legally
> privileged. It is intended solely for the addressee. Access to this email
> by anyone else is unauthorized. If you are not the intended recipient, any
> disclosure, copying, distribution or any action taken or omitted to be
> taken in reliance on it, is prohibited and may be unlawful.
>


Re: Invalid log directory running pyspark job

2016-11-23 Thread Stephen Boesch
This problem appears to be a regression on HEAD/master:  when running
against 2.0.2 the pyspark job completes successfully including running
predictions.

2016-11-23 19:36 GMT-08:00 Stephen Boesch <java...@gmail.com>:

>
> For a pyspark job with 54 executors all of the task outputs have a single
> line in both the stderr and stdout similar to:
>
> Error: invalid log directory 
> /shared/sparkmaven/work/app-20161119222540-/0/
>
>
> Note: the directory /shared/sparkmaven/work exists and is owned by the
> same user running the job. There are plenty of other app-*** subdirectories
> that do have contents in the stdout/stderr files.
>
>
> $ls -lrta  /shared/sparkmaven/work
> total 0
> drwxr-xr-x  59 steve  staff  2006 Nov 23 05:01 ..
> drwxr-xr-x  41 steve  staff  1394 Nov 23 18:22 app-20161123050122-0002
> drwxr-xr-x   6 steve  staff   204 Nov 23 18:22 app-20161123182031-0005
> drwxr-xr-x   6 steve  staff   204 Nov 23 18:44 app-20161123184349-0006
> drwxr-xr-x   6 steve  staff   204 Nov 23 18:46 app-20161123184613-0007
> drwxr-xr-x   3 steve  staff   102 Nov 23 19:20 app-20161123192048-0008
>
>
>
> Here is a sample of the contents
>
> /shared/sparkmaven/work/app-20161123184613-0007/2:
> total 16
> -rw-r--r--  1 steve  staff 0 Nov 23 18:46 stdout
> drwxr-xr-x  4 steve  staff   136 Nov 23 18:46 .
> -rw-r--r--  1 steve  staff  4830 Nov 23 18:46 stderr
> drwxr-xr-x  6 steve  staff   204 Nov 23 18:46 ..
>
> /shared/sparkmaven/work/app-20161123184613-0007/3:
> total 16
> -rw-r--r--  1 steve  staff 0 Nov 23 18:46 stdout
> drwxr-xr-x  6 steve  staff   204 Nov 23 18:46 ..
> drwxr-xr-x  4 steve  staff   136 Nov 23 18:46 .
> -rw-r--r--  1 steve  staff  4830 Nov 23 18:46 stderr
>
>
> Note also:  the *SparkPi* program does run successfully - which validates
> the basic spark installation/functionality.
>
>


Invalid log directory running pyspark job

2016-11-23 Thread Stephen Boesch
For a pyspark job with 54 executors all of the task outputs have a single
line in both the stderr and stdout similar to:

Error: invalid log directory /shared/sparkmaven/work/app-20161119222540-/0/


Note: the directory /shared/sparkmaven/work exists and is owned by the same
user running the job. There are plenty of other app-*** subdirectories that
do have contents in the stdout/stderr files.


$ls -lrta  /shared/sparkmaven/work
total 0
drwxr-xr-x  59 steve  staff  2006 Nov 23 05:01 ..
drwxr-xr-x  41 steve  staff  1394 Nov 23 18:22 app-20161123050122-0002
drwxr-xr-x   6 steve  staff   204 Nov 23 18:22 app-20161123182031-0005
drwxr-xr-x   6 steve  staff   204 Nov 23 18:44 app-20161123184349-0006
drwxr-xr-x   6 steve  staff   204 Nov 23 18:46 app-20161123184613-0007
drwxr-xr-x   3 steve  staff   102 Nov 23 19:20 app-20161123192048-0008



Here is a sample of the contents

/shared/sparkmaven/work/app-20161123184613-0007/2:
total 16
-rw-r--r--  1 steve  staff 0 Nov 23 18:46 stdout
drwxr-xr-x  4 steve  staff   136 Nov 23 18:46 .
-rw-r--r--  1 steve  staff  4830 Nov 23 18:46 stderr
drwxr-xr-x  6 steve  staff   204 Nov 23 18:46 ..

/shared/sparkmaven/work/app-20161123184613-0007/3:
total 16
-rw-r--r--  1 steve  staff 0 Nov 23 18:46 stdout
drwxr-xr-x  6 steve  staff   204 Nov 23 18:46 ..
drwxr-xr-x  4 steve  staff   136 Nov 23 18:46 .
-rw-r--r--  1 steve  staff  4830 Nov 23 18:46 stderr


Note also:  the *SparkPi* program does run successfully - which validates
the basic spark installation/functionality.


Re: HPC with Spark? Simultaneous, parallel one to one mapping of partition to vcore

2016-11-19 Thread Stephen Boesch
While "apparently" saturating the N available workers using your proposed N
partitions - the "actual" distribution of workers to tasks is controlled by
the scheduler.  If my past experience were of service - you can *not *trust
the default Fair Scheduler to ensure the round-robin scheduling of the
tasks: you may well end up with tasks being queued.

The suggestion is to try it out on the resource manager and scheduler being
used for your deployment. You may need to swap out their default scheduler
for a true round robin.

2016-11-19 16:44 GMT-08:00 Adam Smith :

> Dear community,
>
> I have a RDD with N rows and N partitions. I want to ensure that the
> partitions run all at the some time, by setting the number of vcores
> (spark-yarn) to N. The partitions need to talk to each other with some
> socket based sync that is why I need them to run more or less
> simultaneously.
>
> Let's assume no node will die. Will my setup guarantee that all partitions
> are computed in parallel?
>
> I know this is somehow hackish. Is there a better way doing so?
>
> My goal is replicate message passing (like OpenMPI) with spark, where I
> have very specific and final communcation requirements. So no need for the
> many comm and sync funtionality, just what I already have - sync and talk.
>
> Thanks!
> Adam
>
>


Spark-packages

2016-11-06 Thread Stephen Boesch
What is the state of the spark-packages project(s) ?  When running a query
for machine learning algorithms the results are not encouraging.


https://spark-packages.org/?q=tags%3A%22Machine%20Learning%22

There are 62 packages. Only a few have actual releases - and even fewer with
dates in the past twelve months.

There are several from DataBricks among the chosen few that have recent
releases.

Here is one that actually seems to be in reasonable shape: the DB port of
Stanford coreNLP.

https://github.com/databricks/spark-corenlp

But .. one or two solid packages .. ?

It seems the spark-packages approach has not picked up steam.  Are there
other suggested places to look for algorithms not included in mllib itself?


Re: Use BLAS object for matrix operation

2016-11-03 Thread Stephen Boesch
It is private. You will need to put your code in that same package or
create an accessor to it living within that package

private[spark]


2016-11-03 16:04 GMT-07:00 Yanwei Zhang :

> I would like to use some matrix operations in the BLAS object defined in
> ml.linalg. But for some reason, spark shell complains it cannot locate this
> object. I have constructed an example below to illustrate the issue. Please
> advise how to fix this. Thanks .
>
>
>
> import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors}
>
>
> val a = Vectors.dense(1.0, 2.0)
> val b = Vectors.dense(2.0, 3.0)
> BLAS.dot(a, b)
>
>
> :42: error: not found: value BLAS
>
>
>


Re: Aggregation Calculation

2016-11-03 Thread Stephen Boesch
You would likely want to create inline views that perform the filtering
*before* performing the cubes/rollups; in this way the cubes/rollups only
operate on the pruned rows/columns.
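
A minimal sketch of that shape in the DataFrame API (the df/r1/r2/m names are
assumptions standing in for the row columns and the measure):

import org.apache.spark.sql.functions._

// prune rows and columns first (the "inline view"), then roll up only what remains
val pruned = df.select("r1", "r2", "m").where(col("r1").isNotNull)
val result = pruned.rollup("r1", "r2").agg(sum("m").as("total"))
result.show()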

2016-11-03 11:29 GMT-07:00 Andrés Ivaldi :

> Hello, I need to perform some aggregations and a kind of Cube/RollUp
> calculation
>
> Doing some test looks like Cube and RollUp performs aggregation over all
> posible columns combination, but I just need some specific columns
> combination.
>
> What I'm trying to do is like a dataTable where te first N columns are may
> rows and the second M values are my columns and the last columna are the
> aggregated values, like Dimension / Measures
>
> I need all the values of the N and M columns and the ones that correspond
> to the aggregation function. I'll never need the values that previous
> column has no value, ie
>
> having N=2 so two columns as rows I'll need
> R1 | R2  
> ##  |  ## 
> ##  |   null 
>
> but not
> null | ## 
>
> as roll up does, same approach to M columns
>
>
> So the question is what could be the better way to perform this
> calculation.
> Using rollUp/Cube give me a lot of values that I dont need
> Using groupBy give me less information ( I could do several groupBy but
> that is not performant, I think )
> Is any other way to something like that?
>
> Thanks.
>
>
>
>
>
> --
> Ing. Ivaldi Andres
>


Re: Logging trait in Spark 2.0

2016-06-28 Thread Stephen Boesch
I also did not understand why the Logging class was made private in Spark
2.0.  In a couple of projects including CaffeOnSpark the Logging class was
simply copied to the new project to allow for backwards compatibility.
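
A minimal sketch of logging via slf4j directly, as suggested in the reply
quoted below (the class is an illustrative stand-in for a custom receiver):

import org.slf4j.LoggerFactory

class MyCustomReceiver {   // stand-in for your Receiver subclass
  @transient private lazy val log = LoggerFactory.getLogger(getClass)

  def onRecord(record: String): Unit = {
    log.info(s"received record of length ${record.length}")
  }
}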

2016-06-28 18:10 GMT-07:00 Michael Armbrust :

> I'd suggest using the slf4j APIs directly.  They provide a nice stable API
> that works with a variety of logging backends.  This is what Spark does
> internally.
>
> On Sun, Jun 26, 2016 at 4:02 AM, Paolo Patierno 
> wrote:
>
>> Yes ... the same here ... I'd like to know the best way for adding
>> logging in a custom receiver for Spark Streaming 2.0
>>
>> *Paolo Patierno*
>>
>> *Senior Software Engineer (IoT) @ Red Hat**Microsoft MVP on **Windows
>> Embedded & IoT*
>> *Microsoft Azure Advisor*
>>
>> Twitter : @ppatierno 
>> Linkedin : paolopatierno 
>> Blog : DevExperience 
>>
>>
>> --
>> From: jonathaka...@gmail.com
>> Date: Fri, 24 Jun 2016 20:56:40 +
>> Subject: Re: Logging trait in Spark 2.0
>> To: yuzhih...@gmail.com; ppatie...@live.com
>> CC: user@spark.apache.org
>>
>>
>> Ted, how is that thread related to Paolo's question?
>>
>> On Fri, Jun 24, 2016 at 1:50 PM Ted Yu  wrote:
>>
>> See this related thread:
>>
>>
>> http://search-hadoop.com/m/q3RTtEor1vYWbsW=RE+Configuring+Log4J+Spark+1+5+on+EMR+4+1+
>>
>> On Fri, Jun 24, 2016 at 6:07 AM, Paolo Patierno 
>> wrote:
>>
>> Hi,
>>
>> developing a Spark Streaming custom receiver I noticed that the Logging
>> trait isn't accessible anymore in Spark 2.0.
>>
>> trait Logging in package internal cannot be accessed in package
>> org.apache.spark.internal
>>
>> For developing a custom receiver what is the preferred way for logging ?
>> Just using log4j dependency as any other Java/Scala library/application ?
>>
>> Thanks,
>> Paolo
>>
>> *Paolo Patierno*
>>
>> *Senior Software Engineer (IoT) @ Red Hat**Microsoft MVP on **Windows
>> Embedded & IoT*
>> *Microsoft Azure Advisor*
>>
>> Twitter : @ppatierno 
>> Linkedin : paolopatierno 
>> Blog : DevExperience 
>>
>>
>>
>


Custom Optimizer

2016-06-23 Thread Stephen Boesch
My team has a custom optimization routine that we would like to plug in as a
replacement for the default LBFGS / OWLQN used by some of the ml/mllib
algorithms.

However it seems the choice of optimizer is hard-coded in every algorithm
except LDA: and even in that one it is only a choice between the internally
defined Online or batch version.

Any suggestions on how we might be able to incorporate our own optimizer?
Or do we need to roll all of our algorithms from top to bottom - basically
side stepping ml/mllib?

thanks
stephen


Re: Building Spark 2.X in Intellij

2016-06-23 Thread Stephen Boesch
I just checked out completely fresh directory and created new IJ project.
Then followed your tip for adding the avro source.

Here is an additional set of errors

Error:(31, 12) object razorvine is not a member of package net
import net.razorvine.pickle.{IObjectPickler, Opcodes, Pickler}
   ^
Error:(779, 49) not found: type IObjectPickler
  class PythonMessageAndMetadataPickler extends IObjectPickler {
^
Error:(783, 7) not found: value Pickler
  Pickler.registerCustomPickler(classOf[PythonMessageAndMetadata], this)
  ^
Error:(784, 7) not found: value Pickler
  Pickler.registerCustomPickler(this.getClass, this)
  ^
Error:(787, 57) not found: type Pickler
def pickle(obj: Object, out: OutputStream, pickler: Pickler) {
^
Error:(789, 19) not found: value Opcodes
out.write(Opcodes.GLOBAL)
  ^
Error:(794, 19) not found: value Opcodes
out.write(Opcodes.MARK)
  ^
Error:(800, 19) not found: value Opcodes
out.write(Opcodes.TUPLE)
  ^
Error:(801, 19) not found: value Opcodes
out.write(Opcodes.REDUCE)
  ^

2016-06-22 23:49 GMT-07:00 Stephen Boesch <java...@gmail.com>:

> Thanks Jeff - I remember that now from long time ago.  After making that
> change the next errors are:
>
> Error:scalac: missing or invalid dependency detected while loading class
> file 'RDDOperationScope.class'.
> Could not access term fasterxml in package com,
> because it (or its dependencies) are missing. Check your build definition
> for
> missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see
> the problematic classpath.)
> A full rebuild may help if 'RDDOperationScope.class' was compiled against
> an incompatible version of com.
> Error:scalac: missing or invalid dependency detected while loading class
> file 'RDDOperationScope.class'.
> Could not access term jackson in value com.fasterxml,
> because it (or its dependencies) are missing. Check your build definition
> for
> missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see
> the problematic classpath.)
> A full rebuild may help if 'RDDOperationScope.class' was compiled against
> an incompatible version of com.fasterxml.
> Error:scalac: missing or invalid dependency detected while loading class
> file 'RDDOperationScope.class'.
> Could not access term annotation in value com.jackson,
> because it (or its dependencies) are missing. Check your build definition
> for
> missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see
> the problematic classpath.)
> A full rebuild may help if 'RDDOperationScope.class' was compiled against
> an incompatible version of com.jackson.
> Error:scalac: missing or invalid dependency detected while loading class
> file 'RDDOperationScope.class'.
> Could not access term JsonInclude in value com.annotation,
> because it (or its dependencies) are missing. Check your build definition
> for
> missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see
> the problematic classpath.)
> A full rebuild may help if 'RDDOperationScope.class' was compiled against
> an incompatible version of com.annotation.
> Warning:scalac: Class org.jboss.netty.channel.ChannelFactory not found -
> continuing with a stub.
>
>
> 2016-06-22 23:39 GMT-07:00 Jeff Zhang <zjf...@gmail.com>:
>
>> You need to
>> spark/external/flume-sink/target/scala-2.11/src_managed/main/compiled_avro
>> under build path, this is the only thing you need to do manually if I
>> remember correctly.
>>
>>
>>
>> On Thu, Jun 23, 2016 at 2:30 PM, Stephen Boesch <java...@gmail.com>
>> wrote:
>>
>>> Hi Jeff,
>>>   I'd like to understand what may be different. I have rebuilt and
>>> reimported many times.  Just now I blew away the .idea/* and *.iml to start
>>> from scratch.  I just opened the $SPARK_HOME directory from intellij File |
>>> Open  .  After it finished the initial import I tried to run one of the
>>> Examples - and it fails in the build:
>>>
>>> Here are the errors I see:
>>>
>>> Error:(45, 66) not found: type SparkFlumeProtocol
>>>   val transactionTimeout: Int, val backOffInterval: Int) extends
>>> SparkFlumeProtocol with Logging {
>>>  ^
>>> Error:(70, 39) not found: type EventBatch
>>>   override def getEventBatch(n: Int): EventBatch = {
>>>   ^
>>> Error:(85, 13) not found: type EventBatch
>>> new EventBatch("Spark sink has be

Re: Building Spark 2.X in Intellij

2016-06-23 Thread Stephen Boesch
Thanks Jeff - I remember that now from a long time ago.  After making that
change, the next errors are:

Error:scalac: missing or invalid dependency detected while loading class
file 'RDDOperationScope.class'.
Could not access term fasterxml in package com,
because it (or its dependencies) are missing. Check your build definition
for
missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see
the problematic classpath.)
A full rebuild may help if 'RDDOperationScope.class' was compiled against
an incompatible version of com.
Error:scalac: missing or invalid dependency detected while loading class
file 'RDDOperationScope.class'.
Could not access term jackson in value com.fasterxml,
because it (or its dependencies) are missing. Check your build definition
for
missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see
the problematic classpath.)
A full rebuild may help if 'RDDOperationScope.class' was compiled against
an incompatible version of com.fasterxml.
Error:scalac: missing or invalid dependency detected while loading class
file 'RDDOperationScope.class'.
Could not access term annotation in value com.jackson,
because it (or its dependencies) are missing. Check your build definition
for
missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see
the problematic classpath.)
A full rebuild may help if 'RDDOperationScope.class' was compiled against
an incompatible version of com.jackson.
Error:scalac: missing or invalid dependency detected while loading class
file 'RDDOperationScope.class'.
Could not access term JsonInclude in value com.annotation,
because it (or its dependencies) are missing. Check your build definition
for
missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see
the problematic classpath.)
A full rebuild may help if 'RDDOperationScope.class' was compiled against
an incompatible version of com.annotation.
Warning:scalac: Class org.jboss.netty.channel.ChannelFactory not found -
continuing with a stub.


2016-06-22 23:39 GMT-07:00 Jeff Zhang <zjf...@gmail.com>:

> You need to
> spark/external/flume-sink/target/scala-2.11/src_managed/main/compiled_avro
> under build path, this is the only thing you need to do manually if I
> remember correctly.
>
>
>
> On Thu, Jun 23, 2016 at 2:30 PM, Stephen Boesch <java...@gmail.com> wrote:
>
>> Hi Jeff,
>>   I'd like to understand what may be different. I have rebuilt and
>> reimported many times.  Just now I blew away the .idea/* and *.iml to start
>> from scratch.  I just opened the $SPARK_HOME directory from intellij File |
>> Open  .  After it finished the initial import I tried to run one of the
>> Examples - and it fails in the build:
>>
>> Here are the errors I see:
>>
>> Error:(45, 66) not found: type SparkFlumeProtocol
>>   val transactionTimeout: Int, val backOffInterval: Int) extends
>> SparkFlumeProtocol with Logging {
>>  ^
>> Error:(70, 39) not found: type EventBatch
>>   override def getEventBatch(n: Int): EventBatch = {
>>   ^
>> Error:(85, 13) not found: type EventBatch
>> new EventBatch("Spark sink has been stopped!", "",
>> java.util.Collections.emptyList())
>> ^
>>
>> /git/spark/external/flume-sink/src/main/scala/org/apache/spark/streaming/flume/sink/TransactionProcessor.scala
>> Error:(80, 22) not found: type EventBatch
>>   def getEventBatch: EventBatch = {
>>  ^
>> Error:(48, 37) not found: type EventBatch
>>   @volatile private var eventBatch: EventBatch = new EventBatch("Unknown
>> Error", "",
>> ^
>> Error:(48, 54) not found: type EventBatch
>>   @volatile private var eventBatch: EventBatch = new EventBatch("Unknown
>> Error", "",
>>  ^
>> Error:(115, 41) not found: type SparkSinkEvent
>> val events = new util.ArrayList[SparkSinkEvent](maxBatchSize)
>> ^
>> Error:(146, 28) not found: type EventBatch
>>   eventBatch = new EventBatch("", seqNum, events)
>>^
>>
>> /git/spark/external/flume-sink/src/main/scala/org/apache/spark/streaming/flume/sink/SparkSinkUtils.scala
>> Error:(25, 27) not found: type EventBatch
>>   def isErrorBatch(batch: EventBatch): Boolean = {
>>   ^
>>
>> /git/spark/external/flume-sink/src/main/scala/org/apache/spark/streaming/flume/sink/SparkSink.scala
>> Error:(86, 51) not found: type SparkFlumeProtocol
>> val responder

Re: Building Spark 2.X in Intellij

2016-06-23 Thread Stephen Boesch
Hi Jeff,
  I'd like to understand what may be different. I have rebuilt and
reimported many times.  Just now I blew away the .idea/* and *.iml to start
from scratch.  I just opened the $SPARK_HOME directory from intellij File |
Open  .  After it finished the initial import I tried to run one of the
Examples - and it fails in the build:

Here are the errors I see:

Error:(45, 66) not found: type SparkFlumeProtocol
  val transactionTimeout: Int, val backOffInterval: Int) extends
SparkFlumeProtocol with Logging {
 ^
Error:(70, 39) not found: type EventBatch
  override def getEventBatch(n: Int): EventBatch = {
  ^
Error:(85, 13) not found: type EventBatch
new EventBatch("Spark sink has been stopped!", "",
java.util.Collections.emptyList())
^
/git/spark/external/flume-sink/src/main/scala/org/apache/spark/streaming/flume/sink/TransactionProcessor.scala
Error:(80, 22) not found: type EventBatch
  def getEventBatch: EventBatch = {
 ^
Error:(48, 37) not found: type EventBatch
  @volatile private var eventBatch: EventBatch = new EventBatch("Unknown
Error", "",
^
Error:(48, 54) not found: type EventBatch
  @volatile private var eventBatch: EventBatch = new EventBatch("Unknown
Error", "",
 ^
Error:(115, 41) not found: type SparkSinkEvent
val events = new util.ArrayList[SparkSinkEvent](maxBatchSize)
^
Error:(146, 28) not found: type EventBatch
  eventBatch = new EventBatch("", seqNum, events)
   ^
/git/spark/external/flume-sink/src/main/scala/org/apache/spark/streaming/flume/sink/SparkSinkUtils.scala
Error:(25, 27) not found: type EventBatch
  def isErrorBatch(batch: EventBatch): Boolean = {
  ^
/git/spark/external/flume-sink/src/main/scala/org/apache/spark/streaming/flume/sink/SparkSink.scala
Error:(86, 51) not found: type SparkFlumeProtocol
val responder = new SpecificResponder(classOf[SparkFlumeProtocol],
handler.get)
  ^


Note: this is just the first batch of errors.




2016-06-22 20:50 GMT-07:00 Jeff Zhang <zjf...@gmail.com>:

> It works well with me. You can try reimport it into intellij.
>
> On Thu, Jun 23, 2016 at 10:25 AM, Stephen Boesch <java...@gmail.com>
> wrote:
>
>>
>> Building inside intellij is an ever moving target. Anyone have the
>> magical procedures to get it going for 2.X?
>>
>> There are numerous library references that - although included in the
>> pom.xml build - are for some reason not found when processed within
>> Intellij.
>>
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>


Building Spark 2.X in Intellij

2016-06-22 Thread Stephen Boesch
Building inside intellij is an ever moving target. Anyone have the magical
procedures to get it going for 2.X?

There are numerous library references that - although included in the
pom.xml build - are for some reason not found when processed within
Intellij.


Notebook(s) for Spark 2.0 ?

2016-06-20 Thread Stephen Boesch
Having looked closely at Jupyter, Zeppelin, and Spark-Notebook: only the
latter seems to be close to having support for Spark 2.X.

While I am interested in using Spark Notebook as soon as that support is
available, are there alternatives that work *now*?  For example, some
unmerged-yet-working fork(s)?

thanks


Data Generators mllib -> ml

2016-06-20 Thread Stephen Boesch
There are around twenty data generators in mllib - none of which have
presently been migrated to ml.

Here is an example

/**
 * :: DeveloperApi ::
 * Generate sample data used for SVM. This class generates uniform random values
 * for the features and adds Gaussian noise with weight 0.1 to generate labels.
 */
@DeveloperApi
@Since("0.8.0")
object SVMDataGenerator {


Will these be migrated - and is there any indication of a timeframe?
My intention would be to publicize these as non-deprecated for the 2.0
and 2.1 timeframes.


Re: Python to Scala

2016-06-17 Thread Stephen Boesch
What are you expecting us to do?  Yash provided a reasonable approach based
on the info you had provided in prior emails.  Otherwise you can convert it
from Python to Scala yourself - or find someone else who feels comfortable
doing it.  That kind of inquiry would likely be appropriate on a job board.



2016-06-17 21:47 GMT-07:00 Aakash Basu :

> Hey,
>
> Our complete project is in Spark on Scala, I code in Scala for Spark,
> though am new, but I know it and still learning. But I need help in
> converting this code to Scala. I've nearly no knowledge in Python, hence,
> requested the experts here.
>
> Hope you get me now.
>
> Thanks,
> Aakash.
> On 18-Jun-2016 10:07 AM, "Yash Sharma"  wrote:
>
>> You could use pyspark to run the python code on spark directly. That will
>> cut the effort of learning scala.
>>
>> https://spark.apache.org/docs/0.9.0/python-programming-guide.html
>>
>> - Thanks, via mobile,  excuse brevity.
>> On Jun 18, 2016 2:34 PM, "Aakash Basu"  wrote:
>>
>>> Hi all,
>>>
>>> I've a python code, which I want to convert to Scala for using it in a
>>> Spark program. I'm not so well acquainted with python and learning scala
>>> now. Any Python+Scala expert here? Can someone help me out in this please?
>>>
>>> Thanks & Regards,
>>> Aakash.
>>>
>>


Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

2016-06-09 Thread Stephen Boesch
How many workers (/cpu cores) are assigned to this job?

2016-06-09 13:01 GMT-07:00 SRK :

> Hi,
>
> How to insert data into 2000 partitions(directories) of ORC/parquet  at a
> time using Spark SQL? It seems to be not performant when I try to insert
> 2000 directories of Parquet/ORC using Spark SQL. Did anyone face this
> issue?
>
> Thanks!
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-insert-data-into-2000-partitions-directories-of-ORC-parquet-at-a-time-using-Spark-SQL-tp27132.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: HIVE Query 25x faster than SPARK Query

2016-06-09 Thread Stephen Boesch
ooc are the tables partitioned on a.pk and b.fk?  Hive might be using
copartitioning in that case: it is one of hive's strengths.

2016-06-09 7:28 GMT-07:00 Gourav Sengupta :

> Hi Mich,
>
> does not Hive use map-reduce? I thought it to be so. And since I am
> running the queries in EMR 4.6 therefore HIVE is not using TEZ.
>
>
> Regards,
> Gourav
>
> On Thu, Jun 9, 2016 at 3:25 PM, Mich Talebzadeh  > wrote:
>
>> are you using map-reduce with Hive?
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 9 June 2016 at 15:14, Gourav Sengupta 
>> wrote:
>>
>>> Hi,
>>>
>>> Query1 is almost 25x faster in HIVE than in SPARK. What is happening
>>> here and is there a way we can optimize the queries in SPARK without the
>>> obvious hack in Query2.
>>>
>>>
>>> ---
>>> ENVIRONMENT:
>>> ---
>>>
>>> > Table A 533 columns x 24 million rows and Table B has 2 columns x 3
>>> million rows. Both the files are single gzipped csv file.
>>> > Both table A and B are external tables in AWS S3 and created in HIVE
>>> accessed through SPARK using HiveContext
>>> > EMR 4.6, Spark 1.6.1 and Hive 1.0.0 (clusters started using
>>> allowMaximumResource allocation and node types are c3.4xlarge).
>>>
>>> --
>>> QUERY1:
>>> --
>>> select A.PK, B.FK
>>> from A
>>> left outer join B on (A.PK = B.FK)
>>> where B.FK is not null;
>>>
>>>
>>>
>>> This query takes 4 mins in HIVE and 1.1 hours in SPARK
>>>
>>>
>>> --
>>> QUERY 2:
>>> --
>>>
>>> select A.PK, B.FK
>>> from (select PK from A) A
>>> left outer join B on (A.PK = B.FK)
>>> where B.FK is not null;
>>>
>>> This query takes 4.5 mins in SPARK
>>>
>>>
>>>
>>> Regards,
>>> Gourav Sengupta
>>>
>>>
>>>
>>>
>>
>


Re: Multinomial regression with spark.ml version of LogisticRegression

2016-05-29 Thread Stephen Boesch
Thanks Bryan for that pointer: I will follow it. In the meantime the
OneVsRest approach appears to satisfy the requirements.
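
For the archive, a rough sketch of that OneVsRest route (Scala, spark.ml; the
`train`/`test` DataFrames with "label" and "features" columns are assumed):

import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}

val binaryLr = new LogisticRegression()
  .setMaxIter(100)
  .setRegParam(0.01)
  .setElasticNetParam(0.5)   // elastic-net support is what the spark.ml version adds

val ovr = new OneVsRest().setClassifier(binaryLr)
val ovrModel = ovr.fit(train)
val predictions = ovrModel.transform(test)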

2016-05-29 15:40 GMT-07:00 Bryan Cutler <cutl...@gmail.com>:

> This is currently being worked on, planned for 2.1 I believe
> https://issues.apache.org/jira/browse/SPARK-7159
> On May 28, 2016 9:31 PM, "Stephen Boesch" <java...@gmail.com> wrote:
>
>> Thanks Phuong But the point of my post is how to achieve without using
>>  the deprecated the mllib pacakge. The mllib package already has
>>  multinomial regression built in
>>
>> 2016-05-28 21:19 GMT-07:00 Phuong LE-HONG <phuon...@gmail.com>:
>>
>>> Dear Stephen,
>>>
>>> Yes, you're right, LogisticGradient is in the mllib package, not ml
>>> package. I just want to say that we can build a multinomial logistic
>>> regression model from the current version of Spark.
>>>
>>> Regards,
>>>
>>> Phuong
>>>
>>>
>>>
>>> On Sun, May 29, 2016 at 12:04 AM, Stephen Boesch <java...@gmail.com>
>>> wrote:
>>> > Hi Phuong,
>>> >The LogisticGradient exists in the mllib but not ml package. The
>>> > LogisticRegression chooses either the breeze LBFGS - if L2 only (not
>>> elastic
>>> > net) and no regularization or the Orthant Wise Quasi Newton (OWLQN)
>>> > otherwise: it does not appear to choose GD in either scenario.
>>> >
>>> > If I have misunderstood your response please do clarify.
>>> >
>>> > thanks stephenb
>>> >
>>> > 2016-05-28 20:55 GMT-07:00 Phuong LE-HONG <phuon...@gmail.com>:
>>> >>
>>> >> Dear Stephen,
>>> >>
>>> >> The Logistic Regression currently supports only binary regression.
>>> >> However, the LogisticGradient does support computing gradient and loss
>>> >> for a multinomial logistic regression. That is, you can train a
>>> >> multinomial logistic regression model with LogisticGradient and a
>>> >> class to solve optimization like LBFGS to get a weight vector of the
>>> >> size (numClassrd-1)*numFeatures.
>>> >>
>>> >>
>>> >> Phuong
>>> >>
>>> >>
>>> >> On Sat, May 28, 2016 at 12:25 PM, Stephen Boesch <java...@gmail.com>
>>> >> wrote:
>>> >> > Followup: just encountered the "OneVsRest" classifier in
>>> >> > ml.classsification: I will look into using it with the binary
>>> >> > LogisticRegression as the provided classifier.
>>> >> >
>>> >> > 2016-05-28 9:06 GMT-07:00 Stephen Boesch <java...@gmail.com>:
>>> >> >>
>>> >> >>
>>> >> >> Presently only the mllib version has the one-vs-all approach for
>>> >> >> multinomial support.  The ml version with ElasticNet support only
>>> >> >> allows
>>> >> >> binary regression.
>>> >> >>
>>> >> >> With feature parity of ml vs mllib having been stated as an
>>> objective
>>> >> >> for
>>> >> >> 2.0.0 -  is there a projected availability of the  multinomial
>>> >> >> regression in
>>> >> >> the ml package?
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >> `
>>> >> >
>>> >> >
>>> >
>>> >
>>>
>>
>>


Re: Multinomial regression with spark.ml version of LogisticRegression

2016-05-28 Thread Stephen Boesch
Thanks Phuong. But the point of my post is how to achieve this without using
the deprecated mllib package. The mllib package already has multinomial
regression built in.

2016-05-28 21:19 GMT-07:00 Phuong LE-HONG <phuon...@gmail.com>:

> Dear Stephen,
>
> Yes, you're right, LogisticGradient is in the mllib package, not ml
> package. I just want to say that we can build a multinomial logistic
> regression model from the current version of Spark.
>
> Regards,
>
> Phuong
>
>
>
> On Sun, May 29, 2016 at 12:04 AM, Stephen Boesch <java...@gmail.com>
> wrote:
> > Hi Phuong,
> >The LogisticGradient exists in the mllib but not ml package. The
> > LogisticRegression chooses either the breeze LBFGS - if L2 only (not
> elastic
> > net) and no regularization or the Orthant Wise Quasi Newton (OWLQN)
> > otherwise: it does not appear to choose GD in either scenario.
> >
> > If I have misunderstood your response please do clarify.
> >
> > thanks stephenb
> >
> > 2016-05-28 20:55 GMT-07:00 Phuong LE-HONG <phuon...@gmail.com>:
> >>
> >> Dear Stephen,
> >>
> >> The Logistic Regression currently supports only binary regression.
> >> However, the LogisticGradient does support computing gradient and loss
> >> for a multinomial logistic regression. That is, you can train a
> >> multinomial logistic regression model with LogisticGradient and a
> >> class to solve optimization like LBFGS to get a weight vector of the
> >> size (numClasses-1)*numFeatures.
> >>
> >>
> >> Phuong
> >>
> >>
> >> On Sat, May 28, 2016 at 12:25 PM, Stephen Boesch <java...@gmail.com>
> >> wrote:
> >> > Followup: just encountered the "OneVsRest" classifier in
> >> > ml.classification: I will look into using it with the binary
> >> > LogisticRegression as the provided classifier.
> >> >
> >> > 2016-05-28 9:06 GMT-07:00 Stephen Boesch <java...@gmail.com>:
> >> >>
> >> >>
> >> >> Presently only the mllib version has the one-vs-all approach for
> >> >> multinomial support.  The ml version with ElasticNet support only
> >> >> allows
> >> >> binary regression.
> >> >>
> >> >> With feature parity of ml vs mllib having been stated as an objective
> >> >> for
> >> >> 2.0.0 -  is there a projected availability of the  multinomial
> >> >> regression in
> >> >> the ml package?
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> `
> >> >
> >> >
> >
> >
>


Re: Multinomial regression with spark.ml version of LogisticRegression

2016-05-28 Thread Stephen Boesch
Hi Phuong,
   The LogisticGradient exists in the mllib but not ml package. The
LogisticRegression chooses either the breeze LBFGS - if L2 only (not
elastic net) and no regularization or the Orthant Wise Quasi Newton (OWLQN)
otherwise: it does not appear to choose GD in either scenario.

If I have misunderstood your response please do clarify.

thanks stephenb

2016-05-28 20:55 GMT-07:00 Phuong LE-HONG <phuon...@gmail.com>:

> Dear Stephen,
>
> The Logistic Regression currently supports only binary regression.
> However, the LogisticGradient does support computing gradient and loss
> for a multinomial logistic regression. That is, you can train a
> multinomial logistic regression model with LogisticGradient and a
> class to solve optimization like LBFGS to get a weight vector of the
> size (numClasses-1)*numFeatures.
>
>
> Phuong
>
>
> On Sat, May 28, 2016 at 12:25 PM, Stephen Boesch <java...@gmail.com>
> wrote:
> > Followup: just encountered the "OneVsRest" classifier in
> > ml.classification: I will look into using it with the binary
> > LogisticRegression as the provided classifier.
> >
> > 2016-05-28 9:06 GMT-07:00 Stephen Boesch <java...@gmail.com>:
> >>
> >>
> >> Presently only the mllib version has the one-vs-all approach for
> >> multinomial support.  The ml version with ElasticNet support only allows
> >> binary regression.
> >>
> >> With feature parity of ml vs mllib having been stated as an objective
> for
> >> 2.0.0 -  is there a projected availability of the  multinomial
> regression in
> >> the ml package?
> >>
> >>
> >>
> >>
> >> `
> >
> >
>


Re: Multinomial regression with spark.ml version of LogisticRegression

2016-05-28 Thread Stephen Boesch
Follow-up: I just encountered the "OneVsRest" classifier in
ml.classification: I will look into using it with the binary
LogisticRegression as the provided classifier.
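
For reference, a minimal sketch of that route (the training and test
DataFrames, with the usual "label"/"features" columns, and the regularization
settings below are placeholders, not anything from an actual run):

    import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}

    val lr = new LogisticRegression().setMaxIter(100).setRegParam(0.01)
    val ovr = new OneVsRest().setClassifier(lr)   // trains one binary LR per class
    val ovrModel = ovr.fit(training)              // training: DataFrame(label, features)
    val predictions = ovrModel.transform(test)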

2016-05-28 9:06 GMT-07:00 Stephen Boesch <java...@gmail.com>:

>
> Presently only the mllib version has the one-vs-all approach for
> multinomial support.  The ml version with ElasticNet support only allows
> binary regression.
>
> With feature parity of ml vs mllib having been stated as an objective for
> 2.0.0 -  is there a projected availability of the  multinomial regression
> in the ml package?
>
>
>
>
> `
>


Multinomial regression with spark.ml version of LogisticRegression

2016-05-28 Thread Stephen Boesch
Presently only the mllib version has the one-vs-all approach for
multinomial support.  The ml version with ElasticNet support only allows
binary regression.

With feature parity of ml vs mllib having been stated as an objective for
2.0.0 -  is there a projected availability of the  multinomial regression
in the ml package?




`


Re: How to use the spark submit script / capability

2016-05-15 Thread Stephen Boesch
Hi Marcelo,  here is the JIRA
https://issues.apache.org/jira/browse/SPARK-4924

Jeff Zhang
<https://issues.apache.org/jira/secure/ViewProfile.jspa?name=zjffdu> added
a comment - 26/Nov/15 08:15

Marcelo Vanzin
<https://issues.apache.org/jira/secure/ViewProfile.jspa?name=vanzin> Is
there any user document about it ? I didn't find it on the spark official
site. If this is not production ready, I think adding documentation to let
users know would be a good start.
Jiahongchao
<https://issues.apache.org/jira/secure/ViewProfile.jspa?name=ihavenoemail%40163.com>
added a comment - 28/Dec/15 03:51

Where is the official document?





2016-05-15 12:04 GMT-07:00 Marcelo Vanzin <van...@cloudera.com>:

> I don't understand your question. The PR you mention is not about
> spark-submit.
>
> If you want help with spark-submit, check the Spark docs or "spark-submit
> -h".
>
> If you want help with the library added in the PR, check Spark's API
> documentation.
>
>
> On Sun, May 15, 2016 at 9:33 AM, Stephen Boesch <java...@gmail.com> wrote:
> >
> > There is a committed PR from Marcelo Vanzin addressing that capability:
> >
> > https://github.com/apache/spark/pull/3916/files
> >
> > Is there any documentation on how to use this?  The PR itself has two
> > comments asking for the docs that were not answered.
>
>
>
> --
> Marcelo
>


How to use the spark submit script / capability

2016-05-15 Thread Stephen Boesch
There is a committed PR from Marcelo Vanzin addressing that capability:

https://github.com/apache/spark/pull/3916/files

Is there any documentation on how to use this?  The PR itself has two
comments asking for the docs that were not answered.


Re: spark task scheduling delay

2016-01-20 Thread Stephen Boesch
Which Resource Manager  are you using?

2016-01-20 21:38 GMT-08:00 Renu Yadav :

> Any suggestions?
>
> On Wed, Jan 20, 2016 at 6:50 PM, Renu Yadav  wrote:
>
>> Hi ,
>>
>> I am facing spark   task scheduling delay issue in spark 1.4.
>>
>> suppose I have 1600 tasks running then 1550 tasks runs fine but for the
>> remaining 50 i am facing task delay even if the input size of these task is
>> same as the above 1550 tasks
>>
>> Please suggest some solution.
>>
>> Thanks & Regards
>> Renu Yadav
>>
>
>


Re: Recommendations using Spark

2016-01-07 Thread Stephen Boesch
Alternating least squares takes  an RDD of (user/product/ratings) tuples
and the resulting Model provides predict(user, product) or predictProducts
methods among others.
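
A minimal sketch of that flow with the mllib ALS API (the file path, rank,
iteration count, lambda and the user/product ids below are made-up values,
and sc is the usual spark-shell SparkContext):

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    val ratings = sc.textFile("ratings.csv").map(_.split(',')).map {
      case Array(user, product, rating) => Rating(user.toInt, product.toInt, rating.toDouble)
    }
    val model = ALS.train(ratings, 10, 10, 0.01)   // rank, iterations, lambda
    val oneScore = model.predict(42, 17)           // predicted rating for (user 42, product 17)
    val top5 = model.recommendProducts(42, 5)      // top-5 recommended products for user 42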


Re: Spark SQL 1.5.2 missing JDBC driver for PostgreSQL?

2015-12-22 Thread Stephen Boesch
The postgres jdbc driver needs to be added to the  classpath of your spark
workers.  You can do a search for how to do that (multiple ways).

2015-12-22 17:22 GMT-08:00 b2k70 :

> I see in the Spark SQL documentation that a temporary table can be created
> directly onto a remote PostgreSQL table.
>
> CREATE TEMPORARY TABLE 
> USING org.apache.spark.sql.jdbc
> OPTIONS (
> url "jdbc:postgresql:///",
> dbtable "impressions"
> );
> When I run this against our PostgreSQL server, I get the following error.
>
> Error: java.sql.SQLException: No suitable driver found for
> jdbc:postgresql:/// (state=,code=0)
>
> Can someone help me understand why this is?
>
> Thanks, Ben
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-1-5-2-missing-JDBC-driver-for-PostgreSQL-tp25773.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Spark SQL 1.5.2 missing JDBC driver for PostgreSQL?

2015-12-22 Thread Stephen Boesch
Hi Benjamin, yes, by adding it to the Thrift server's classpath the create
table works. But querying is performed by the workers, so you need to add
the driver to the classpath of all nodes for reads to work.
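
A sketch of what "all nodes" means in practice (worker1/worker2 are
placeholder hostnames; the jar path is the one from your spark-defaults.conf):
either put the jar at that same path on every worker so the extraClassPath
entries resolve everywhere, or let spark-submit ship it with --jars.

    # copy the driver jar to the identical location on every worker node
    for h in worker1 worker2; do
      scp /usr/share/java/postgresql-9.3-1104.jdbc41.jar $h:/usr/share/java/
    done

    # ...or distribute it at submit time instead:
    #   bin/spark-submit --jars /usr/share/java/postgresql-9.3-1104.jdbc41.jar ...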

2015-12-22 18:35 GMT-08:00 Benjamin Kim <bbuil...@gmail.com>:

> Hi Stephen,
>
> I forgot to mention that I added these lines below to the
> spark-default.conf on the node with Spark SQL Thrift JDBC/ODBC Server
> running on it. Then, I restarted it.
>
> spark.driver.extraClassPath=/usr/share/java/postgresql-9.3-1104.jdbc41.jar
>
> spark.executor.extraClassPath=/usr/share/java/postgresql-9.3-1104.jdbc41.jar
>
> I read in another thread that this would work. I was able to create the
> table and could see it in my SHOW TABLES list. But, when I try to query the
> table, I get the same error. It looks like I’m getting close.
>
> Are there any other things that I have to do that you can think of?
>
> Thanks,
> Ben
>
>
> On Dec 22, 2015, at 6:25 PM, Stephen Boesch <java...@gmail.com> wrote:
>
> The postgres jdbc driver needs to be added to the  classpath of your spark
> workers.  You can do a search for how to do that (multiple ways).
>
> 2015-12-22 17:22 GMT-08:00 b2k70 <bbuil...@gmail.com>:
>
>> I see in the Spark SQL documentation that a temporary table can be created
>> directly onto a remote PostgreSQL table.
>>
>> CREATE TEMPORARY TABLE 
>> USING org.apache.spark.sql.jdbc
>> OPTIONS (
>> url "jdbc:postgresql:///",
>> dbtable "impressions"
>> );
>> When I run this against our PostgreSQL server, I get the following error.
>>
>> Error: java.sql.SQLException: No suitable driver found for
>> jdbc:postgresql:/// (state=,code=0)
>>
>> Can someone help me understand why this is?
>>
>> Thanks, Ben
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-1-5-2-missing-JDBC-driver-for-PostgreSQL-tp25773.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com
>> <http://nabble.com>.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>
>


Re: Scala VS Java VS Python

2015-12-16 Thread Stephen Boesch
There are solid reasons to have built Spark on the JVM vs Python. The
question for Daniel appears at this point to be Scala vs Java 8. For that
there are many comparisons already available; but in the case of working
with Spark there is the additional benefit on the Scala side that the core
libraries are in that language.

2015-12-16 13:41 GMT-08:00 Darren Govoni :

> I use python too. I'm actually surprised it's not the primary language
> since it is by far more used in data science than Java and Scala combined.
>
> If I had a second choice of script language for general apps I'd want
> groovy over scala.
>
>
>
> Sent from my Verizon Wireless 4G LTE smartphone
>
>
>  Original message 
> From: Daniel Lopes 
> Date: 12/16/2015 4:16 PM (GMT-05:00)
> To: Daniel Valdivia 
> Cc: user 
> Subject: Re: Scala VS Java VS Python
>
> For me Scala is better like Spark is written in Scala, and I like python
> cuz I always used python for data science. :)
>
> On Wed, Dec 16, 2015 at 5:54 PM, Daniel Valdivia 
> wrote:
>
>> Hello,
>>
>> This is more of a "survey" question for the community, you can reply to
>> me directly so we don't flood the mailing list.
>>
>> I'm having a hard time learning Spark using Python since the API seems to
>> be slightly incomplete, so I'm looking at my options to start doing all my
>> apps in either Scala or Java, being a Java Developer, java 1.8 looks like
>> the logical way, however I'd like to ask here what's the most common (Scala
>> Or Java) since I'm observing mixed results in the social documentation,
>> however Scala seems to be the predominant language for spark examples.
>>
>> Thank for the advice
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>
>
> --
> *Daniel Lopes, B.Eng*
> Data Scientist - BankFacil
> CREA/SP 5069410560
> 
> Mob +55 (18) 99764-2733 
> Ph +55 (11) 3522-8009
> http://about.me/dannyeuu
>
> Av. Nova Independência, 956, São Paulo, SP
> Bairro Brooklin Paulista
> CEP 04570-001
> https://www.bankfacil.com.br
>
>


Re: Avoid Shuffling on Partitioned Data

2015-12-04 Thread Stephen Boesch
@Yu Fengdong: Your approach - specifically the groupBy - results in a
shuffle, does it not?
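
A sketch of the partition-pruning angle, using the month=... layout from this
thread (the base path is an example and sqlContext is the spark-shell one):
a filter on the partition column prunes directories at read time, although
any aggregation over the surviving rows still involves a shuffle.

    val df = sqlContext.read.parquet("/data/table")   // base path containing month=... subdirs
    val nov2014Rows = df.filter(df("month") === 201411).count()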

2015-12-04 2:02 GMT-08:00 Fengdong Yu :

> There are many ways, one simple is:
>
> such as: you want to know how many rows for each month:
>
>
> sqlContext.read.parquet(“……../month=*”).select($“month").groupBy($”month”).count
>
>
> the output looks like:
>
> month     count
> 201411    100
> 201412    200
>
>
> hopes help.
>
>
>
> > On Dec 4, 2015, at 5:53 PM, Yiannis Gkoufas 
> wrote:
> >
> > Hi there,
> >
> > I have my data stored in HDFS partitioned by month in Parquet format.
> > The directory looks like this:
> >
> > -month=201411
> > -month=201412
> > -month=201501
> > -
> >
> > I want to compute some aggregates for every timestamp.
> > How is it possible to achieve that by taking advantage of the existing
> partitioning?
> > One naive way I am thinking is issuing multiple sql queries:
> >
> > SELECT * FROM TABLE WHERE month=201411
> > SELECT * FROM TABLE WHERE month=201412
> > SELECT * FROM TABLE WHERE month=201501
> > .
> >
> > computing the aggregates on the results of each query and combining them
> in the end.
> >
> > I think there should be a better way right?
> >
> > Thanks
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Spark 1.6 Build

2015-11-24 Thread Stephen Boesch
thx for mentioning the build requirement

But actually it is -Dscala-2.11 (i.e. -D for a Java system property instead
of -P for a profile).

details:

We can see this in the pom.xml:

    <profile>
      <id>scala-2.11</id>
      <activation>
        <property><name>scala-2.11</name></property>
      </activation>
      <properties>
        <scala.version>2.11.7</scala.version>
        <scala.binary.version>2.11</scala.binary.version>
      </properties>
    </profile>

So the scala-2.11 profile is activated by detecting the scala-2.11 system
property being set
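
Concretely, a build that picks up that profile via the property looks roughly
like this sketch (the extra flags besides -Dscala-2.11 are just examples):

    dev/change-scala-version.sh 2.11
    build/mvn -Dscala-2.11 -Phadoop-2.6 -DskipTests clean package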



2015-11-24 10:01 GMT-08:00 Ted Yu <yuzhih...@gmail.com>:

> See also:
>
> https://repository.apache.org/content/repositories/orgapachespark-1162/org/apache/spark/spark-core_2.11/v1.6.0-preview2/
>
> w.r.t. building locally, please specify -Pscala-2.11
>
> Cheers
>
> On Tue, Nov 24, 2015 at 9:58 AM, Stephen Boesch <java...@gmail.com> wrote:
>
>> HI Madabhattula
>>  Scala 2.11 requires building from source.  Prebuilt binaries are
>> available only for scala 2.10
>>
>> From the src folder:
>>
>>dev/change-scala-version.sh 2.11
>>
>> Then build as you would normally either from mvn or sbt
>>
>> The above info *is* included in the spark docs but a little hard to find.
>>
>>
>>
>> 2015-11-24 9:50 GMT-08:00 Madabhattula Rajesh Kumar <mrajaf...@gmail.com>
>> :
>>
>>> Hi Ted,
>>>
>>> I'm not able find "spark-core_2.11 and spark-sql_2.11 jar files" in
>>> above link.
>>>
>>> Regards,
>>> Rajesh
>>>
>>> On Tue, Nov 24, 2015 at 11:03 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>
>>>> See:
>>>>
>>>> http://search-hadoop.com/m/q3RTtF1Zmw12wTWX/spark+1.6+preview=+ANNOUNCE+Spark+1+6+0+Release+Preview
>>>>
>>>> On Tue, Nov 24, 2015 at 9:31 AM, Madabhattula Rajesh Kumar <
>>>> mrajaf...@gmail.com> wrote:
>>>>
>>>>> Hi Prem,
>>>>>
>>>>> Thank you for the details. I'm not able to build. I'm facing some
>>>>> issues.
>>>>>
>>>>> Any repository link, where I can download (preview version of)  1.6
>>>>> version of spark-core_2.11 and spark-sql_2.11 jar files.
>>>>>
>>>>> Regards,
>>>>> Rajesh
>>>>>
>>>>> On Tue, Nov 24, 2015 at 6:03 PM, Prem Sure <premsure...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> you can refer..:
>>>>>> https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/building-spark.html#building-with-buildmvn
>>>>>>
>>>>>>
>>>>>> On Tue, Nov 24, 2015 at 7:16 AM, Madabhattula Rajesh Kumar <
>>>>>> mrajaf...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I'm not able to build Spark 1.6 from source. Could you please share
>>>>>>> the steps to build Spark 1.16
>>>>>>>
>>>>>>> Regards,
>>>>>>> Rajesh
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>


Re: Spark 1.6 Build

2015-11-24 Thread Stephen Boesch
HI Madabhattula
 Scala 2.11 requires building from source.  Prebuilt binaries are
available only for scala 2.10

From the src folder:

   dev/change-scala-version.sh 2.11

Then build as you would normally either from mvn or sbt

The above info *is* included in the spark docs but a little hard to find.



2015-11-24 9:50 GMT-08:00 Madabhattula Rajesh Kumar :

> Hi Ted,
>
> I'm not able find "spark-core_2.11 and spark-sql_2.11 jar files" in above
> link.
>
> Regards,
> Rajesh
>
> On Tue, Nov 24, 2015 at 11:03 PM, Ted Yu  wrote:
>
>> See:
>>
>> http://search-hadoop.com/m/q3RTtF1Zmw12wTWX/spark+1.6+preview=+ANNOUNCE+Spark+1+6+0+Release+Preview
>>
>> On Tue, Nov 24, 2015 at 9:31 AM, Madabhattula Rajesh Kumar <
>> mrajaf...@gmail.com> wrote:
>>
>>> Hi Prem,
>>>
>>> Thank you for the details. I'm not able to build. I'm facing some
>>> issues.
>>>
>>> Any repository link, where I can download (preview version of)  1.6
>>> version of spark-core_2.11 and spark-sql_2.11 jar files.
>>>
>>> Regards,
>>> Rajesh
>>>
>>> On Tue, Nov 24, 2015 at 6:03 PM, Prem Sure 
>>> wrote:
>>>
 you can refer..:
 https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/building-spark.html#building-with-buildmvn


 On Tue, Nov 24, 2015 at 7:16 AM, Madabhattula Rajesh Kumar <
 mrajaf...@gmail.com> wrote:

> Hi,
>
> I'm not able to build Spark 1.6 from source. Could you please share
> the steps to build Spark 1.16
>
> Regards,
> Rajesh
>


>>>
>>
>


Re: Spark-SQL idiomatic way of adding a new partition or writing to Partitioned Persistent Table

2015-11-22 Thread Stephen Boesch
>> and then use the  Hive's dynamic partitioned insert syntax

What does this entail?  Same sql but you need to do

   set  hive.exec.dynamic.partition = true;
in the hive/sql context  (along with several other related dynamic
partition settings.)

Is there anything else/special required?
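
For the archive, a sketch of what the dynamic partitioned insert looks like
from a HiveContext hc (the table and column names are made up; the partition
column has to come last in the SELECT):

    hc.sql("set hive.exec.dynamic.partition = true")
    hc.sql("set hive.exec.dynamic.partition.mode = nonstrict")
    hc.sql("""
      INSERT OVERWRITE TABLE sales PARTITION (month)
      SELECT pk, amount, month FROM sales_staging
    """)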


2015-11-22 17:32 GMT-08:00 Deenar Toraskar :

> Thanks Michael
>
> Thanks for the response. Here is my understanding, correct me if I am wrong
>
> 1) Spark SQL written partitioned tables do not write metadata to the Hive
> metastore. Spark SQL discovers partitions from the table location on the
> underlying DFS, and not the metastore. It does this the first time a table
> is accessed, so if the underlying partitions change a refresh table
>  is required. Is there a way to see partitions discovered by
> Spark SQL, show partitions  does not work on Spark SQL
> partitioned tables. Also hive allows different partitions in different
> physical locations, I guess this wont be possibly in Spark SQL.
>
> 2) If you want to retain compatibility with other SQL on Hadoop engines,
> register your dataframe as a temp table and then use the  Hive's dynamic
> partitioned insert syntax. SparkSQL uses this for Hive style tables.
>
> 3) Automatic schema discovery. I presume this is parquet only and only if 
> spark.sql.parquet.mergeSchema
> / mergeSchema is set to true. What happens when mergeSchema is set to
> false ( i guess i can check this out).
>
> My two cents
>
> a) it would help if there was kind of the hive nonstrict mode equivalent,
> which would enforce schema compatibility for all partitions written to a
> table.
> b) refresh table is annoying for tables where partitions are being written
> frequently, for other reasons, not sure if there is way around this.
> c) it would be great if DataFrameWriter had an option to maintain
> compatibility with the HiveMetastore. registerTempTable and "insert
> overwrite table select from" is quite ugly and cumbersome
> d) It would be helpful to resurrect the
> https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/,
> happy to help out with the Spark SQL portions.
>
> Regards
> Deenar
>
>
> On 22 November 2015 at 18:54, Michael Armbrust 
> wrote:
>
>> Is it possible to add a new partition to a persistent table using Spark
>>> SQL ? The following call works and data gets written in the correct
>>> directories, but no partition metadata is not added to the Hive metastore.
>>>
>> I believe if you use Hive's dynamic partitioned insert syntax then we
>> will fall back on metastore and do the update.
>>
>>> In addition I see nothing preventing any arbitrary schema being appended
>>> to the existing table.
>>>
>> This is perhaps kind of a feature, we do automatic schema discovery and
>> merging when loading a new parquet table.
>>
>>> Does SparkSQL not need partition metadata when reading data back?
>>>
>> No, we dynamically discover it in a distributed job when the table is
>> loaded.
>>
>
>


Do windowing functions require hive support?

2015-11-18 Thread Stephen Boesch
The following works against a hive table from spark sql

hc.sql("select id,r from (select id, name, rank()  over (order by name) as
r from tt2) v where v.r >= 1 and v.r <= 12")

But when using  a standard sql context against a temporary table the
following occurs:


Exception in thread "main" java.lang.RuntimeException: [3.25]
  failure: ``)'' expected but `(' found

rank() over (order by name) as r
^


Re: Do windowing functions require hive support?

2015-11-18 Thread Stephen Boesch
Checked out 1.6.0-SNAPSHOT 60 minutes ago

2015-11-18 19:19 GMT-08:00 Jack Yang <j...@uow.edu.au>:

> Which version of spark are you using?
>
>
>
> *From:* Stephen Boesch [mailto:java...@gmail.com]
> *Sent:* Thursday, 19 November 2015 2:12 PM
> *To:* user
> *Subject:* Do windowing functions require hive support?
>
>
>
>
>
> The following works against a hive table from spark sql
>
>
>
> hc.sql("select id,r from (select id, name, rank()  over (order by name) as
> r from tt2) v where v.r >= 1 and v.r <= 12")
>
>
>
> But when using  a standard sql context against a temporary table the
> following occurs:
>
>
>
>
>
> Exception in thread "main" java.lang.RuntimeException: [3.25]
>
>   failure: ``)'' expected but `(' found
>
>
>
> rank() over (order by name) as r
>
> ^
>
>


Re: Do windowing functions require hive support?

2015-11-18 Thread Stephen Boesch
But to focus the attention properly: I had already tried out 1.5.2.

2015-11-18 19:46 GMT-08:00 Stephen Boesch <java...@gmail.com>:

> Checked out 1.6.0-SNAPSHOT 60 minutes ago
>
> 2015-11-18 19:19 GMT-08:00 Jack Yang <j...@uow.edu.au>:
>
>> Which version of spark are you using?
>>
>>
>>
>> *From:* Stephen Boesch [mailto:java...@gmail.com]
>> *Sent:* Thursday, 19 November 2015 2:12 PM
>> *To:* user
>> *Subject:* Do windowing functions require hive support?
>>
>>
>>
>>
>>
>> The following works against a hive table from spark sql
>>
>>
>>
>> hc.sql("select id,r from (select id, name, rank()  over (order by name)
>> as r from tt2) v where v.r >= 1 and v.r <= 12")
>>
>>
>>
>> But when using  a standard sql context against a temporary table the
>> following occurs:
>>
>>
>>
>>
>>
>> Exception in thread "main" java.lang.RuntimeException: [3.25]
>>
>>   failure: ``)'' expected but `(' found
>>
>>
>>
>> rank() over (order by name) as r
>>
>> ^
>>
>>
>


Re: Do windowing functions require hive support?

2015-11-18 Thread Stephen Boesch
Why is the same query (and actually I tried several variations) working
against a HiveContext and not against the SQLContext?
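
For comparison, a minimal sketch of the HiveContext variant that does accept
the query (sc is an existing SparkContext; the json path and its id/name
fields are placeholders):

    import org.apache.spark.sql.hive.HiveContext

    val hc = new HiveContext(sc)
    val df = hc.read.json("people.json")   // any DataFrame created through hc
    df.registerTempTable("tt2")
    hc.sql("select id, r from (select id, name, rank() over (order by name) as r from tt2) v where v.r >= 1 and v.r <= 12")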

2015-11-18 19:57 GMT-08:00 Michael Armbrust <mich...@databricks.com>:

> Yes they do.
>
> On Wed, Nov 18, 2015 at 7:49 PM, Stephen Boesch <java...@gmail.com> wrote:
>
>> But to focus the attention properly: I had already tried out 1.5.2.
>>
>> 2015-11-18 19:46 GMT-08:00 Stephen Boesch <java...@gmail.com>:
>>
>>> Checked out 1.6.0-SNAPSHOT 60 minutes ago
>>>
>>> 2015-11-18 19:19 GMT-08:00 Jack Yang <j...@uow.edu.au>:
>>>
>>>> Which version of spark are you using?
>>>>
>>>>
>>>>
>>>> *From:* Stephen Boesch [mailto:java...@gmail.com]
>>>> *Sent:* Thursday, 19 November 2015 2:12 PM
>>>> *To:* user
>>>> *Subject:* Do windowing functions require hive support?
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> The following works against a hive table from spark sql
>>>>
>>>>
>>>>
>>>> hc.sql("select id,r from (select id, name, rank()  over (order by name)
>>>> as r from tt2) v where v.r >= 1 and v.r <= 12")
>>>>
>>>>
>>>>
>>>> But when using  a standard sql context against a temporary table the
>>>> following occurs:
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Exception in thread "main" java.lang.RuntimeException: [3.25]
>>>>
>>>>   failure: ``)'' expected but `(' found
>>>>
>>>>
>>>>
>>>> rank() over (order by name) as r
>>>>
>>>> ^
>>>>
>>>>
>>>
>>
>


Re: Examples module not building in intellij

2015-10-04 Thread Stephen Boesch
Thanks Sean. Why would a repo not be resolvable from IJ even though all
modules build properly on the command line?

2015-10-04 2:47 GMT-07:00 Sean Owen <so...@cloudera.com>:

> It builds for me. That message usually really means you can't resolve
> or download from a repo. It's just the last thing that happens to
> fail.
>
> On Sun, Oct 4, 2015 at 7:06 AM, Stephen Boesch <java...@gmail.com> wrote:
> >
> > For a week or two the trunk has not been building for the examples module
> > within intellij. The other modules - including core, sql, mllib, etc are
> > working.
> >
> > A portion of the error message is
> >
> > "Unable to get dependency information: Unable to read the metadata file
> for
> > artifact 'org.eclipse.paho:org.eclipse.paho.client.mqttv3.jar"
> >
> > Any tips appreciated.
>


Examples module not building in intellij

2015-10-04 Thread Stephen Boesch
For a week or two the trunk has not been building for the examples module
within intellij. The other modules - including core, sql, mllib, etc *are *
working.

A portion of the error message is

"Unable to get dependency information: Unable to read the metadata file for
artifact 'org.eclipse.paho:org.eclipse.paho.client.mqttv3.jar"

Any tips appreciated.


Re: Breakpoints not hit with Scalatest + intelliJ

2015-09-18 Thread Stephen Boesch
Hi Michel, please try local[1] and report back whether the breakpoints are hit.

2015-09-18 7:37 GMT-07:00 Michel Lemay :

> Hi,
>
> I'm adding unit tests to some utility functions that are using
> SparkContext but I'm unable to debug code and hit breakpoints when running
> under IntelliJ.
>
> My test creates a SparkContext from a SparkConf().setMaster("local[*]")...
>
> However, when I place a breakpoint in the functions inside a spark
> function, it's never hit.
>
> for instance:
>
> rdd.mapPartitions(p => {
>   myFct(...)
> })
>
> Any help would be appreciated.
>
>
>


Re: How to restrict java unit tests from the maven command line

2015-09-10 Thread Stephen Boesch
Yes, adding that flag does the trick. thanks.
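
Putting the two flags together, the working invocation here looks roughly
like this (the profile flags are the ones from the original command):

    mvn -pl mllib -Pyarn -Phadoop-2.6 -Dscala-2.11 \
      -DwildcardSuites=org.apache.spark.ml.feature.BinarizerSuite \
      -Dtest=none test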

2015-09-10 13:47 GMT-07:00 Sean Owen <so...@cloudera.com>:

> -Dtest=none ?
>
>
> https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-RunningIndividualTests
>
> On Thu, Sep 10, 2015 at 10:39 PM, Stephen Boesch <java...@gmail.com>
> wrote:
> >
> > I have invoked mvn test with the -DwildcardSuites option to specify a
> single
> > BinarizerSuite scalatest suite.
> >
> > The command line is
> >
> >mvn  -pl mllib  -Pyarn -Phadoop-2.6 -Dhadoop2.7.1 -Dscala-2.11
> > -Dmaven.javadoc.skip=true
> > -DwildcardSuites=org.apache.spark.ml.feature.BinarizerSuite test
> >
> > The scala side of affairs is correct: here is the relevant output
> >
> >
> > Results :
> >
> > Discovery starting.
> > Discovery completed in 2 seconds, 614 milliseconds.
> > Run starting. Expected test count is: 3
> > BinarizerSuite:
> > - params
> > - Binarize continuous features with default parameter
> > - Binarize continuous features with setter
> > Run completed in 6 seconds, 311 milliseconds.
> > Total number of tests run: 3
> >
> >
> > So we see only the one scala test suite was run- as intended.
> >
> > But on the java side it seems all of the tests within the mllib project
> were
> > run. Here is a snippet:
> >
> >
> >
> > Running org.apache.spark.ml.attribute.JavaAttributeGroupSuite
> > Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.156
> sec -
> > in org.apache.spark.ml.attribute.JavaAttributeGroupSuite
> > Running org.apache.spark.ml.attribute.JavaAttributeSuite
> > Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.018
> sec -
> > in org.apache.spark.ml.attribute.JavaAttributeSuite
> > Running
> org.apache.spark.ml.classification.JavaDecisionTreeClassifierSuite
> > Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 3.79 sec
> -
> > in
> >
> > .. dozens of similar ..
> >
> > Running org.apache.spark.mllib.tree.JavaDecisionTreeSuite
> > Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.287
> sec -
> > in org.apache.spark.mllib.tree.JavaDecisionTreeSuite
> > -- org.jblas INFO Deleting
> > /sparks/sparkup/mllib/target/tmp/jblas6038907640270970048/libjblas.dylib
> > -- org.jblas INFO Deleting
> >
> /sparks/sparkup/mllib/target/tmp/jblas6038907640270970048/libjblas_arch_flavor.dylib
> > -- org.jblas INFO Deleting
> > /sparks/sparkup/mllib/target/tmp/jblas6038907640270970048
> >
> > Results :
> >
> > Tests run: 106, Failures: 0, Errors: 0, Skipped: 0
> >
> > So what is the mvn option / setting to disable the java tests?
> >
> >
> > ...
>


How to restrict java unit tests from the maven command line

2015-09-10 Thread Stephen Boesch
I have invoked mvn test with the -DwildcardSuites option to specify a
single BinarizerSuite scalatest suite.

The command line is

   mvn  -pl mllib  -Pyarn -Phadoop-2.6 -Dhadoop2.7.1 -Dscala-2.11
-Dmaven.javadoc.skip=true
-DwildcardSuites=org.apache.spark.ml.feature.BinarizerSuite test

The scala side of affairs is correct: here is the relevant output


Results :

Discovery starting.
Discovery completed in 2 seconds, 614 milliseconds.
Run starting. Expected test count is: 3
BinarizerSuite:
- params
- Binarize continuous features with default parameter
- Binarize continuous features with setter
Run completed in 6 seconds, 311 milliseconds.
Total number of tests run: 3


So we see only the one scala test suite was run- as intended.

But on the java side it seems all of the tests within the mllib project
were run. Here is a snippet:



Running org.apache.spark.ml.attribute.JavaAttributeGroupSuite
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.156 sec -
in org.apache.spark.ml.attribute.JavaAttributeGroupSuite
Running org.apache.spark.ml.attribute.JavaAttributeSuite
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.018 sec -
in org.apache.spark.ml.attribute.JavaAttributeSuite
Running org.apache.spark.ml.classification.JavaDecisionTreeClassifierSuite
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 3.79 sec -
in

.. dozens of similar ..

Running org.apache.spark.mllib.tree.JavaDecisionTreeSuite
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.287 sec -
in org.apache.spark.mllib.tree.JavaDecisionTreeSuite
-- org.jblas INFO Deleting
/sparks/sparkup/mllib/target/tmp/jblas6038907640270970048/libjblas.dylib
-- org.jblas INFO Deleting
/sparks/sparkup/mllib/target/tmp/jblas6038907640270970048/libjblas_arch_flavor.dylib
-- org.jblas INFO Deleting
/sparks/sparkup/mllib/target/tmp/jblas6038907640270970048

Results :

Tests run: 106, Failures: 0, Errors: 0, Skipped: 0

So what is the mvn option / setting to disable the java tests?


...


Re: Spark on scala 2.11 build fails due to incorrect jline dependency in REPL

2015-08-17 Thread Stephen Boesch
In 1.4 it is change-scala-version.sh  2.11

But the problem was that it is -Dscala-2.11, not -P. I misread the docs.
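
Spelled out, the sequence described above is roughly this sketch (the profile
and skip flags are carried over from the original command):

    dev/change-scala-version.sh 2.11
    mvn -Dscala-2.11 -Phive -Pyarn -Phadoop-2.6 -DskipTests -Dmaven.javadoc.skip=true clean package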

2015-08-17 14:17 GMT-07:00 Ted Yu yuzhih...@gmail.com:

 You were building against 1.4.x, right ?

 In master branch, switch-to-scala-2.11.sh is gone. There is scala-2.11
 profile.

 FYI

 On Sun, Aug 16, 2015 at 11:12 AM, Stephen Boesch java...@gmail.com
 wrote:


 I am building spark with the following options - most notably the
 **scala-2.11**:

  . dev/switch-to-scala-2.11.sh
 mvn -Phive -Pyarn -Phadoop-2.6 -Dhadoop2.6.2 -Pscala-2.11 -DskipTests
 -Dmaven.javadoc.skip=true clean package


 The build goes pretty far but fails in one of the minor modules *repl*:

 [INFO]
 
 [ERROR] Failed to execute goal on project spark-repl_2.11: Could not
 resolve dependencies
 for project org.apache.spark:spark-repl_2.11:jar:1.5.0-SNAPSHOT:
  Could not   find artifact org.scala-lang:jline:jar:2.11.7 in central
  (https://repo1.maven.org/maven2) - [Help 1]

 Upon investigation - from 2.11.5 and later the scala version of jline is
 no longer required: they use the default jline distribution.

 And in fact the repl only shows dependency on jline for the 2.10.4 scala
 version:

 <profile>
   <id>scala-2.10</id>
   <activation>
     <property><name>!scala-2.11</name></property>
   </activation>
   <properties>
     <scala.version>2.10.4</scala.version>
     <scala.binary.version>2.10</scala.binary.version>
     <jline.version>${scala.version}</jline.version>
     <jline.groupid>org.scala-lang</jline.groupid>
   </properties>
   <dependencyManagement>
     <dependencies>
       <dependency>
         <groupId>${jline.groupid}</groupId>
         <artifactId>jline</artifactId>
         <version>${jline.version}</version>
       </dependency>
     </dependencies>
   </dependencyManagement>
 </profile>

 So then it is not clear why this error is occurring. Pointers appreciated.






Spark on scala 2.11 build fails due to incorrect jline dependency in REPL

2015-08-16 Thread Stephen Boesch
I am building spark with the following options - most notably the
**scala-2.11**:

 . dev/switch-to-scala-2.11.sh
mvn -Phive -Pyarn -Phadoop-2.6 -Dhadoop2.6.2 -Pscala-2.11 -DskipTests
-Dmaven.javadoc.skip=true clean package


The build goes pretty far but fails in one of the minor modules *repl*:

[INFO]

[ERROR] Failed to execute goal on project spark-repl_2.11: Could not
resolve dependencies
for project org.apache.spark:spark-repl_2.11:jar:1.5.0-SNAPSHOT:
 Could not   find artifact org.scala-lang:jline:jar:2.11.7 in central
 (https://repo1.maven.org/maven2) - [Help 1]

Upon investigation - from 2.11.5 and later the scala version of jline is no
longer required: they use the default jline distribution.

And in fact the repl only shows dependency on jline for the 2.10.4 scala
version:

<profile>
  <id>scala-2.10</id>
  <activation>
    <property><name>!scala-2.11</name></property>
  </activation>
  <properties>
    <scala.version>2.10.4</scala.version>
    <scala.binary.version>2.10</scala.binary.version>
    <jline.version>${scala.version}</jline.version>
    <jline.groupid>org.scala-lang</jline.groupid>
  </properties>
  <dependencyManagement>
    <dependencies>
      <dependency>
        <groupId>${jline.groupid}</groupId>
        <artifactId>jline</artifactId>
        <version>${jline.version}</version>
      </dependency>
    </dependencies>
  </dependencyManagement>
</profile>

So then it is not clear why this error is occurring. Pointers appreciated.


Re: Error: Exception in thread main java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration

2015-08-14 Thread Stephen Boesch
A NoClassDefFoundError differs from a ClassNotFoundException: it indicates
an error while initializing that class, even though the class definition was
found on the classpath. Please provide the full stack trace.

2015-08-14 4:59 GMT-07:00 stelsavva stel...@avocarrot.com:

 Hello, I am just starting out with spark streaming and Hbase/hadoop, i m
 writing a simple app to read from kafka and store to Hbase, I am having
 trouble submitting my job to spark.

 I 've downloaded Apache Spark 1.4.1 pre-build for hadoop 2.6

 I am building the project with mvn package

 and submitting the jar file with

  ~/Desktop/spark/bin/spark-submit --class org.example.main.scalaConsumer
 scalConsumer-0.0.1-SNAPSHOT.jar

 And then i am getting the error you see in the subject line. Is this a
 problem with my maven dependencies? do i need to install hadoop locally?
 And
 if so how can i add the hadoop classpath to the spark job?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Error-Exception-in-thread-main-java-lang-NoClassDefFoundError-org-apache-hadoop-hbase-HBaseConfiguran-tp24266.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Spark-submit not finding main class and the error reflects different path to jar file than specified

2015-08-06 Thread Stephen Boesch
Given the following command line to spark-submit:

bin/spark-submit --verbose --master local[2]--class
org.yardstick.spark.SparkCoreRDDBenchmark
/shared/ysgood/target/yardstick-spark-uber-0.0.1.jar

Here is the output:

NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes
ahead of assembly.
Using properties file: /shared/spark-1.4.1/conf/spark-defaults.conf
Adding default property: spark.akka.askTimeout=180
Adding default property: spark.master=spark://mellyrn.local:7077
Error: Cannot load main class from JAR
file:/shared/spark-1.4.1/org.yardstick.spark.SparkCoreRDDBenchmark
Run with --help for usage help or --verbose for debug output


The path
file:/shared/spark-1.4.1/org.yardstick.spark.SparkCoreRDDBenchmark  does
not seem to make sense. It  does not reflect the path to the file that was
specified on the submit-spark command line.

Note: when attempting to run that jar file via

java -classpath shared/ysgood/target/yardstick-spark-uber-0.0.1.jar
org.yardstick.spark.SparkCoreRDDBenchmark

Then the result is as expected: the main class starts to load and then
there is a NoClassDefFoundError on SparkConf.class (which is not
inside the jar). This shows the app jar is healthy.


Which directory contains third party libraries for Spark

2015-07-27 Thread Stephen Boesch
when using spark-submit: which directory contains third party libraries
that will be loaded on each of the slaves? I would like to scp one or more
libraries to each of the slaves instead of shipping the contents in the
application uber-jar.

Note: I did try adding to $SPARK_HOME/lib_managed/jars.   But the
spark-submit still results in a ClassNotFoundException for classes included
in the added library.
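
A hedged sketch of the usual alternatives (the jar path, class name and app
jar below are placeholders): either pass the library with --jars so
spark-submit ships it to the executors, or place it at an identical path on
every slave and point spark.executor.extraClassPath at it.

    bin/spark-submit \
      --jars /opt/libs/third-party-lib.jar \
      --class com.example.Main \
      app-uber.jar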


Re: spark benchmarking

2015-07-08 Thread Stephen Boesch
One  option is the databricks/spark-perf project
https://github.com/databricks/spark-perf

2015-07-08 11:23 GMT-07:00 MrAsanjar . afsan...@gmail.com:

 Hi all,
 What is the most common used tool/product to benchmark spark job?



Catalyst Errors when building spark from trunk

2015-07-07 Thread Stephen Boesch
The following errors are occurring upon building using mvn options  clean
package

Are there some requirements/restrictions on profiles/settings for catalyst
to build properly?

[error]
/shared/sparkup2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala:138:
value length is not a member of org.apache.spark.unsafe.types.UTF8String
[error]   buildCast[UTF8String](_, _.length() != 0)
[error]  ^
[error]
/shared/sparkup2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala:282:
value length is not a member of org.apache.spark.unsafe.types.UTF8String
[error]   val (st, end) = slicePos(start, length, () = s.length())
[error]   ^
[error]
/shared/sparkup2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala:283:
type mismatch;
[error]  found   : Any
[error]  required: Int
[error]   s.substring(st, end)
[error]   ^
[error]
/shared/sparkup2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala:283:
type mismatch;
[error]  found   : Any
[error]  required: Int
[error]   s.substring(st, end)
[error]   ^
[error]
/shared/sparkup2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala:304:
value length is not a member of org.apache.spark.unsafe.types.UTF8String
[error] if (string == null) null else
string.asInstanceOf[UTF8String].length
[error]   ^
[warn] three warnings found
[error] 5 errors found
[error] Compile failed at Jul 7, 2015 9:43:44 PM [19.378s]


Re: What does Spark is not just MapReduce mean? Isn't every Spark job a form of MapReduce?

2015-06-28 Thread Stephen Boesch
Vanilla map/reduce does not expose it: but hive on top of map/reduce has
superior partitioning (and bucketing) support to Spark.
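
A minimal sketch of the partitioner-aware case mentioned below (the data and
partition count are made up): because both RDDs share the same
HashPartitioner and are already materialized, the join can reuse that
partitioning instead of re-shuffling both sides.

    import org.apache.spark.HashPartitioner

    val part  = new HashPartitioner(8)
    val left  = sc.parallelize(Seq((1, "a"), (2, "b"))).partitionBy(part).cache()
    val right = sc.parallelize(Seq((1, "x"), (2, "y"))).partitionBy(part).cache()
    val joined = left.join(right)   // co-partitioned inputs: neither side is re-shuffled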

2015-06-28 13:44 GMT-07:00 Koert Kuipers ko...@tresata.com:

 spark is partitioner aware, so it can exploit a situation where 2 datasets
 are partitioned the same way (for example by doing a map-side join on
 them). map-red does not expose this.

 On Sun, Jun 28, 2015 at 12:13 PM, YaoPau jonrgr...@gmail.com wrote:

 I've heard Spark is not just MapReduce mentioned during Spark talks,
 but it
 seems like every method that Spark has is really doing something like (Map
 - Reduce) or (Map - Map - Map - Reduce) etc behind the scenes, with
 the
 performance benefit of keeping RDDs in memory between stages.

 Am I wrong about that?  Is Spark doing anything more efficiently than a
 series of Maps followed by a Reduce in memory?  What methods does Spark
 have
 that can't easily be mapped (with somewhat similar efficiency) to Map and
 Reduce in memory?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/What-does-Spark-is-not-just-MapReduce-mean-Isn-t-every-Spark-job-a-form-of-MapReduce-tp23518.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: Velox Model Server

2015-06-21 Thread Stephen Boesch
Oryx 2 has a scala client


https://github.com/OryxProject/oryx/blob/master/framework/oryx-api/src/main/scala/com/cloudera/oryx/api/




2015-06-20 11:39 GMT-07:00 Debasish Das debasish.da...@gmail.com:

 After getting used to Scala, writing Java is too much work :-)

 I am looking for scala based project that's using netty at its core (spray
 is one example).

 prediction.io is an option but that also looks quite complicated and not
 using all the ML features that got added in 1.3/1.4

 Velox built on top of ML / Keystone ML pipeline API and that's useful but
 it is still using javax servlets which is not netty based.

 On Sat, Jun 20, 2015 at 10:25 AM, Sandy Ryza sandy.r...@cloudera.com
 wrote:

 Oops, that link was for Oryx 1. Here's the repo for Oryx 2:
 https://github.com/OryxProject/oryx

 On Sat, Jun 20, 2015 at 10:20 AM, Sandy Ryza sandy.r...@cloudera.com
 wrote:

 Hi Debasish,

 The Oryx project (https://github.com/cloudera/oryx), which is Apache 2
 licensed, contains a model server that can serve models built with MLlib.

 -Sandy

 On Sat, Jun 20, 2015 at 8:00 AM, Charles Earl charles.ce...@gmail.com
 wrote:

 Is velox NOT open source?


 On Saturday, June 20, 2015, Debasish Das debasish.da...@gmail.com
 wrote:

 Hi,

 The demo of end-to-end ML pipeline including the model server
 component at Spark Summit was really cool.

 I was wondering if the Model Server component is based upon Velox or
 it uses a completely different architecture.

 https://github.com/amplab/velox-modelserver

 We are looking for an open source version of model server to build
 upon.

 Thanks.
 Deb



 --
 - Charles







Spark 1.3.1 bundle does not build - unresolved dependency

2015-06-01 Thread Stephen Boesch
I downloaded the 1.3.1 distro tarball

$ll ../spark-1.3.1.tar.gz
-rw-r-@ 1 steve  staff  8500861 Apr 23 09:58 ../spark-1.3.1.tar.gz

However the build on it is failing with an unresolved dependency:
*configuration
not public*

$ build/sbt   assembly -Dhadoop.version=2.5.2 -Pyarn -Phadoop-2.4

[error] (network-shuffle/*:update) sbt.ResolveException: *unresolved
dependency: *org.apache.spark#spark-network-common_2.10;1.3.1: *configuration
not public* in org.apache.spark#spark-network-common_2.10;1.3.1: 'test'. It
was required from org.apache.spark#spark-network-shuffle_2.10;1.3.1 test

Is there a known workaround for this?

thanks


Re: Intermittent difficulties for Worker to contact Master on same machine in standalone

2015-05-27 Thread Stephen Boesch
Thanks Yana,

   My current experience here is that after running some small spark-submit
based tests the Master once again stopped being reachable. No change in
the test setup. I restarted the Master/Worker and it is still not reachable.

What might be the variables here under which association with the
Master/Worker stops succeeding?

For reference here are the Master/worker


  501 34465 1   0 11:35AM ?? 0:06.50
/Library/Java/JavaVirtualMachines/jdk1.7.0_25.jdk/Contents/Home/bin/java
-cp classpath.. -Xms512m -Xmx512m -XX:MaxPermSize=128m
org.apache.spark.deploy.worker.Worker spark://mellyrn.local:7077
  501 34361 1   0 11:35AM ttys0180:07.08
/Library/Java/JavaVirtualMachines/jdk1.7.0_25.jdk/Contents/Home/bin/java
-cp classpath..  -Xms512m -Xmx512m -XX:MaxPermSize=128m
org.apache.spark.deploy.master.Master --ip mellyrn.local --port 7077
--webui-port 8080


15/05/27 11:36:37 INFO SparkUI: Started SparkUI at http://25.101.19.24:4040
15/05/27 11:36:37 INFO SparkContext: Added JAR
file:/shared/spark-perf/mllib-tests/target/mllib-perf-tests-assembly.jar at
http://25.101.19.24:60329/jars/mllib-perf-tests-assembly.jar with timestamp
1432751797662
15/05/27 11:36:37 INFO AppClient$ClientActor: Connecting to master
akka.tcp://sparkMaster@mellyrn.local:7077/user/Master...
15/05/27 11:36:37 WARN AppClient$ClientActor: Could not connect to
akka.tcp://sparkMaster@mellyrn.local:7077: akka.remote.InvalidAssociation:
Invalid address: akka.tcp://sparkMaster@mellyrn.local:7077
15/05/27 11:36:37 WARN Remoting: Tried to associate with unreachable remote
address [akka.tcp://sparkMaster@mellyrn.local:7077]. Address is now gated
for 5000 ms, all messages to this address will be delivered to dead
letters. Reason: Connection refused: mellyrn.local/25.101.19.24:7077
15/05/27 11:36:57 INFO AppClient$ClientActor: Connecting to master
akka.tcp://sparkMaster@mellyrn.local:7077/user/Master...
15/05/27 11:36:57 WARN AppClient$ClientActor: Could not connect to
akka.tcp://sparkMaster@mellyrn.local:7077: akka.remote.InvalidAssociation:
Invalid address: akka.tcp://sparkMaster@mellyrn.local:7077
15/05/27 11:36:57 WARN Remoting: Tried to associate with unreachable remote
address [akka.tcp://sparkMaster@mellyrn.local:7077]. Address is now gated
for 5000 ms, all messages to this address will be delivered to dead
letters. Reason: Connection refused: mellyrn.local/25.101.19.24:7077
15/05/27 11:37:17 INFO AppClient$ClientActor: Connecting to master
akka.tcp://sparkMaster@mellyrn.local:7077/user/Master...
15/05/27 11:37:17 WARN AppClient$ClientActor: Could not connect to
akka.tcp://sparkMaster@mellyrn.local:7077: akka.remote.InvalidAssociation:
Invalid address: akka.tcp://sparkMaster@mellyrn.local:7077
15/05/27 11:37:17 WARN Remoting: Tried to associate with unreachable remote
address [akka.tcp://sparkMaster@mellyrn.local:7077]. Address is now gated
for 5000 ms, all messages to this address will be delivered to dead
letters. Reason: Connection refused: mellyrn.local/25.101.19.24:7077
15/05/27 11:37:37 ERROR SparkDeploySchedulerBackend: Application has been
killed. Reason: All masters are unresponsive! Giving up.
15/05/27 11:37:37 WARN SparkDeploySchedulerBackend: Application ID is not
initialized yet.
1


Even when successful, the time for the Master to come up has a surprisingly
high variance. I am running on a single machine for which there is plenty
of RAM. Note that RAM was one problem before the present series: if RAM is
tight then the failure modes can be unpredictable. But now RAM is not
an issue: plenty is available for both Master and Worker.

Within the same hour period and starting/stopping maybe a dozen times, the
startup time for the Master may be a few seconds up to  a couple to several
minutes.

2015-05-20 7:39 GMT-07:00 Yana Kadiyska yana.kadiy...@gmail.com:

 But if I'm reading his email correctly he's saying that:

 1. The master and slave are on the same box (so network hiccups are
 unlikely culprit)
 2. The failures are intermittent -- i.e program works for a while then
 worker gets disassociated...

 Is it possible that the master restarted? We used to have problems like
 this where we'd restart the master process, it won't be listening on 7077
 for some time, but the worker process is trying to connect and by the time
 the master is up the worker has given up...


 On Wed, May 20, 2015 at 5:16 AM, Evo Eftimov evo.efti...@isecc.com
 wrote:

 Check whether the name can be resolved in the /etc/hosts file (or DNS) of
 the worker



 (the same btw applies for the Node where you run the driver app – all
 other nodes must be able to resolve its name)



 *From:* Stephen Boesch [mailto:java...@gmail.com]
 *Sent:* Wednesday, May 20, 2015 10:07 AM
 *To:* user
 *Subject:* Intermittent difficulties for Worker to contact Master on
 same machine in standalone





 What conditions would cause the following delays / failure for a
 standalone machine/cluster to have the Worker contact the Master?



 15/05/20 02:02:53 INFO WorkerWebUI

Re: Intermittent difficulties for Worker to contact Master on same machine in standalone

2015-05-27 Thread Stephen Boesch
Here is an example after git-cloning the latest 1.4.0-SNAPSHOT. The first 3
runs (FINISHED) were successful and connected quickly. The fourth run (ALIVE)
is failing on connection/association.


URL: spark://mellyrn.local:7077
REST URL: spark://mellyrn.local:6066 (cluster mode)
Workers: 1
Cores: 8 Total, 0 Used
Memory: 15.0 GB Total, 0.0 B Used
Applications: 0 Running, 3 Completed
Drivers: 0 Running, 0 Completed
Status: ALIVE
Workers

Worker Id                              Address          State  Cores       Memory
worker-20150527122155-10.0.0.3-60847   10.0.0.3:60847   ALIVE  8 (0 Used)  15.0 GB (0.0 B Used)

Running Applications
Application ID  Name  Cores  Memory per Node  Submitted Time  User  State  Duration

Completed Applications
Application ID           Name                                     Cores  Memory per Node  Submitted Time       User   State     Duration
app-20150527125945-0002  TestRunner: power-iteration-clustering   8      512.0 MB         2015/05/27 12:59:45  steve  FINISHED  7 s
app-20150527124403-0001  TestRunner: power-iteration-clustering   8      512.0 MB         2015/05/27 12:44:03  steve  FINISHED  6 s
app-20150527123822-      TestRunner: power-iteration-clustering   8      512.0 MB         2015/05/27 12:38:22  steve  FINISHED  6 s



2015-05-27 11:42 GMT-07:00 Stephen Boesch java...@gmail.com:

 Thanks Yana,

My current experience here is after running some small spark-submit
 based tests the Master once again stopped being reachable.  No change in
 the test setup.  I restarted Master/Worker and still not reachable.

 What might be the variables here in which association with the
 Master/Worker stops succeedng?

 For reference here are the Master/worker


   501 34465 1   0 11:35AM ?? 0:06.50
 /Library/Java/JavaVirtualMachines/jdk1.7.0_25.jdk/Contents/Home/bin/java
 -cp classpath.. -Xms512m -Xmx512m -XX:MaxPermSize=128m
 org.apache.spark.deploy.worker.Worker spark://mellyrn.local:7077
   501 34361 1   0 11:35AM ttys0180:07.08
 /Library/Java/JavaVirtualMachines/jdk1.7.0_25.jdk/Contents/Home/bin/java
 -cp classpath..  -Xms512m -Xmx512m -XX:MaxPermSize=128m
 org.apache.spark.deploy.master.Master --ip mellyrn.local --port 7077
 --webui-port 8080


 15/05/27 11:36:37 INFO SparkUI: Started SparkUI at
 http://25.101.19.24:4040
 15/05/27 11:36:37 INFO SparkContext: Added JAR
 file:/shared/spark-perf/mllib-tests/target/mllib-perf-tests-assembly.jar at
 http://25.101.19.24:60329/jars/mllib-perf-tests-assembly.jar with
 timestamp 1432751797662
 15/05/27 11:36:37 INFO AppClient$ClientActor: Connecting to master
 akka.tcp://sparkMaster@mellyrn.local:7077/user/Master...
 15/05/27 11:36:37 WARN AppClient$ClientActor: Could not connect to
 akka.tcp://sparkMaster@mellyrn.local:7077:
 akka.remote.InvalidAssociation: Invalid address:
 akka.tcp://sparkMaster@mellyrn.local:7077
 15/05/27 11:36:37 WARN Remoting: Tried to associate with unreachable
 remote address [akka.tcp://sparkMaster@mellyrn.local:7077]. Address is
 now gated for 5000 ms, all messages to this address will be delivered to
 dead letters. Reason: Connection refused: mellyrn.local/25.101.19.24:7077
 15/05/27 11:36:57 INFO AppClient$ClientActor: Connecting to master
 akka.tcp://sparkMaster@mellyrn.local:7077/user/Master...
 15/05/27 11:36:57 WARN AppClient$ClientActor: Could not connect to
 akka.tcp://sparkMaster@mellyrn.local:7077:
 akka.remote.InvalidAssociation: Invalid address:
 akka.tcp://sparkMaster@mellyrn.local:7077
 15/05/27 11:36:57 WARN Remoting: Tried to associate with unreachable
 remote address [akka.tcp://sparkMaster@mellyrn.local:7077]. Address is
 now gated for 5000 ms, all messages to this address will be delivered to
 dead letters. Reason: Connection refused: mellyrn.local/25.101.19.24:7077
 15/05/27 11:37:17 INFO AppClient$ClientActor: Connecting to master
 akka.tcp://sparkMaster@mellyrn.local:7077/user/Master...
 15/05/27 11:37:17 WARN AppClient$ClientActor: Could not connect to
 akka.tcp://sparkMaster@mellyrn.local:7077:
 akka.remote.InvalidAssociation: Invalid address:
 akka.tcp://sparkMaster@mellyrn.local:7077
 15/05/27 11:37:17 WARN Remoting: Tried to associate with unreachable
 remote address [akka.tcp://sparkMaster@mellyrn.local:7077]. Address is
 now gated for 5000 ms, all messages to this address will be delivered to
 dead letters. Reason: Connection refused: mellyrn.local/25.101.19.24:7077
 15/05/27 11:37:37 ERROR SparkDeploySchedulerBackend: Application has been
 killed. Reason: All masters are unresponsive! Giving up.
 15/05/27 11:37:37 WARN SparkDeploySchedulerBackend: Application ID is not
 initialized yet.
 1


 Even when successful, the time for the Master to come up has a
 surprisingly high variance. I am running on a single machine for which
 there is plenty of RAM. Note that was one problem before the present series
 :  if RAM is tight then the failure modes can be unpredictable. But now the
 RAM is not an issue: plenty available for both Master and Worker.

 Within the same hour period and starting/stopping maybe a dozen times, the
 startup time for the Master may be a few seconds up to  a couple to several
 minutes.

 2015-05-20 7:39 GMT-07:00

Intermittent difficulties for Worker to contact Master on same machine in standalone

2015-05-20 Thread Stephen Boesch
What conditions would cause the following delays / failure for a standalone
machine/cluster to have the Worker contact the Master?

15/05/20 02:02:53 INFO WorkerWebUI: Started WorkerWebUI at
http://10.0.0.3:8081
15/05/20 02:02:53 INFO Worker: Connecting to master
akka.tcp://sparkMaster@mellyrn.local:7077/user/Master...
15/05/20 02:02:53 WARN Remoting: Tried to associate with unreachable remote
address [akka.tcp://sparkMaster@mellyrn.local:7077]. Address is now gated
for 5000 ms, all messages to this address will be delivered to dead
letters. Reason: Connection refused: mellyrn.local/10.0.0.3:7077
15/05/20 02:03:04 INFO Worker: Retrying connection to master (attempt # 1)
..
..
15/05/20 02:03:26 INFO Worker: Retrying connection to master (attempt # 3)
15/05/20 02:03:26 INFO Worker: Connecting to master
akka.tcp://sparkMaster@mellyrn.local:7077/user/Master...
15/05/20 02:03:26 WARN Remoting: Tried to associate with unreachable remote
address [akka.tcp://sparkMaster@mellyrn.local:7077]. Address is now gated
for 5000 ms, all messages to this address will be delivered to dead
letters. Reason: Connection refused: mellyrn.local/10.0.0.3:7077
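
No resolution appears later in this thread. The repeated "Connection refused:
mellyrn.local/10.0.0.3:7077" lines suggest the Worker resolves the Master's
hostname to an address the Master has not (yet) bound to. As a sketch only,
not something proposed in the thread, one common way to rule that out with
the standalone scripts of this era is to pin both daemons to one explicit
address in conf/spark-env.sh (variable names differ slightly across releases;
10.0.0.3 is simply the address from the log above):

# conf/spark-env.sh -- sketch only, values are assumptions for this setup
SPARK_MASTER_IP=10.0.0.3     # address the standalone Master binds to
                             # (renamed SPARK_MASTER_HOST in later releases)
SPARK_MASTER_PORT=7077
SPARK_LOCAL_IP=10.0.0.3      # address the Worker binds to and advertises

# then restart the daemons so the settings take effect
./sbin/stop-all.sh
./sbin/start-master.sh
./sbin/start-slaves.sh       # workers connect to spark://10.0.0.3:7077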


Re: Code error

2015-05-19 Thread Stephen Boesch
Hi Ricardo,
 Providing the error output would help. But in any case you need to do a
collect() on the RDD returned from computeCost.
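
One thing that stands out in the quoted code below is that filter_data is
defined but never used: parsedData is built from data, so the CSV header line
still reaches toDouble and will fail to parse. A minimal sketch of the parsing
step with the header actually dropped, assuming that was the intent (the path
and two-column layout are taken from the original post):

// Sketch only: run inside spark-shell, where sc is already defined
import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile("/user/p_loadbd/fraude5.csv")
  .map(_.toLowerCase.split(','))
  .map(x => x(0) + "," + x(1))
val header = data.first()
val filter_data = data.filter(_ != header)   // drop the header line
val parsedData = filter_data                 // note: filter_data, not data
  .map(s => Vectors.dense(s.split(',').map(_.toDouble)))
  .cache()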

2015-05-19 11:59 GMT-07:00 Ricardo Goncalves da Silva 
ricardog.si...@telefonica.com:

  Hi,



 Can anybody see what’s wrong in this piece of code:





 ./bin/spark-shell --num-executors 2 --executor-memory 512m --master yarn-client

 import org.apache.spark.mllib.clustering.KMeans
 import org.apache.spark.mllib.linalg.Vectors

 val data = sc.textFile("/user/p_loadbd/fraude5.csv")
   .map(x => x.toLowerCase.split(','))
   .map(x => x(0) + "," + x(1))
 val header = data.first()
 val filter_data = data.filter(x => x != header)
 val parsedData = data.map(s => Vectors.dense(s.split(',').map(_.toDouble))).cache()

 val numClusters = 2
 val numIterations = 20
 val clusters = KMeans.train(parsedData, numClusters, numIterations)

 val WSSSE = clusters.computeCost(parsedData)
 println("Within Set Sum of Squared Errors = " + WSSSE)



 Thanks.






 *Ricardo Goncalves da Silva*
 Lead Data Scientist *|* Seção de Desenvolvimento de Sistemas de

 Business Intelligence – Projetos de Inovação *| *IDPB02

 Av. Eng. Luis Carlos Berrini, 1.376 – 7º – 04571-000 - SP

 ricardog.si...@telefonica.com *|* www.telefonica.com.br

 Tel +55 11 3430 4955 *| *Cel +55 11 94292 9526





 --


 The information contained in this transmission is privileged and
 confidential information intended only for the use of the individual or
 entity named above. If the reader of this message is not the intended
 recipient, you are hereby notified that any dissemination, distribution or
 copying of this communication is strictly prohibited. If you have received
 this transmission in error, do not read it. Please immediately reply to the
 sender that you have received this communication in error and then delete
 it.




Re: Building Spark

2015-05-13 Thread Stephen Boesch
Hi Akhil,   Building with sbt tends to need around 3.5 GB, whereas maven's
requirements are much lower, around 1.7 GB. So try using maven.

For reference, I have the following settings and both builds do compile; sbt
would not work with lower values.


$echo $SBT_OPTS
-Xmx3012m -XX:MaxPermSize=512m -XX:ReservedCodeCacheSize=512m
$echo $MAVEN_OPTS
-Xmx1280m -XX:MaxPermSize=384m
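
For reference, a sketch of how those settings are typically applied when
kicking off the build; the script paths and the -DskipTests flag are
assumptions that varied a little across releases (build/mvn and build/sbt in
later source trees), so adjust to your checkout:

# Maven route (lower memory footprint), mirroring the MAVEN_OPTS above
export MAVEN_OPTS="-Xmx1280m -XX:MaxPermSize=384m"
mvn -DskipTests clean package

# sbt route needs the larger heap shown above
export SBT_OPTS="-Xmx3012m -XX:MaxPermSize=512m -XX:ReservedCodeCacheSize=512m"
build/sbt assembly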

2015-05-13 5:57 GMT-07:00 Heisenberg Bb hbbalg...@gmail.com:

 I tried to build Spark on my local machine, Ubuntu 14.04 (4 GB RAM), and my
 system hangs (freezes). When I monitored the system processes, the build
 process was consuming 85% of my memory. Why does it need so many resources?
 Is there a more efficient way to build Spark?

 Thanks
 Akhil



  1   2   >