Re: Spark MySQL Invalid DateTime value killing job

2019-06-05 Thread Anthony May
Murphy's Law striking after asking the question, I just discovered the
solution:
The JDBC URL should set the zeroDateTimeBehavior option.
https://dev.mysql.com/doc/connector-j/5.1/en/connector-j-reference-configuration-properties.html
https://stackoverflow.com/questions/11133759/-00-00-00-can-not-be-represented-as-java-sql-timestamp-error
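A minimal sketch of the fix, assuming a SparkSession named spark and placeholder
connection details (convertToNull is one documented value; exception and round
are the others):

    // Hedged sketch, not our exact job: zeroDateTimeBehavior=convertToNull maps
    // the zero value '0000-00-00 00:00:00' to NULL instead of throwing.
    val df = spark.read
      .format("jdbc")
      .option("url",
        "jdbc:mysql://dbhost:3306/legacydb?zeroDateTimeBehavior=convertToNull")
      .option("dbtable", "events")          // placeholder table name
      .option("user", "scraper")            // placeholder credentials
      .option("password", "secret")
      .load()

    df.write.json("/data/json/events")      // placeholder output path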

On Wed, Jun 5, 2019 at 6:29 PM Anthony May  wrote:

> Hi,
>
> We have a legacy process of scraping a MySQL Database. The Spark job uses
> the DataFrame API and MySQL JDBC driver to read the tables and save them as
> JSON files. One table has DateTime columns that contain values invalid for
> java.sql.Timestamp so it's throwing the exception:
> java.sql.SQLException: Value '0000-00-00 00:00:00' can not be represented
> as java.sql.Timestamp
>
> Unfortunately, I can't edit the values in the table to make them valid.
> There doesn't seem to be a way to specify row level exception handling in
> the DataFrame API. Is there a way to handle this that would scale for
> hundreds of tables?
>
> Any help is appreciated.
>
> Anthony
>


Spark MySQL Invalid DateTime value killing job

2019-06-05 Thread Anthony May
Hi,

We have a legacy process of scraping a MySQL Database. The Spark job uses
the DataFrame API and MySQL JDBC driver to read the tables and save them as
JSON files. One table has DateTime columns that contain values invalid for
java.sql.Timestamp so it's throwing the exception:
java.sql.SQLException: Value '0000-00-00 00:00:00' can not be represented
as java.sql.Timestamp

Unfortunately, I can't edit the values in the table to make them valid.
There doesn't seem to be a way to specify row level exception handling in
the DataFrame API. Is there a way to handle this that would scale for
hundreds of tables?

Any help is appreciated.

Anthony


Re: Spark scala development in Sbt vs Maven

2018-03-05 Thread Anthony May
We use sbt for easy cross-project dependencies with multiple Scala versions
in a mono-repo, for which it is pretty good, albeit with some quirks. As our
projects have matured and change less, we have moved away from cross-project
dependencies, but they were extremely useful early in the projects. We knew
that a lot of this was possible in Maven/Gradle but did not want to go
through the hackery required to get it working.
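For illustration, a minimal build.sbt sketch of that kind of setup (module
names and Scala versions are hypothetical):

    // Hedged sketch of a mono-repo with cross-built Scala versions and a
    // cross-project dependency; names and versions are made up.
    lazy val core = (project in file("core"))
      .settings(
        name := "core",
        crossScalaVersions := Seq("2.11.12", "2.12.4")
      )

    lazy val jobs = (project in file("jobs"))
      .dependsOn(core)
      .settings(
        name := "jobs",
        crossScalaVersions := Seq("2.11.12", "2.12.4")
      )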

On Mon, 5 Mar 2018 at 09:49 Sean Owen  wrote:

> Spark uses Maven as the primary build, but SBT works as well. It reads the
> Maven build to some extent.
>
> Zinc incremental compilation works with Maven (with the Scala plugin for
> Maven).
>
> Myself, I prefer Maven, for some of the reasons it is the main build in
> Spark: declarative builds end up being a good thing. You want builds very
> standard. I think the flexibility of writing code to express your build
> just gives a lot of rope to hang yourself with, and recalls the old days of
> Ant builds, where no two builds you'd encounter looked alike when doing the
> same thing.
>
> If by cross publishing you mean handling different scala versions, yeah
> SBT is more aware of that. The Spark Maven build manages to handle that
> with some hacking.
>
>
> On Mon, Mar 5, 2018 at 9:56 AM Jörn Franke  wrote:
>
>> I think most of the Scala development in Spark happens with sbt - in the
>> open source world.
>>
>>  However, you can do it with Gradle and Maven as well. It depends on your
>> organization etc. what your standard is.
>>
>> Some things might be more cumbersome to reach in non-sbt Scala
>> scenarios, but this is improving more and more.
>>
>> > On 5. Mar 2018, at 16:47, Swapnil Shinde 
>> wrote:
>> >
>> > Hello
>> >SBT's incremental compilation was a huge plus for building Spark + Scala
>> applications in SBT for some time. It seems Maven can also support
>> incremental compilation with the Zinc server. Considering that, I am
>> interested to know the community's experience -
>> >
>> > 1. The Spark documentation says SBT is used by many contributors for
>> day-to-day development, mainly because of incremental compilation.
>> Considering Maven supports incremental compilation through Zinc, do
>> contributors prefer to change from SBT to Maven?
>> >
>> > 2. Any issues /learning experiences with Maven + Zinc?
>> >
>> > 3. Any other reasons to use SBT over Maven for Scala development?
>> >
>> > I understand SBT has many other advantages over Maven, like cross-version
>> publishing etc., but incremental compilation is a major need for us. I
>> am more interested to know why Spark contributors/committers prefer SBT for
>> day-to-day development.
>> >
>> > Any help and advice would help us direct our evaluations in the right
>> direction.
>> >
>> > Thanks
>> > Swapnil
>>
>> -
>>
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: Starting a new Spark codebase, Python or Scala / Java?

2016-11-21 Thread Anthony May
A sensible default strategy is to use the language in which a system was
developed, or a highly compatible one. For Spark that would be Scala; however,
I assume you don't currently know Scala to the same degree as Python, or at
all. In that case, to help you make the decision, you should also consider
your personal/team productivity and project constraints.
If you have the time, or you require the bleeding-edge features and
performance, then learning or strengthening your Scala is worth it and you
should use the Scala API.
If you're already very productive in Python, have tighter time constraints,
don't need the bleeding-edge features, and maximum performance isn't a high
priority, then I'd recommend using the Python API.

On Mon, 21 Nov 2016 at 11:58 Jon Gregg  wrote:

> Spark is written in Scala, so yes it's still the strongest option.  You
> also get the Dataset type with Scala (compile time type-safety), and that's
> not an available feature with Python.
>
> That said, I think the Python API is a viable candidate if you use Pandas
> for Data Science.  There are similarities between the DataFrame and Pandas
> APIs, and you can convert a Spark DataFrame to a Pandas DataFrame.
>
> On Mon, Nov 21, 2016 at 1:51 PM, Brandon White 
> wrote:
>
> Hello all,
>
> I will be starting a new Spark codebase and I would like to get opinions
> on using Python over Scala. Historically, the Scala API has always been the
> strongest interface to Spark. Is this still true? Are there still many
> benefits and additional features in the Scala API that are not available in
> the Python API? Are there any performance concerns using the Python API
> that do not exist when using the Scala API? Anything else I should know
> about?
>
> I appreciate any insight you have on using the Scala API over the Python
> API.
>
> Brandon
>
>
>


Re: Question on Spark shell

2016-07-11 Thread Anthony May
I see. The title of your original email was "Spark Shell", which is a Spark
REPL environment based on the Scala shell, hence my misunderstanding.

You should have the same output starting the application on the console.
You are not seeing any output?

On Mon, 11 Jul 2016 at 11:55 Sivakumaran S <siva.kuma...@me.com> wrote:

> I am running a spark streaming application using Scala in the IntelliJ
> IDE. I can see the Spark output in the IDE itself (aggregation and stuff).
> I want the spark server logging (INFO, WARN, etc) to be displayed in screen
> when I start the master in the console. For example, when I start a kafka
> cluster, the prompt is not returned and the debug log is printed to the
> terminal. I want that set up with my spark server.
>
> I hope that explains my retrograde requirement :)
>
>
>
> On 11-Jul-2016, at 6:49 PM, Anthony May <anthony...@gmail.com> wrote:
>
> Starting the Spark Shell gives you a Spark Context to play with straight
> away. The output is printed to the console.
>
> On Mon, 11 Jul 2016 at 11:47 Sivakumaran S <siva.kuma...@me.com> wrote:
>
>> Hello,
>>
>> Is there a way to start the spark server with the log output piped to
>> screen? I am currently running spark in the standalone mode on a single
>> machine.
>>
>> Regards,
>>
>> Sivakumaran
>>
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>


Re: Question on Spark shell

2016-07-11 Thread Anthony May
Starting the Spark Shell gives you a Spark Context to play with straight
away. The output is printed to the console.
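For example, a minimal session (assuming a local or standalone setup); the
pre-created SparkContext is bound to sc and results print straight to the
console:

    // Inside spark-shell; `sc` is already available, nothing to construct.
    sc.parallelize(1 to 100).count()   // prints: res0: Long = 100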

On Mon, 11 Jul 2016 at 11:47 Sivakumaran S  wrote:

> Hello,
>
> Is there a way to start the spark server with the log output piped to
> screen? I am currently running spark in the standalone mode on a single
> machine.
>
> Regards,
>
> Sivakumaran
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: JDBC Create Table

2016-05-27 Thread Anthony May
Hi Andrés,

What error are you seeing? Can you paste the stack trace?

Anthony

On Fri, 27 May 2016 at 08:37 Andrés Ivaldi  wrote:

> Hello, yesterday I updated Spark from 1.6.0 to 1.6.1 and my tests started to
> fail because it is not possible to create new tables in SQL Server. I'm using
> SaveMode.Overwrite as in the 1.6.0 version.
>
> Any idea?
>
> regards
>
> --
> Ing. Ivaldi Andres
>


Re: Insert into JDBC

2016-05-26 Thread Anthony May
It's on the 1.6 branch
On Thu, May 26, 2016 at 4:43 PM Andrés Ivaldi <iaiva...@gmail.com> wrote:

> I see. I'm using Spark 1.6.0 and that change seems to be for 2.0, or maybe
> it's in 1.6.1, looking at the history.
> Thanks, I'll see about updating Spark to 1.6.1.
>
> On Thu, May 26, 2016 at 3:33 PM, Anthony May <anthony...@gmail.com> wrote:
>
>> It doesn't appear to be configurable, but it is inserting by column name:
>>
>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L102
>>
>> On Thu, 26 May 2016 at 16:02 Andrés Ivaldi <iaiva...@gmail.com> wrote:
>>
>>> Hello,
>>>  I realize that when the dataframe executes an insert it inserts by schema
>>> column order instead of by name, i.e.
>>>
>>> dataframe.write.mode(saveMode).jdbc(url, table, properties)
>>>
>>> Reading the profiler the execution is
>>>
>>> insert into TableName values(a,b,c..)
>>>
>>> what i need is
>>> insert into TableNames (colA,colB,colC) values(a,b,c)
>>>
>>> could be some configuration?
>>>
>>> regards.
>>>
>>> --
>>> Ing. Ivaldi Andres
>>>
>>
>
>
> --
> Ing. Ivaldi Andres
>


Re: Insert into JDBC

2016-05-26 Thread Anthony May
It doesn't appear to be configurable, but it is inserting by column name:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L102
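As a hedged illustration (connection details, table and column names are
placeholders), selecting the target table's columns explicitly keeps the
DataFrame schema aligned with the generated INSERT:

    // Sketch only, not the code from this thread.
    import java.util.Properties
    import org.apache.spark.sql.SaveMode

    val props = new Properties()
    props.setProperty("user", "user")          // placeholder credentials
    props.setProperty("password", "password")

    dataframe
      .select("colA", "colB", "colC")          // match the target table's columns
      .write
      .mode(SaveMode.Append)
      .jdbc("jdbc:sqlserver://dbhost:1433;databaseName=mydb", "TableNames", props)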

On Thu, 26 May 2016 at 16:02 Andrés Ivaldi  wrote:

> Hello,
>  I realize that when the dataframe executes an insert it inserts by schema
> column order instead of by name, i.e.
>
> dataframe.write.mode(saveMode).jdbc(url, table, properties)
>
> Reading the profiler the execution is
>
> insert into TableName values(a,b,c..)
>
> what i need is
> insert into TableNames (colA,colB,colC) values(a,b,c)
>
> could be some configuration?
>
> regards.
>
> --
> Ing. Ivaldi Andres
>


Re: Tracking / estimating job progress

2016-05-13 Thread Anthony May
It looks like it might only be available via REST,
http://spark.apache.org/docs/latest/monitoring.html#rest-api
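For example, a hedged sketch of polling the jobs endpoint of a running
application (host, port and application id are placeholders):

    // The monitoring REST API exposes per-job task counts at
    // /api/v1/applications/[app-id]/jobs on the driver UI (port 4040 by default).
    import scala.io.Source

    val appId = "app-20160513101500-0001"      // hypothetical application id
    val url   = s"http://driver-host:4040/api/v1/applications/$appId/jobs"
    val json  = Source.fromURL(url).mkString
    println(json)  // JSON array with fields such as numTasks and numCompletedTasks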

On Fri, 13 May 2016 at 11:24 Dood@ODDO <oddodao...@gmail.com> wrote:

> On 5/13/2016 10:16 AM, Anthony May wrote:
> >
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkStatusTracker
> >
> > Might be useful
>
> How do you use it? You cannot instantiate the class - is the constructor
> private? Thanks!
>
> >
> > On Fri, 13 May 2016 at 11:11 Ted Yu <yuzhih...@gmail.com> wrote:
> >
> > Have you looked at
> > core/src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala ?
> >
> > Cheers
> >
> > On Fri, May 13, 2016 at 10:05 AM, Dood@ODDO <oddodao...@gmail.com> wrote:
> >
> > I provide a RESTful API interface from scalatra for launching
> > Spark jobs - part of the functionality is tracking these jobs.
> > What API is available to track the progress of a particular
> > spark application? How about estimating where in the total job
> > progress the job is?
> >
> > Thanks!
> >
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
> >
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Tracking / estimating job progress

2016-05-13 Thread Anthony May
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkStatusTracker

Might be useful
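A rough sketch of how it can be used, assuming an existing SparkContext named
sc (the tracker is obtained from the context rather than constructed directly):

    // Hedged sketch: report task completion per active job and stage.
    val tracker = sc.statusTracker
    for {
      jobId   <- tracker.getActiveJobIds()
      job     <- tracker.getJobInfo(jobId)
      stageId <- job.stageIds()
      stage   <- tracker.getStageInfo(stageId)
    } {
      println(s"job $jobId stage $stageId: " +
        s"${stage.numCompletedTasks()} of ${stage.numTasks()} tasks complete")
    }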

On Fri, 13 May 2016 at 11:11 Ted Yu  wrote:

> Have you looked
> at core/src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala ?
>
> Cheers
>
> On Fri, May 13, 2016 at 10:05 AM, Dood@ODDO  wrote:
>
>> I provide a RESTful API interface from scalatra for launching Spark jobs
>> - part of the functionality is tracking these jobs. What API is available
>> to track the progress of a particular spark application? How about
>> estimating where in the total job progress the job is?
>>
>> Thanks!
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: killing spark job which is submitted using SparkSubmit

2016-05-06 Thread Anthony May
Making the master yarn-cluster means that the driver is also running on
YARN, not just the executor nodes. It is then independent of your application
and can only be killed via YARN commands, or ends on its own if it's a batch
job that completes. The simplest way to tie the driver to your app is to pass
yarn-client as the master instead, as in the sketch below.
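A hedged sketch in Scala of the same launch with yarn-client (reusing the
arguments from your message; values are illustrative):

    // The driver now lives in this JVM and dies with the application.
    import org.apache.spark.deploy.SparkSubmit

    val args = Array(
      "--master", "yarn-client",               // instead of yarn-cluster
      "--name", "pysparkexample",
      "--executor-memory", "1G",
      "--driver-memory", "1G",
      "--conf", "spark.eventLog.enabled=true",
      "--verbose",
      "pi.py")
    SparkSubmit.main(args)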
On Fri, May 6, 2016 at 2:00 PM satish saley <satishsale...@gmail.com> wrote:

> Hi Anthony,
>
> I am passing
>
> --master
> yarn-cluster
> --name
> pysparkexample
> --executor-memory
> 1G
> --driver-memory
> 1G
> --conf
> spark.yarn.historyServer.address=http://localhost:18080
> --conf
> spark.eventLog.enabled=true
>
> --verbose
>
> pi.py
>
>
> I am able to run the job successfully. I just want to get it killed 
> automatically whenever I kill my application.
>
>
> On Fri, May 6, 2016 at 11:58 AM, Anthony May <anthony...@gmail.com> wrote:
>
>> Greetings Satish,
>>
>> What are the arguments you're passing in?
>>
>> On Fri, 6 May 2016 at 12:50 satish saley <satishsale...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> I am submitting a Spark job using SparkSubmit. When I kill my
>>> application, it does not kill the corresponding Spark job. How would I kill
>>> the corresponding Spark job? I know one way is to use SparkSubmit again
>>> with appropriate options. Is there any way, though, by which I can tell
>>> SparkSubmit this at the time of job submission itself? Here is my code:
>>>
>>>
>>> import org.apache.spark.deploy.SparkSubmit;
>>>
>>> class MyClass {
>>>     public static void main(String[] args) {
>>>         // preparing args
>>>         SparkSubmit.main(args);
>>>     }
>>> }
>>>
>>>
>


Re: killing spark job which is submitted using SparkSubmit

2016-05-06 Thread Anthony May
Greetings Satish,

What are the arguments you're passing in?

On Fri, 6 May 2016 at 12:50 satish saley  wrote:

> Hello,
>
> I am submitting a Spark job using SparkSubmit. When I kill my application,
> it does not kill the corresponding Spark job. How would I kill the
> corresponding Spark job? I know one way is to use SparkSubmit again with
> appropriate options. Is there any way, though, by which I can tell SparkSubmit
> this at the time of job submission itself? Here is my code:
>
>
> import org.apache.spark.deploy.SparkSubmit;
>
> class MyClass {
>     public static void main(String[] args) {
>         // preparing args
>         SparkSubmit.main(args);
>     }
> }
>
>


Re: Adhoc queries on Spark 2.0 with Structured Streaming

2016-05-06 Thread Anthony May
Yeah, there isn't even an RC yet, and no documentation, but you can work off
the code base and test suites:
https://github.com/apache/spark
And this might help:
https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/streaming/DataFrameReaderWriterSuite.scala

On Fri, 6 May 2016 at 11:07 Deepak Sharma  wrote:

> Spark 2.0 is yet to come out for public release.
> I am waiting to get my hands on it as well.
> Please do let me know if I can download the source and build Spark 2.0 from
> GitHub.
>
> Thanks
> Deepak
>
> On Fri, May 6, 2016 at 9:51 PM, Sunita Arvind 
> wrote:
>
>> Hi All,
>>
>> We are evaluating a few real-time streaming query engines and Spark is my
>> personal choice. The addition of ad hoc queries is what is getting me
>> further excited about it; however, the talks I have heard so far only
>> mention it but do not provide details. I need to build a prototype to
>> ensure it works for our use cases.
>>
>> Can someone point me to relevant material for this.
>>
>> regards
>> Sunita
>>
>
>
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net
>