Re: SparkSession replace SQLContext

2016-07-05 Thread Romi Kuntsman
You could also argue that a whole "Migrating from 1.6 to 2.0" section is
missing there:
https://spark.apache.org/docs/2.0.0-preview/sql-programming-guide.html#migration-guide

*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com
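
For anyone hitting the same gap before the guide is updated, a minimal sketch of
the change, based on the 2.0.0-preview JavaDoc linked above (the JSON path is
just the one used in the standard examples):

// Spark 1.6 style: separate SparkContext + SQLContext
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("my-app"))
val sqlContext = new SQLContext(sc)
val people16 = sqlContext.read.json("examples/src/main/resources/people.json")

// Spark 2.0 style: SparkSession is the single entry point
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("my-app").getOrCreate()
val people20 = spark.read.json("examples/src/main/resources/people.json")
// The older contexts remain reachable for compatibility via spark.sqlContext
// and spark.sparkContext.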

On Tue, Jul 5, 2016 at 12:24 PM, nihed mbarek <nihe...@gmail.com> wrote:

> Hi,
>
> I just discovered that SparkSession will replace SQLContext in Spark 2.0.
> The JavaDoc is clear:
> https://spark.apache.org/docs/2.0.0-preview/api/java/org/apache/spark/sql/SparkSession.html
> but there is no mention of it in the SQL programming guide:
>
> https://spark.apache.org/docs/2.0.0-preview/sql-programming-guide.html#starting-point-sqlcontext
>
> Is it possible to update the documentation before the release?
>
>
> Thank you
>
> --
>
> MBAREK Med Nihed,
> Fedora Ambassador, TUNISIA, Northern Africa
> http://www.nihed.com
>
> <http://tn.linkedin.com/in/nihed>
>
>


Re: [discuss] ending support for Java 7 in Spark 2.0

2016-03-24 Thread Romi Kuntsman
+1 for Java 8 only

I think it will make it easier to build a unified API for Java and Scala,
instead of Java wrappers over the Scala API.
On Mar 24, 2016 11:46 AM, "Stephen Boesch"  wrote:

> +1 for Java 8 only, +1 for 2.11+ only. At this point Scala libraries
> supporting only 2.10 are typically less active and/or poorly maintained.
> That trend will only continue when considering the lifespan of Spark 2.X.
>
> 2016-03-24 11:32 GMT-07:00 Steve Loughran :
>
>>
>> On 24 Mar 2016, at 15:27, Koert Kuipers  wrote:
>>
>> I think the arguments are convincing, but it also makes me wonder if I
>> live in some kind of alternate universe... We deploy on customers' clusters,
>> where the OS, Python version, Java version and Hadoop distro are not chosen
>> by us - so think CentOS 6, CDH5 or HDP 2.3, Java 7 and Python 2.6. We simply
>> have access to a single proxy machine and launch through YARN. Asking them
>> to upgrade Java is pretty much out of the question, or a 6+ month ordeal. Of
>> the 10 client clusters I can think of off the top of my head, all of them are
>> on Java 7 and none are on Java 8. So by doing this you would make Spark 2
>> basically unusable for us (unless most of them have plans to upgrade to
>> Java 8 in the near term - I will ask around and report back...).
>>
>>
>>
>> It's not actually mandatory for the process executing in the YARN cluster
>> to run with the same JVM as the rest of the Hadoop stack; all that is
>> needed is for the environment variables to set up JAVA_HOME and PATH.
>> Switching JVMs is not something YARN makes easy to do, but it may be
>> possible, especially if Spark itself provides some hooks, so you don't have
>> to manually play with setting things up. That may be something which could
>> significantly ease adoption of Spark 2 in YARN clusters. Same for Python.
>>
>> This is something I could probably help others to address
>>
>>
>
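
For context on the JVM-switching idea in the quoted reply above, a rough sketch
of the kind of settings involved, assuming a Java 8 JDK is already unpacked at
/opt/jdk1.8.0 on every node (that path is hypothetical, and whether these two
properties alone are sufficient depends on how the cluster launches containers):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  // environment of the YARN application master container
  .set("spark.yarn.appMasterEnv.JAVA_HOME", "/opt/jdk1.8.0")
  // environment of the executor containers
  .set("spark.executorEnv.JAVA_HOME", "/opt/jdk1.8.0")

In practice these would more likely be passed with --conf on spark-submit or put
in spark-defaults.conf, since application-master settings need to be known at
submission time.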


Re: Spark 1.6.1

2016-02-22 Thread Romi Kuntsman
Sounds fair. Is it to avoid cluttering maven central with too many
intermediate versions?

What do I need to add to my pom.xml to make it work?

*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com
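
For reference, pulling an RC from the ASF staging repository mentioned in the
reply below requires a repository entry in the pom.xml. This is only a sketch:
each RC vote announces its own repository URL, so the one here is a placeholder.

<repositories>
  <repository>
    <id>apache-spark-staging</id>
    <url>https://repository.apache.org/content/repositories/orgapachespark-XXXX/</url>
    <releases><enabled>true</enabled></releases>
    <snapshots><enabled>false</enabled></snapshots>
  </repository>
</repositories>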

On Tue, Feb 23, 2016 at 9:34 AM, Reynold Xin <r...@databricks.com> wrote:

> We usually publish to a staging maven repo hosted by the ASF (not maven
> central).
>
>
>
> On Mon, Feb 22, 2016 at 11:32 PM, Romi Kuntsman <r...@totango.com> wrote:
>
>> Is it possible to make RC versions available via Maven? (many projects do
>> that)
>> That will make integration much easier, so many more people can test the
>> version before the final release.
>> Thanks!
>>
>> *Romi Kuntsman*, *Big Data Engineer*
>> http://www.totango.com
>>
>> On Tue, Feb 23, 2016 at 8:07 AM, Luciano Resende <luckbr1...@gmail.com>
>> wrote:
>>
>>>
>>>
>>> On Mon, Feb 22, 2016 at 9:08 PM, Michael Armbrust <
>>> mich...@databricks.com> wrote:
>>>
>>>> An update: people.apache.org has been shut down so the release scripts
>>>> are broken. Will try again after we fix them.
>>>>
>>>>
>>> If you skip uploading to people.a.o, it should still be available in
>>> nexus for review.
>>>
>>> The other option is to add the RC into
>>> https://dist.apache.org/repos/dist/dev/
>>>
>>>
>>>
>>> --
>>> Luciano Resende
>>> http://people.apache.org/~lresende
>>> http://twitter.com/lresende1975
>>> http://lresende.blogspot.com/
>>>
>>
>>
>


Re: Spark 1.6.1

2016-02-22 Thread Romi Kuntsman
Is it possible to make RC versions available via Maven? (many projects do
that)
That will make integration much easier, so many more people can test the
version before the final release.
Thanks!

*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com

On Tue, Feb 23, 2016 at 8:07 AM, Luciano Resende <luckbr1...@gmail.com>
wrote:

>
>
> On Mon, Feb 22, 2016 at 9:08 PM, Michael Armbrust <mich...@databricks.com>
> wrote:
>
>> An update: people.apache.org has been shut down so the release scripts
>> are broken. Will try again after we fix them.
>>
>>
> If you skip uploading to people.a.o, it should still be available in nexus
> for review.
>
> The other option is to add the RC into
> https://dist.apache.org/repos/dist/dev/
>
>
>
> --
> Luciano Resende
> http://people.apache.org/~lresende
> http://twitter.com/lresende1975
> http://lresende.blogspot.com/
>


Re: Spark 1.6.1

2016-02-02 Thread Romi Kuntsman
Hi Michael,
What about the memory leak bug?
https://issues.apache.org/jira/browse/SPARK-11293
Even after the memory rewrite in 1.6.0, it still happens in some cases.
Will it be fixed for 1.6.1?
Thanks,

*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com

On Mon, Feb 1, 2016 at 9:59 PM, Michael Armbrust <mich...@databricks.com>
wrote:

> We typically do not allow changes to the classpath in maintenance releases.
>
> On Mon, Feb 1, 2016 at 8:16 AM, Hamel Kothari <hamelkoth...@gmail.com>
> wrote:
>
>> I noticed that the Jackson dependency was bumped to 2.5 in master for
>> something spark-streaming related. Is there any reason that this upgrade
>> can't be included with 1.6.1?
>>
>> According to later comments on this thread
>> (https://issues.apache.org/jira/browse/SPARK-8332) and my personal
>> experience, using Spark with Jackson 2.5 hasn't caused any issues, and
>> it does have some useful new features. It should be fully backwards
>> compatible according to the Jackson folks.
>>
>> On Mon, Feb 1, 2016 at 10:29 AM Ted Yu <yuzhih...@gmail.com> wrote:
>>
>>> SPARK-12624 has been resolved.
>>> According to Wenchen, SPARK-12783 is fixed in 1.6.0 release.
>>>
>>> Are there other blockers for Spark 1.6.1 ?
>>>
>>> Thanks
>>>
>>> On Wed, Jan 13, 2016 at 5:39 PM, Michael Armbrust <
>>> mich...@databricks.com> wrote:
>>>
>>>> Hey All,
>>>>
>>>> While I'm not aware of any critical issues with 1.6.0, there are
>>>> several corner cases that users are hitting with the Dataset API that are
>>>> fixed in branch-1.6.  As such I'm considering a 1.6.1 release.
>>>>
>>>> At the moment there are only two critical issues targeted for 1.6.1:
>>>>  - SPARK-12624 - When schema is specified, we should treat undeclared
>>>> fields as null (in Python)
>>>>  - SPARK-12783 - Dataset map serialization error
>>>>
>>>> When these are resolved I'll likely begin the release process.  If
>>>> there are any other issues that we should wait for please contact me.
>>>>
>>>> Michael
>>>>
>>>
>>>
>


Re: Some spark apps fail with "All masters are unresponsive", while others pass normally

2015-11-09 Thread Romi Kuntsman
If they had a problem managing memory, wouldn't there be an OOM?
Why does AppClient throw an NPE?

*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com

On Mon, Nov 9, 2015 at 4:59 PM, Akhil Das <ak...@sigmoidanalytics.com>
wrote:

> Is that all you have in the executor logs? I suspect some of those jobs
> are having a hard time managing  the memory.
>
> Thanks
> Best Regards
>
> On Sun, Nov 1, 2015 at 9:38 PM, Romi Kuntsman <r...@totango.com> wrote:
>
>> [adding dev list since it's probably a bug, but i'm not sure how to
>> reproduce so I can open a bug about it]
>>
>> Hi,
>>
>> I have a standalone Spark 1.4.0 cluster with 100s of applications running
>> every day.
>>
>> From time to time, the applications crash with the following error (see
>> below)
>> But at the same time (and also after that), other applications are
>> running, so I can safely assume the master and workers are working.
>>
>> 1. why is there a NullPointerException? (i can't track the scala stack
>> trace to the code, but anyway NPE is usually a obvious bug even if there's
>> actually a network error...)
>> 2. why can't it connect to the master? (if it's a network timeout, how to
>> increase it? i see the values are hardcoded inside AppClient)
>> 3. how to recover from this error?
>>
>>
>>   ERROR 01-11 15:32:54,991SparkDeploySchedulerBackend - Application
>> has been killed. Reason: All masters are unresponsive! Giving up. ERROR
>>   ERROR 01-11 15:32:55,087  OneForOneStrategy - ERROR
>> logs/error.log
>>   java.lang.NullPointerException NullPointerException
>>   at
>> org.apache.spark.deploy.client.AppClient$ClientActor$$anonfun$receiveWithLogging$1.applyOrElse(AppClient.scala:160)
>>   at
>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
>>   at
>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
>>   at
>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
>>   at
>> org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59)
>>   at
>> org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
>>   at
>> scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
>>   at
>> org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
>>   at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
>>   at
>> org.apache.spark.deploy.client.AppClient$ClientActor.aroundReceive(AppClient.scala:61)
>>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>>   at akka.actor.ActorCell.invoke(ActorCell.scala:487)
>>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
>>   at akka.dispatch.Mailbox.run(Mailbox.scala:220)
>>   at
>> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
>>   at
>> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>   at
>> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>   at
>> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>   at
>> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>   ERROR 01-11 15:32:55,603   SparkContext - Error
>> initializing SparkContext. ERROR
>>   java.lang.IllegalStateException: Cannot call methods on a stopped
>> SparkContext
>>   at org.apache.spark.SparkContext.org
>> $apache$spark$SparkContext$$assertNotStopped(SparkContext.scala:103)
>>   at
>> org.apache.spark.SparkContext.getSchedulingMode(SparkContext.scala:1501)
>>   at
>> org.apache.spark.SparkContext.postEnvironmentUpdate(SparkContext.scala:2005)
>>   at org.apache.spark.SparkContext.(SparkContext.scala:543)
>>   at
>> org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:61)
>>
>>
>> Thanks!
>>
>> *Romi Kuntsman*, *Big Data Engineer*
>> http://www.totango.com
>>
>
>


Re: Some spark apps fail with "All masters are unresponsive", while others pass normally

2015-11-09 Thread Romi Kuntsman
I didn't see anything about an OOM.
This sometimes happens before anything in the application has run, and it
hits a few applications at the same time - so I guess it's a communication
failure. The problem is that the error shown doesn't represent the actual
problem (which may be a network timeout, etc.).

*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com

On Mon, Nov 9, 2015 at 6:00 PM, Akhil Das <ak...@sigmoidanalytics.com>
wrote:

> Did you find anything regarding the OOM in the executor logs?
>
> Thanks
> Best Regards
>
> On Mon, Nov 9, 2015 at 8:44 PM, Romi Kuntsman <r...@totango.com> wrote:
>
>> If they have a problem managing memory, wouldn't there should be a OOM?
>> Why does AppClient throw a NPE?
>>
>> *Romi Kuntsman*, *Big Data Engineer*
>> http://www.totango.com
>>
>> On Mon, Nov 9, 2015 at 4:59 PM, Akhil Das <ak...@sigmoidanalytics.com>
>> wrote:
>>
>>> Is that all you have in the executor logs? I suspect some of those jobs
>>> are having a hard time managing  the memory.
>>>
>>> Thanks
>>> Best Regards
>>>
>>> On Sun, Nov 1, 2015 at 9:38 PM, Romi Kuntsman <r...@totango.com> wrote:
>>>
>>>> [adding dev list since it's probably a bug, but i'm not sure how to
>>>> reproduce so I can open a bug about it]
>>>>
>>>> Hi,
>>>>
>>>> I have a standalone Spark 1.4.0 cluster with 100s of applications
>>>> running every day.
>>>>
>>>> From time to time, the applications crash with the following error (see
>>>> below)
>>>> But at the same time (and also after that), other applications are
>>>> running, so I can safely assume the master and workers are working.
>>>>
>>>> 1. why is there a NullPointerException? (i can't track the scala stack
>>>> trace to the code, but anyway NPE is usually a obvious bug even if there's
>>>> actually a network error...)
>>>> 2. why can't it connect to the master? (if it's a network timeout, how
>>>> to increase it? i see the values are hardcoded inside AppClient)
>>>> 3. how to recover from this error?
>>>>
>>>>
>>>>   ERROR 01-11 15:32:54,991SparkDeploySchedulerBackend - Application
>>>> has been killed. Reason: All masters are unresponsive! Giving up. ERROR
>>>>   ERROR 01-11 15:32:55,087  OneForOneStrategy - ERROR
>>>> logs/error.log
>>>>   java.lang.NullPointerException NullPointerException
>>>>   at
>>>> org.apache.spark.deploy.client.AppClient$ClientActor$$anonfun$receiveWithLogging$1.applyOrElse(AppClient.scala:160)
>>>>   at
>>>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
>>>>   at
>>>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
>>>>   at
>>>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
>>>>   at
>>>> org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59)
>>>>   at
>>>> org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
>>>>   at
>>>> scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
>>>>   at
>>>> org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
>>>>   at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
>>>>   at
>>>> org.apache.spark.deploy.client.AppClient$ClientActor.aroundReceive(AppClient.scala:61)
>>>>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>>>>   at akka.actor.ActorCell.invoke(ActorCell.scala:487)
>>>>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
>>>>   at akka.dispatch.Mailbox.run(Mailbox.scala:220)
>>>>   at
>>>> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
>>>>   at
>>>> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>>>   at
>>>> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>>>   at
>>>> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>>>   at
>>>> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>>>   ERROR 01-11 15:32:55,603   SparkContext - Error
>>>> initializing SparkContext. ERROR
>>>>   java.lang.IllegalStateException: Cannot call methods on a stopped
>>>> SparkContext
>>>>   at org.apache.spark.SparkContext.org
>>>> $apache$spark$SparkContext$$assertNotStopped(SparkContext.scala:103)
>>>>   at
>>>> org.apache.spark.SparkContext.getSchedulingMode(SparkContext.scala:1501)
>>>>   at
>>>> org.apache.spark.SparkContext.postEnvironmentUpdate(SparkContext.scala:2005)
>>>>   at org.apache.spark.SparkContext.(SparkContext.scala:543)
>>>>   at
>>>> org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:61)
>>>>
>>>>
>>>> Thanks!
>>>>
>>>> *Romi Kuntsman*, *Big Data Engineer*
>>>> http://www.totango.com
>>>>
>>>
>>>
>>
>


Re: Ready to talk about Spark 2.0?

2015-11-08 Thread Romi Kuntsman
A major release usually means giving up on some API backward compatibility,
right? Could this be used as a chance to merge efforts with Apache Flink
(https://flink.apache.org/) and create the one ultimate open source big data
processing system?
Spark currently feels like it was made for interactive use (like Python and
R), and when used for other workloads (batch/streaming) it feels like scripted
interactive use rather than a complete standalone app. Maybe some base
concepts could be adapted?

(I'm not currently a committer, but as a heavy Spark user I'd love to
participate in the discussion of what can/should be in Spark 2.0)

*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com

On Fri, Nov 6, 2015 at 2:53 PM, Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:

> Hi Sean,
>
> Happy to see this discussion.
>
> I'm working on a PoC to run Camel on Spark Streaming. The purpose is to have
> an ingestion and integration platform running directly on Spark Streaming.
>
> Basically, we would be able to use a Camel Spark DSL like:
>
>
> from("jms:queue:foo").choice().when(predicate).to("job:bar").when(predicate).to("hdfs:path").otherwise("file:path")
>
> Before a formal proposal (I have to do more work there), I'm just
> wondering if such framework can be a new Spark module (Spark Integration
> for instance, like Spark ML, Spark Stream, etc).
>
> Maybe it could be a good candidate for an addition in a "major" release
> like Spark 2.0.
>
> Just my $0.01 ;)
>
> Regards
> JB
>
>
> On 11/06/2015 01:44 PM, Sean Owen wrote:
>
>> Since branch-1.6 is cut, I was going to make version 1.7.0 in JIRA.
>> However I've had a few side conversations recently about Spark 2.0, and
>> I know I and others have a number of ideas about it already.
>>
>> I'll go ahead and make 1.7.0, but thought I'd ask, how much other
>> interest is there in starting to plan Spark 2.0? is that even on the
>> table as the next release after 1.6?
>>
>> Sean
>>
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
>
>


Re: Ready to talk about Spark 2.0?

2015-11-08 Thread Romi Kuntsman
Hi, thanks for the feedback
I'll try to explain better what I meant.

First we had RDDs, then we had DataFrames, so could the next step be
something like stored procedures over DataFrames?
So I define the whole calculation flow, even if it includes any "actions"
in between, and the whole thing is planned and executed in a super
optimized way once I tell it "go!"

What I mean by "feels like scripted" is that the results of actions come back
to the driver, like they would if you were sitting at a command prompt.
But often the flow contains many steps with actions in between - multiple
levels of aggregation, iterative machine learning algorithms, etc.
Sending the whole "workplan" to the Spark framework would be, as I see it,
the next step of its evolution, just like stored procedures send logic with
many SQL queries to the database.

Was it more clear this time? :)


*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com

On Sun, Nov 8, 2015 at 5:59 PM, Koert Kuipers <ko...@tresata.com> wrote:

> romi,
> unless am i misunderstanding your suggestion you might be interested in
> projects like the new mahout where they try to abstract out the engine with
> bindings, so that they can support multiple engines within a single
> platform. I guess cascading is heading in a similar direction (although no
> spark or flink yet there, just mr1 and tez).
>
> On Sun, Nov 8, 2015 at 6:33 AM, Sean Owen <so...@cloudera.com> wrote:
>
>> Major releases can change APIs, yes. Although Flink is pretty similar
>> in broad design and goals, the APIs are quite different in
>> particulars. Speaking for myself, I can't imagine merging them, as it
>> would either mean significantly changing Spark APIs, or making Flink
>> use Spark APIs. It would mean effectively removing one project which
>> seems infeasible.
>>
>> I am not sure of what you're saying the difference is, but I would not
>> describe Spark as primarily for interactive use.
>>
>> Philosophically, I don't think One Big System to Rule Them All is a
>> good goal. One project will never get it all right even within one
>> niche. It's actually valuable to have many takes on important
>> problems. Hence any problem worth solving gets solved 10 times. Just
>> look at all those SQL engines and logging frameworks...
>>
>> On Sun, Nov 8, 2015 at 10:53 AM, Romi Kuntsman <r...@totango.com> wrote:
>> > A major release usually means giving up on some API backward
>> compatibility?
>> > Can this be used as a chance to merge efforts with Apache Flink
>> > (https://flink.apache.org/) and create the one ultimate open source
>> big data
>> > processing system?
>> > Spark currently feels like it was made for interactive use (like Python
>> and
>> > R), and when used others (batch/streaming), it feels like scripted
>> > interactive instead of really a standalone complete app. Maybe some base
>> > concepts may be adapted?
>> >
>> > (I'm not currently a committer, but as a heavy Spark user I'd love to
>> > participate in the discussion of what can/should be in Spark 2.0)
>> >
>> > Romi Kuntsman, Big Data Engineer
>> > http://www.totango.com
>> >
>> > On Fri, Nov 6, 2015 at 2:53 PM, Jean-Baptiste Onofré <j...@nanthrax.net>
>> > wrote:
>> >>
>> >> Hi Sean,
>> >>
>> >> Happy to see this discussion.
>> >>
>> >> I'm working on PoC to run Camel on Spark Streaming. The purpose is to
>> have
>> >> an ingestion and integration platform directly running on Spark
>> Streaming.
>> >>
>> >> Basically, we would be able to use a Camel Spark DSL like:
>> >>
>> >>
>> >>
>> from("jms:queue:foo").choice().when(predicate).to("job:bar").when(predicate).to("hdfs:path").otherwise("file:path")
>> >>
>> >> Before a formal proposal (I have to do more work there), I'm just
>> >> wondering if such framework can be a new Spark module (Spark
>> Integration for
>> >> instance, like Spark ML, Spark Stream, etc).
>> >>
>> >> Maybe it could be a good candidate for an addition in a "major" release
>> >> like Spark 2.0.
>> >>
>> >> Just my $0.01 ;)
>> >>
>> >> Regards
>> >> JB
>> >>
>> >>
>> >> On 11/06/2015 01:44 PM, Sean Owen wrote:
>> >>>
>> >>> Since branch-1.6 is cut, I was going to make version 1.7.0 in JIRA.
>> >>> However I've had a few s

Re: Ready to talk about Spark 2.0?

2015-11-08 Thread Romi Kuntsman
Since it seems we have so much to say about Spark 2.0, the answer
to the question "ready to talk about Spark 2.0?" is yes.
But that doesn't mean the development of the 1.x branch is ready to stop or
that there shouldn't be a 1.7 release.

Regarding what should go into the next major version - obviously on the
technical level it's breaking API changes and perhaps some long-awaited
architectural refactoring.

But what I think should be the major change is on the conceptual side - the
realization that interactive, batch and streaming data flows work in
fundamentally different ways, and that building the framework around that will
benefit each of those flows (like events instead of micro-batches in
streaming, worker-side intermediate processing in batch, etc.).

So what is the best place to have a full Spark 2.0 discussion?

*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com

On Sun, Nov 8, 2015 at 10:10 PM, Mark Hamstra <m...@clearstorydata.com>
wrote:

> Yes, that's clearer -- at least to me.
>
> But before going any further, let me note that we are already sliding past
> Sean's opening question of "Should we start talking about Spark 2.0?" to
> actually start talking about Spark 2.0.  I'll try to keep the rest of this
> post at a higher- or meta-level in order to attempt to avoid a somewhat
> premature discussion of detailed 2.0 proposals, since I think that we do
> still need to answer Sean's question and a couple of related questions
> before really diving into the details of 2.0 planning.  The related
> questions that I am talking about are: Is Spark 1.x done except for
> bug-fixing? and What would definitely make us say that we must begin
> working on Spark 2.0?
>
> I'm not going to try to answer my own two questions even though I'm really
> interested in how others will answer them, but I will answer Sean's by
> saying that it is a good time to start talking about Spark 2.0 -- which is
> quite different from saying that we are close to an understanding of what
> will differentiate Spark 2.0 or when we want to deliver it.
>
> On the meta-2.0 discussion, I think that it is useful to break "Things
> that will be different in 2.0" into some distinct categories.  I see at
> least three such categories for openers, although the third will probably
> need to be broken down further.
>
> The first is the simplest, would take almost no time to complete, and
> would have minimal impact on current Spark users.  This is simply getting
> rid of everything that is already marked deprecated in Spark 1.x but that
> we haven't already gotten rid of because of our commitment to maintaining
> API stability within major versions.  There should be no need for
> discussion or apology before getting rid of what is already deprecated --
> it's just gone and it's time to move on.  Kind of a category-1.1 are parts
> of the current public API that are now marked as Experimental or
> Developer that should become part of the fully-supported public API in 2.0
> -- and there is room for debate here.
>
> The next category of things that will be different in 2.0 isn't a lot
> harder to implement, shouldn't take a lot of time to complete, but will
> have some impact on current Spark users.  I'm talking about areas in the
> current code that we know don't work the way we want them to and don't have
> the public API that we would like, but for which there aren't or can't be
> recommended alternatives yet, so the code isn't formally marked as
> deprecated.  Again, these are things that we haven't already changed mostly
> because of the need to maintain API stability in 1.x.  But because these
> haven't already been marked as deprecated, there is potential to catch
> existing Spark users by surprise when the API changes.  We don't guarantee
> API stability across major version number changes, so there isn't any
> reason why we can't make the changes we want, but we should start building
> up a comprehensive list of API changes that will occur in Spark 2.0 to at
> least minimize the amount of surprise for current Spark users.
>
> I don't already have anything like such a comprehensive list, but one
> example of the kind of thing that I am talking about is something that I've
> personally been looking at and regretting of late, and that's the
> complicated relationships among SparkListener, SQLListener, onJobEnd and
> onExecutionEnd.  A lot of this complication is because of the need to
> maintain the public API, so we end up with comments like this (
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SQLExecution.scala#L58):
> "Ideally, we need to make sure onExecutionEnd happens after onJobStart and
> onJobEnd.  However, onJobStart and onJobEnd run in the listen

Re: Getting Started

2015-11-02 Thread Romi Kuntsman
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com

On Fri, Oct 30, 2015 at 1:25 PM, Saurabh Shah <shahsaurabh0...@gmail.com>
wrote:

> Hello, my name is Saurabh Shah and I am a second year undergraduate
> student at DA-IICT, Gandhinagar, India. I have quite lately been
> contributing towards the open source organizations and I find your
> organization the most appropriate one to work on.
>
> I request you to please guide me through the installation of your codebase
> and how to get started to your organization.
>
>
> Thanking You,
>
> Saurabh Shah.
>


Some spark apps fail with "All masters are unresponsive", while others pass normally

2015-11-01 Thread Romi Kuntsman
[adding dev list since it's probably a bug, but i'm not sure how to
reproduce so I can open a bug about it]

Hi,

I have a standalone Spark 1.4.0 cluster with 100s of applications running
every day.

From time to time, the applications crash with the following error (see
below)
But at the same time (and also after that), other applications are running,
so I can safely assume the master and workers are working.

1. Why is there a NullPointerException? (I can't track the Scala stack
trace to the code, but an NPE is usually an obvious bug even if there's
actually a network error...)
2. Why can't it connect to the master? (If it's a network timeout, how do I
increase it? I see the values are hardcoded inside AppClient.)
3. How can I recover from this error?


  ERROR 01-11 15:32:54,991SparkDeploySchedulerBackend - Application has
been killed. Reason: All masters are unresponsive! Giving up. ERROR
  ERROR 01-11 15:32:55,087  OneForOneStrategy - ERROR
logs/error.log
  java.lang.NullPointerException NullPointerException
  at
org.apache.spark.deploy.client.AppClient$ClientActor$$anonfun$receiveWithLogging$1.applyOrElse(AppClient.scala:160)
  at
scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
  at
scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
  at
scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
  at
org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59)
  at
org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
  at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
  at
org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
  at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
  at
org.apache.spark.deploy.client.AppClient$ClientActor.aroundReceive(AppClient.scala:61)
  at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
  at akka.actor.ActorCell.invoke(ActorCell.scala:487)
  at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
  at akka.dispatch.Mailbox.run(Mailbox.scala:220)
  at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
  at
scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
  at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
  at
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
  at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
  ERROR 01-11 15:32:55,603   SparkContext - Error
initializing SparkContext. ERROR
  java.lang.IllegalStateException: Cannot call methods on a stopped
SparkContext
  at org.apache.spark.SparkContext.org
$apache$spark$SparkContext$$assertNotStopped(SparkContext.scala:103)
  at
org.apache.spark.SparkContext.getSchedulingMode(SparkContext.scala:1501)
  at
org.apache.spark.SparkContext.postEnvironmentUpdate(SparkContext.scala:2005)
  at org.apache.spark.SparkContext.(SparkContext.scala:543)
  at
org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:61)


Thanks!

*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com


Re: passing SparkContext as parameter

2015-09-21 Thread Romi Kuntsman
SparkContext is available on the driver, not on the executors.

To read from Cassandra, you can use something like this:
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md

*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com
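
A minimal sketch of the driver-side pattern from the connector docs linked above
(the connection host, keyspace and table names are hypothetical):

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("cassandra-read")
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(conf)

// cassandraTable is called on the driver to define the RDD; the actual reads then
// run on the executors, so the SparkContext itself never needs to reach executor code.
val rows = sc.cassandraTable("my_keyspace", "my_table")
println(rows.count())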

On Mon, Sep 21, 2015 at 2:27 PM, Priya Ch <learnings.chitt...@gmail.com>
wrote:

> Can I use this SparkContext on the executors?
> In my application, I have a scenario of reading from a DB for certain records
> in an RDD, hence I need the SparkContext to read from the DB (Cassandra in our
> case).
>
> If the SparkContext can't be sent to the executors, what is the workaround for
> this?
>
> On Mon, Sep 21, 2015 at 3:06 PM, Petr Novak <oss.mli...@gmail.com> wrote:
>
>> add @transient?
>>
>> On Mon, Sep 21, 2015 at 11:27 AM, Priya Ch <learnings.chitt...@gmail.com>
>> wrote:
>>
>>> Hello All,
>>>
>>> How can I pass the SparkContext as a parameter to a method in an object?
>>> Passing the SparkContext is giving me a TaskNotSerializable exception.
>>>
>>> How can I achieve this?
>>>
>>> Thanks,
>>> Padma Ch
>>>
>>
>>
>


Re: how to send additional configuration to the RDD after it was lazily created

2015-09-21 Thread Romi Kuntsman
What new information do you know after creating the RDD that you didn't
know at the time of its creation?
I think the whole point is that an RDD is immutable - you can't change it once
it has been created.
Perhaps you need to refactor your logic to know the parameters earlier, or
create a whole new RDD.

*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com
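
A minimal sketch of the "know the parameters earlier" refactoring (the input path
and threshold parameter are hypothetical):

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Build the lineage inside a function instead of up front, and call it only once
// the extra information is available - the parameter is then baked into the closures.
def buildPipeline(sc: SparkContext, inputPath: String, threshold: Int): RDD[String] =
  sc.textFile(inputPath)            // HadoopRDD under the hood
    .filter(_.length > threshold)   // the late-known parameter is used here

// later, when the value is finally known:
// val result = buildPipeline(sc, "hdfs:///data/in", threshold = 42).count()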

On Thu, Sep 17, 2015 at 10:07 AM, Gil Vernik <g...@il.ibm.com> wrote:

> Hi,
>
> I have the following case, which i am not sure how to resolve.
>
> My code uses HadoopRDD and creates various RDDs on top of it
> (MapPartitionsRDD, and so on )
> After all the RDDs have been lazily created, my code "knows" some new
> information, and I want the "compute" method of the HadoopRDD to be aware of it
> (at the point when the "compute" method is called).
> What is a possible way to send some additional information to the
> compute method of the HadoopRDD after this RDD has been lazily created?
> I tried to play with the configuration, e.g. performing set("test","111") in
> the code and modifying the compute method of HadoopRDD with get("test") - but
> it's not working, since SparkContext has only a clone of the
> configuration and it can't be modified at run time.
>
> Any thoughts how can i make it?
>
> Thanks
> Gil.


Re: One corrupt gzip in a directory of 100s

2015-04-01 Thread Romi Kuntsman
What about communication errors, as opposed to corrupted files - both when
reading input and when writing output?
We currently experience a failure of the entire process if the last stage
of writing the output (to Amazon S3) fails because of a very temporary DNS
resolution issue (easily resolved by retrying).

*Romi Kuntsman*, *Big Data Engineer*
 http://www.totango.com
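
A rough sketch of the kind of application-side retry wrapper implied above - this
is not an existing Spark option, and the retry count, back-off and output path are
arbitrary:

import org.apache.spark.rdd.RDD

def saveWithRetries(rdd: RDD[String], path: String, maxAttempts: Int = 3): Unit = {
  var attempt = 0
  var done = false
  while (!done) {
    attempt += 1
    try {
      rdd.saveAsTextFile(path)
      done = true
    } catch {
      case e: Exception if attempt < maxAttempts =>
        // transient failure (e.g. a momentary DNS glitch): back off briefly and retry.
        // A partially written output directory may also need cleaning up before the retry.
        Thread.sleep(5000)
    }
  }
}

// saveWithRetries(output, "s3n://my-bucket/results")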

On Wed, Apr 1, 2015 at 12:58 PM, Gil Vernik g...@il.ibm.com wrote:

 I actually saw the same issue, where we analyzed a container with a few
 hundred GBs of zipped files - one was corrupted and Spark exited with an
 exception on the entire job.
 I like SPARK-6593, since it can also cover additional cases, not just the
 case of corrupted zip files.



 From:   Dale Richardson dale...@hotmail.com
 To: dev@spark.apache.org dev@spark.apache.org
 Date:   29/03/2015 11:48 PM
 Subject:One corrupt gzip in a directory of 100s



 Recently had an incident reported to me where somebody was analysing a
 directory of gzipped log files, and was struggling to load them into spark
 because one of the files was corrupted - calling
 sc.textFile('hdfs:///logs/*.gz') caused an IOException on the particular
 executor that was reading that file, which caused the entire job to be
 cancelled after the retry count was exceeded, without any way of catching
 and recovering from the error.  While normally I think it is entirely
 appropriate to stop execution if something is wrong with your input,
 sometimes it is useful to analyse what you can get (as long as you are
 aware that input has been skipped), and treat corrupt files as acceptable
 losses.
 To cater for this particular case I've added SPARK-6593 (PR at
 https://github.com/apache/spark/pull/5250), which adds an option
 (spark.hadoop.ignoreInputErrors) to log exceptions raised by the hadoop
 Input format, but to continue on with the next task.
 Ideally in this case you would want to report the corrupt file paths back
 to the master so they could be dealt with in a particular way (eg moved to
 a separate directory), but that would require a public API
 change/addition. I was pondering on an addition to Spark's hadoop API that
 could report processing status back to the master via an optional
 accumulator that collects filepath/Option(exception message) tuples so the
 user has some idea of what files are being processed, and what files are
 being skipped.
 Regards, Dale.



Re: Spark client reconnect to driver in yarn-cluster deployment mode

2015-01-19 Thread Romi Kuntsman
In yarn-client mode it only controls the environment of the executor
launcher.

So you either use yarn-client mode, and then your app keeps running and
controlling the process; or you use yarn-cluster mode, and then you send a
jar to YARN, and that jar should have code to report the result back to you.

*Romi Kuntsman*, *Big Data Engineer*
 http://www.totango.com
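
A minimal sketch of the yarn-cluster pattern described above: the driver persists
its result somewhere the client can pick it up later (the paths are hypothetical;
a database or queue would work just as well):

import org.apache.spark.{SparkConf, SparkContext}

object ReportingJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("reporting-job"))

    val counts = sc.textFile("hdfs:///logs/input")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1L))
      .reduceByKey(_ + _)

    // In yarn-cluster mode the submitting client may already be gone, so instead of
    // collecting results to a driver console, write them to shared storage.
    counts.map { case (w, c) => s"$w\t$c" }.saveAsTextFile("hdfs:///results/reporting-job")
    sc.stop()
  }
}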

On Thu, Jan 15, 2015 at 1:52 PM, preeze etan...@gmail.com wrote:

 From the official spark documentation
 (http://spark.apache.org/docs/1.2.0/running-on-yarn.html):

 In yarn-cluster mode, the Spark driver runs inside an application master
 process which is managed by YARN on the cluster, and the client can go away
 after initiating the application.

 Is there any designed way that the client connects back to the driver
 (still
 running in YARN) for collecting results at a later stage?


