Re: [Vote] SPIP: Continuous Processing Mode for Structured Streaming

2017-11-02 Thread Sean Owen
+0 simply because I don't feel I know enough to have an opinion. I have no
reason to doubt the change though, from a skim through the doc.

On Wed, Nov 1, 2017 at 3:37 PM Reynold Xin  wrote:

> Earlier I sent out a discussion thread for CP in Structured Streaming:
>
> https://issues.apache.org/jira/browse/SPARK-20928
>
> It is meant to be a very small, surgical change to Structured Streaming to
> enable ultra-low latency. This is great timing because we are also
> designing and implementing data source API v2. If designed properly, we can
> have the same data source API working for both streaming and batch.
>
>
> Following the SPIP process, I'm putting this SPIP up for a vote.
>
> +1: Let's go ahead and design / implement the SPIP.
> +0: Don't really care.
> -1: I do not think this is a good idea for the following reasons.
>
>
>
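
For illustration, here is a minimal Scala sketch of the kind of user-facing API
such a continuous mode could expose. It is hypothetical relative to this vote,
though it matches the Trigger.Continuous form that later shipped in Spark 2.3;
the "rate" source and "console" sink are just illustrative built-ins.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object ContinuousSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("continuous-processing-sketch")
      .getOrCreate()

    // Built-in "rate" test source; any continuous-capable source would do.
    val stream = spark.readStream
      .format("rate")
      .load()

    // The only user-visible change versus micro-batch mode is the trigger:
    // the argument is a checkpoint interval, not a batch interval.
    val query = stream.writeStream
      .format("console")
      .trigger(Trigger.Continuous("1 second"))
      .start()

    query.awaitTermination()
  }
}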


RE: Kicking off the process around Spark 2.2.1

2017-11-02 Thread Kevin Grealish
Any update on the expected 2.2.1 (or 2.3.0) release process?

From: Felix Cheung [mailto:felixcheun...@hotmail.com]
Sent: Thursday, October 26, 2017 10:04 AM
To: Sean Owen ; Holden Karau 
Cc: dev@spark.apache.org
Subject: Re: Kicking off the process around Spark 2.2.1

Yes! I can take on RM for 2.2.1.

We are still working out what to do with the temp files created by Hive and
Java that cause the policy issue with CRAN, and will hopefully report back
shortly.


From: Sean Owen <so...@cloudera.com>
Sent: Wednesday, October 25, 2017 4:39:15 AM
To: Holden Karau
Cc: Felix Cheung; dev@spark.apache.org
Subject: Re: Kicking off the process around Spark 2.2.1

It would be reasonably consistent with the timing of other x.y.1 releases, and 
more release managers sounds useful, yeah.

Note also that in theory the code freeze for 2.3.0 starts in about 2 weeks.

On Wed, Oct 25, 2017 at 12:29 PM Holden Karau <hol...@pigscanfly.ca> wrote:
Now that Spark 2.1.2 is out, it seems like a good time to get started on the
Spark 2.2.1 release. There are some streaming fixes I'm aware of that would be
good to get into a release; is there anything else people are working on for
2.2.1 that we should be tracking?

To switch it up, I'd like to suggest Felix as the RM for this, since there are
also likely some R packaging changes to be included in the release. This also
gives us a chance to see if my updated release documentation is enough for a
new RM to get started from.

What do folks think?
--
Twitter: https://twitter.com/holdenkarau


Re: Kicking off the process around Spark 2.2.1

2017-11-02 Thread Felix Cheung
For 2.2.1, we are still working through a few bugs. Hopefully it won't be
long.



Re: Kicking off the process around Spark 2.2.1

2017-11-02 Thread Holden Karau
If it’s desired I’d be happy to start on 2.3 once 2.2.1 is finished.

--
Twitter: https://twitter.com/holdenkarau


Spark build is failing in AMPLab Jenkins

2017-11-02 Thread Pralabh Kumar
Hi Dev

Spark build is failing in Jenkins


https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83353/consoleFull


Python versions prior to 2.7 are not supported.
Build step 'Execute shell' marked build as failure
Archiving artifacts
Recording test results
ERROR: Step 'Publish JUnit test result report' failed: No test report
files were found. Configuration error?


Please help



Regards

Pralabh Kumar


Re: Kicking off the process around Spark 2.2.1

2017-11-02 Thread Felix Cheung
I think it would be great to set a feature freeze date for 2.3.0 first, as a
minor release. There are a few new things that would be good to have, and then
we will likely need time to stabilize before cutting RCs.


Re: Kicking off the process around Spark 2.2.1

2017-11-02 Thread Holden Karau
I'm fine with picking a feature freeze date, although then we should branch
close to that point. Is there interest in still seeing 2.3 try to go out
around the nominal schedule?

Personally, from a release standpoint, I'd rather see 2.2.1 go out first so
we don't end up with 2.3 potentially going out with fixes missing from 2.2.1
(especially with the CRAN stuff - it would be unfortunate to have 2.2.1
available on CRAN but not be able to provide 2.3 in the same manner).

-- 
Twitter: https://twitter.com/holdenkarau


Re: Kicking off the process around Spark 2.2.1

2017-11-02 Thread Sean Owen
The feature freeze is "mid November":
http://spark.apache.org/versioning-policy.html
Let's say... Nov 15? Anybody have a better date?

Although it'd be nice to get 2.2.1 out sooner rather than later in any event,
and it kind of makes sense for it to go out first, the two need not go in
order. It just might be distracting to deal with two at once.

(BTW there was still one outstanding issue from the last release:
https://issues.apache.org/jira/browse/SPARK-22401 )


Re: Kicking off the process around Spark 2.2.1

2017-11-02 Thread Reynold Xin
Why tie a maintenance release to a feature release? They are supposed to be
independent, and we should be able to make as many maintenance releases as
needed.


Re: Kicking off the process around Spark 2.2.1

2017-11-02 Thread Holden Karau
I agree, except in this case we probably want some of the fixes that are
going into the maintenance release to be present in the new feature release
(like the CRAN issue).

-- 
Twitter: https://twitter.com/holdenkarau


Joining 3 tables with 17 billion records

2017-11-02 Thread Chetan Khatri
Hello Spark Developers,

I have 3 tables that I am reading from HBase and want to join, then save the
result to an external Hive Parquet table. Currently my join is failing with a
container failure error.

1. Read table A from HBase with ~17 billion records.
2. Repartition table A on its primary key.
3. Create a temp view of the table A DataFrame.
4. Read table B from HBase with ~4 billion records.
5. Repartition table B on its primary key.
6. Create a temp view of the table B DataFrame.
7. Join the views of A and B and create DataFrame C.
8. Join DataFrame C with table D.
9. coalesce(20) to reduce the number of files created from the already
repartitioned DF.
10. Finally, store to the external Hive table, partitioned by skey.

Please share any suggestions or resources you come across on how to optimize
this.

Thanks
Chetan
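
For reference, a rough Scala sketch of the pipeline described in the steps
above. The readFromHBase helper and all table/column names (pk, skey,
analytics.table_c) are hypothetical; the actual read path depends on the
HBase connector in use.

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object JoinPipelineSketch {

  // Hypothetical helper: how tables A, B and D are actually read depends on
  // the HBase connector (e.g. SHC, hbase-spark, or a Hive external table).
  def readFromHBase(spark: SparkSession, table: String): DataFrame = ???

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("three-table-join")
      .enableHiveSupport()
      .getOrCreate()

    // Steps 1-6: read A (~17B rows) and B (~4B rows), repartition on the key.
    val a = readFromHBase(spark, "table_a").repartition(col("pk"))
    val b = readFromHBase(spark, "table_b").repartition(col("pk"))
    val d = readFromHBase(spark, "table_d")

    // Steps 7-8: join A with B, then join the result with D.
    val c = a.join(b, Seq("pk"))
    val joined = c.join(d, Seq("pk"))

    // Steps 9-10: cut the number of output files and write, partitioned by
    // skey. (For an existing external table, insertInto would typically be
    // used instead of saveAsTable.)
    joined.coalesce(20)
      .write
      .format("parquet")
      .partitionBy("skey")
      .saveAsTable("analytics.table_c")

    spark.stop()
  }
}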


Re: Joining 3 tables with 17 billion records

2017-11-02 Thread Jörn Franke
Hi,

Do you have a more detailed log/error message?
Also, can you please provide details on the tables (number of rows, columns,
size, etc.)?
Is this a one-time thing or something regular?
If it is a one-time thing, then I would tend more towards putting each table
into HDFS (Parquet or ORC) and then joining them.
What are the Hive and Spark versions?

Best regards


Re: Joining 3 tables with 17 billion records

2017-11-02 Thread Chetan Khatri
Jörn,

This is a kind of one-time load from historical data into an analytical Hive
engine. The Hive version is 1.2.1 and the Spark version is 2.0.1, with the
MapR distribution.

Writing every table to Parquet and reading it back could be very time
consuming; currently the entire job can take ~8 hours on an 8-node cluster
with 100 GB of RAM and 20 cores per node, which is used not only by me but
by a larger team.

Thanks


Re: Joining 3 tables with 17 billion records

2017-11-02 Thread Jörn Franke
Well, this sounds like a lot for "only" 17 billion rows. However, you can
limit the resources of the job so it does not need to take all of them (it
might just run a little longer).
Alternatively, did you try using the HBase tables directly in Hive as external
tables and doing a simple CTAS? This works better if Hive is on Tez, but it
might also be worth a try with MR as the engine.
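
For illustration, a sketch of the CTAS route. Jörn's suggestion is to run the
equivalent statement in Hive itself (ideally on Tez); it is shown here via
spark.sql and assumes the HBase tables are already registered in the metastore
as external tables. All table and column names are hypothetical.

import org.apache.spark.sql.SparkSession

object CtasSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ctas-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Hive-style CTAS: materialize the three-way join straight into a
    // Parquet-backed table, skipping the intermediate DataFrame plumbing.
    spark.sql(
      """CREATE TABLE analytics.table_c
        |STORED AS PARQUET
        |AS
        |SELECT a.*, b.col_b, d.col_d
        |FROM hbase_ext_a a
        |JOIN hbase_ext_b b ON a.pk = b.pk
        |JOIN hbase_ext_d d ON a.pk = d.pk""".stripMargin)

    spark.stop()
  }
}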
