Re: Should python-2 be supported in Spark 3.0?

2019-05-31 Thread Felix Cheung
A subtle point, but someone might take

“We will drop Python 2 support in a future release in 2020”

to mean any (or the first) release in 2020, whereas the next statement indicates 
that patch releases are not included in the above. It might help to reorder the 
items or clarify the wording.



From: shane knapp 
Sent: Friday, May 31, 2019 7:38:10 PM
To: Denny Lee
Cc: Holden Karau; Bryan Cutler; Erik Erlandson; Felix Cheung; Mark Hamstra; 
Matei Zaharia; Reynold Xin; Sean Owen; Wenchen Fen; Xiangrui Meng; dev; user
Subject: Re: Should python-2 be supported in Spark 3.0?

+1000  ;)

On Sat, Jun 1, 2019 at 6:53 AM Denny Lee <denny.g@gmail.com> wrote:
+1

On Fri, May 31, 2019 at 17:58 Holden Karau <hol...@pigscanfly.ca> wrote:
+1

On Fri, May 31, 2019 at 5:41 PM Bryan Cutler <cutl...@gmail.com> wrote:
+1 and the draft sounds good

On Thu, May 30, 2019, 11:32 AM Xiangrui Meng <men...@gmail.com> wrote:
Here is the draft announcement:

===
Plan for dropping Python 2 support

As many of you already know, the Python core development team and many widely 
used Python packages like Pandas and NumPy will drop Python 2 support on or 
before 2020/01/01. Apache Spark has supported both Python 2 and 3 since the 
Spark 1.4 release in 2015. However, maintaining Python 2/3 compatibility is an 
increasing burden, and it essentially limits the use of Python 3 features in 
Spark. Given that the end of life (EOL) of Python 2 is approaching, we plan to 
eventually drop Python 2 support as well. The current plan is as follows:

* In the next major release in 2019, we will deprecate Python 2 support. 
PySpark users will see a deprecation warning if Python 2 is used. We will 
publish a migration guide for PySpark users to migrate to Python 3.
* We will drop Python 2 support in a future release in 2020, after Python 2 EOL 
on 2020/01/01. PySpark users will see an error if Python 2 is used.
* For releases that support Python 2, e.g., Spark 2.4, their patch releases 
will continue supporting Python 2. However, after Python 2 EOL, we might not 
take patches that are specific to Python 2.
===

Sean helped make a pass. If it looks good, I'm going to upload it to the Spark 
website and announce it here. Let me know if you think we should do a VOTE 
instead.
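
For illustration only, a minimal sketch of the kind of runtime check that could 
emit such a deprecation warning (the function name and message are hypothetical, 
not the actual PySpark implementation):

import sys
import warnings

def _warn_if_python2():
    # Hypothetical sketch: warn when PySpark starts under Python 2.
    if sys.version_info[0] < 3:
        warnings.warn(
            "Python 2 support is deprecated and will be removed in a future "
            "release; please migrate to Python 3.",
            DeprecationWarning)

_warn_if_python2()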

On Thu, May 30, 2019 at 9:21 AM Xiangrui Meng <men...@gmail.com> wrote:
I created https://issues.apache.org/jira/browse/SPARK-27884 to track the work.

On Thu, May 30, 2019 at 2:18 AM Felix Cheung <felixcheun...@hotmail.com> wrote:
We don’t usually reference a future release on the website

> Spark website and state that Python 2 is deprecated in Spark 3.0

I suspect people will then ask when Spark 3.0 is coming out. Might need to 
provide some clarity on that.

We can say the "next major release in 2019" instead of Spark 3.0. Spark 3.0 
timeline certainly requires a new thread to discuss.




From: Reynold Xin <r...@databricks.com>
Sent: Thursday, May 30, 2019 12:59:14 AM
To: shane knapp
Cc: Erik Erlandson; Mark Hamstra; Matei Zaharia; Sean Owen; Wenchen Fen; 
Xiangrui Meng; dev; user
Subject: Re: Should python-2 be supported in Spark 3.0?

+1 on Xiangrui’s plan.

On Thu, May 30, 2019 at 7:55 AM shane knapp <skn...@berkeley.edu> wrote:
> I don't have a good sense of the overhead of continuing to support
> Python 2; is it large enough to consider dropping it in Spark 3.0?

from the build/test side, it will actually be pretty easy to continue support 
for python2.7 for spark 2.x as the feature sets won't be expanding.

that being said, i will be cracking a bottle of champagne when i can delete all 
of the ansible and anaconda configs for python2.x.  :)

On the development side, in a future release that drops Python 2 support we can 
remove the code that maintains Python 2/3 compatibility and start using Python 
3-only features, which is also quite exciting.
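
As a hypothetical before/after illustration of that simplification (not actual 
Spark code):

# Before: Python 2/3 compatibility shims of the kind that could be deleted.
# from __future__ import print_function, unicode_literals
# string_types = (str,) if sys.version_info[0] >= 3 else (basestring,)

# After: Python 3-only style with type hints, keyword-only arguments, f-strings.
def describe(name: str, *, count: int = 0) -> str:
    return f"{name} has {count} entries"

print(describe("example", count=3))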


shane
--
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 

YouTube Live Streams: https://www.youtube.com/user/holdenkarau


--
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [pyspark 2.3+] Bucketing with sort - incremental data load?

2019-05-31 Thread Georg Heiler
Bucketing will only help you with joins, and these usually happen on a key.
You mentioned that there is no such key in your data. If you just want to
search through large quantities of data, sorting and partitioning by time is
what's left.
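
A minimal PySpark sketch of that approach, assuming a DataFrame with an 
event_ts timestamp column (the column names and paths here are hypothetical):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/data/raw")                   # hypothetical source path

(df.withColumn("event_date", F.to_date("event_ts"))    # derive a date column
   .repartition("event_date")                          # group rows by date before writing
   .sortWithinPartitions("event_ts")                   # sort within each output partition
   .write
   .partitionBy("event_date")                          # one directory per day
   .mode("append")
   .parquet("/data/by_date"))                          # hypothetical target path

Queries that filter on event_date can then prune whole directories, and the 
within-partition sort keeps related rows close together.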

Rishi Shah wrote on Sat, Jun 1, 2019 at 05:57:

> Thanks much for your input Gourav, Silvio.
>
> I have about 10TB of data, which gets stored daily. There's no qualifying
> column for partitioning, which makes querying this table super slow. So I
> wanted to sort the results before storing them daily. This is why I was
> thinking of using bucketing and sorting... Do you think sorting data based
> on a column or two before saving would help query performance on this
> table?
>
> My concern is that data will be sorted on a daily basis and not globally.
> Would that help with performance? I can compact files every month as well
> and sort before saving. I'm just not sure if this is going to help with
> performance issues on this table.
>
> Would be great to get your advice on this.
>
>
>
>
>
>
>
>
>
> On Fri, May 31, 2019 at 10:42 AM Silvio Fiorito <
> silvio.fior...@granturing.com> wrote:
>
>> Spark does allow appending new files to bucketed tables. When the data is
>> read in, Spark will combine the multiple files belonging to the same
>> buckets into the same partitions.
>>
>>
>>
>> Having said that, you need to be very careful with bucketing especially
>> as you’re appending to avoid generating lots of small files. So, you may
>> need to consider periodically running a compaction job.
>>
>>
>>
>> If you’re simply appending daily snapshots, then you could just consider
>> using date partitions, instead?
>>
>>
>>
>> *From: *Rishi Shah 
>> *Date: *Thursday, May 30, 2019 at 10:43 PM
>> *To: *"user @spark" 
>> *Subject: *[pyspark 2.3+] Bucketing with sort - incremental data load?
>>
>>
>>
>> Hi All,
>>
>>
>>
>> Can we use bucketing with sorting functionality to save data
>> incrementally (say, daily)? I understand bucketing is supported in Spark
>> only with saveAsTable; however, can this be used with mode "append" instead
>> of "overwrite"?
>>
>>
>>
>> My understanding around bucketing was that you need to rewrite the entire
>> table every time; can someone advise?
>>
>>
>>
>> --
>>
>> Regards,
>>
>>
>>
>> Rishi Shah
>>
>
>
> --
> Regards,
>
> Rishi Shah
>


Re: [pyspark 2.3+] Bucketing with sort - incremental data load?

2019-05-31 Thread Rishi Shah
Thanks much for your input Gourav, Silvio.

I have about 10TB of data, which gets stored daily. There's no qualifying
column for partitioning, which makes querying this table super slow. So I
wanted to sort the results before storing them daily. This is why I was
thinking of using bucketing and sorting... Do you think sorting data based
on a column or two before saving would help query performance on this
table?

My concern is that data will be sorted on a daily basis and not globally.
Would that help with performance? I can compact files every month as well
and sort before saving. I'm just not sure if this is going to help with
performance issues on this table.

Would be great to get your advice on this.









On Fri, May 31, 2019 at 10:42 AM Silvio Fiorito <
silvio.fior...@granturing.com> wrote:

> Spark does allow appending new files to bucketed tables. When the data is
> read in, Spark will combine the multiple files belonging to the same
> buckets into the same partitions.
>
>
>
> Having said that, you need to be very careful with bucketing especially as
> you’re appending to avoid generating lots of small files. So, you may need
> to consider periodically running a compaction job.
>
>
>
> If you’re simply appending daily snapshots, then you could just consider
> using date partitions, instead?
>
>
>
> *From: *Rishi Shah 
> *Date: *Thursday, May 30, 2019 at 10:43 PM
> *To: *"user @spark" 
> *Subject: *[pyspark 2.3+] Bucketing with sort - incremental data load?
>
>
>
> Hi All,
>
>
>
> Can we use bucketing with sorting functionality to save data incrementally
> (say, daily)? I understand bucketing is supported in Spark only with
> saveAsTable; however, can this be used with mode "append" instead of
> "overwrite"?
>
>
>
> My understanding around bucketing was that you need to rewrite the entire
> table every time; can someone advise?
>
>
>
> --
>
> Regards,
>
>
>
> Rishi Shah
>


-- 
Regards,

Rishi Shah


Re: Should python-2 be supported in Spark 3.0?

2019-05-31 Thread shane knapp
+1000  ;)

On Sat, Jun 1, 2019 at 6:53 AM Denny Lee  wrote:

> +1
>
> On Fri, May 31, 2019 at 17:58 Holden Karau  wrote:
>
>> +1
>>
>> On Fri, May 31, 2019 at 5:41 PM Bryan Cutler  wrote:
>>
>>> +1 and the draft sounds good
>>>
>>> On Thu, May 30, 2019, 11:32 AM Xiangrui Meng  wrote:
>>>
 Here is the draft announcement:

 ===
 Plan for dropping Python 2 support

 As many of you already know, the Python core development team and many
 widely used Python packages like Pandas and NumPy will drop Python 2 support
 on or before 2020/01/01. Apache Spark has supported both Python 2 and 3
 since the Spark 1.4 release in 2015. However, maintaining Python 2/3
 compatibility is an increasing burden, and it essentially limits the use of
 Python 3 features in Spark. Given that the end of life (EOL) of Python 2 is
 approaching, we plan to eventually drop Python 2 support as well. The current
 plan is as follows:

 * In the next major release in 2019, we will deprecate Python 2
 support. PySpark users will see a deprecation warning if Python 2 is used.
 We will publish a migration guide for PySpark users to migrate to Python 3.
 * We will drop Python 2 support in a future release in 2020, after
 Python 2 EOL on 2020/01/01. PySpark users will see an error if Python 2 is
 used.
 * For releases that support Python 2, e.g., Spark 2.4, their patch
 releases will continue supporting Python 2. However, after Python 2 EOL, we
 might not take patches that are specific to Python 2.
 ===

 Sean helped make a pass. If it looks good, I'm going to upload it to
 Spark website and announce it here. Let me know if you think we should do a
 VOTE instead.

 On Thu, May 30, 2019 at 9:21 AM Xiangrui Meng  wrote:

> I created https://issues.apache.org/jira/browse/SPARK-27884 to track
> the work.
>
> On Thu, May 30, 2019 at 2:18 AM Felix Cheung <
> felixcheun...@hotmail.com> wrote:
>
>> We don’t usually reference a future release on website
>>
>> > Spark website and state that Python 2 is deprecated in Spark 3.0
>>
>> I suspect people will then ask when is Spark 3.0 coming out then.
>> Might need to provide some clarity on that.
>>
>
> We can say the "next major release in 2019" instead of Spark 3.0.
> Spark 3.0 timeline certainly requires a new thread to discuss.
>
>
>>
>>
>> --
>> *From:* Reynold Xin 
>> *Sent:* Thursday, May 30, 2019 12:59:14 AM
>> *To:* shane knapp
>> *Cc:* Erik Erlandson; Mark Hamstra; Matei Zaharia; Sean Owen;
>> Wenchen Fen; Xiangrui Meng; dev; user
>> *Subject:* Re: Should python-2 be supported in Spark 3.0?
>>
>> +1 on Xiangrui’s plan.
>>
>> On Thu, May 30, 2019 at 7:55 AM shane knapp 
>> wrote:
>>
>>> I don't have a good sense of the overhead of continuing to support
 Python 2; is it large enough to consider dropping it in Spark 3.0?

 from the build/test side, it will actually be pretty easy to
>>> continue support for python2.7 for spark 2.x as the feature sets won't 
>>> be
>>> expanding.
>>>
>>
>>> that being said, i will be cracking a bottle of champagne when i can
>>> delete all of the ansible and anaconda configs for python2.x.  :)
>>>
>>
> On the development side, in a future release that drops Python 2
> support we can remove code that maintains python 2/3 compatibility and
> start using python 3 only features, which is also quite exciting.
>
>
>>
>>> shane
>>> --
>>> Shane Knapp
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>

-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: Should python-2 be supported in Spark 3.0?

2019-05-31 Thread Denny Lee
+1

On Fri, May 31, 2019 at 17:58 Holden Karau  wrote:

> +1
>
> On Fri, May 31, 2019 at 5:41 PM Bryan Cutler  wrote:
>
>> +1 and the draft sounds good
>>
>> On Thu, May 30, 2019, 11:32 AM Xiangrui Meng  wrote:
>>
>>> Here is the draft announcement:
>>>
>>> ===
>>> Plan for dropping Python 2 support
>>>
>>> As many of you already know, the Python core development team and many
>>> widely used Python packages like Pandas and NumPy will drop Python 2 support
>>> on or before 2020/01/01. Apache Spark has supported both Python 2 and 3
>>> since the Spark 1.4 release in 2015. However, maintaining Python 2/3
>>> compatibility is an increasing burden, and it essentially limits the use of
>>> Python 3 features in Spark. Given that the end of life (EOL) of Python 2 is
>>> approaching, we plan to eventually drop Python 2 support as well. The current
>>> plan is as follows:
>>>
>>> * In the next major release in 2019, we will deprecate Python 2 support.
>>> PySpark users will see a deprecation warning if Python 2 is used. We will
>>> publish a migration guide for PySpark users to migrate to Python 3.
>>> * We will drop Python 2 support in a future release in 2020, after
>>> Python 2 EOL on 2020/01/01. PySpark users will see an error if Python 2 is
>>> used.
>>> * For releases that support Python 2, e.g., Spark 2.4, their patch
>>> releases will continue supporting Python 2. However, after Python 2 EOL, we
>>> might not take patches that are specific to Python 2.
>>> ===
>>>
>>> Sean helped make a pass. If it looks good, I'm going to upload it to
>>> Spark website and announce it here. Let me know if you think we should do a
>>> VOTE instead.
>>>
>>> On Thu, May 30, 2019 at 9:21 AM Xiangrui Meng  wrote:
>>>
 I created https://issues.apache.org/jira/browse/SPARK-27884 to track
 the work.

 On Thu, May 30, 2019 at 2:18 AM Felix Cheung 
 wrote:

> We don’t usually reference a future release on website
>
> > Spark website and state that Python 2 is deprecated in Spark 3.0
>
> I suspect people will then ask when is Spark 3.0 coming out then.
> Might need to provide some clarity on that.
>

 We can say the "next major release in 2019" instead of Spark 3.0. Spark
 3.0 timeline certainly requires a new thread to discuss.


>
>
> --
> *From:* Reynold Xin 
> *Sent:* Thursday, May 30, 2019 12:59:14 AM
> *To:* shane knapp
> *Cc:* Erik Erlandson; Mark Hamstra; Matei Zaharia; Sean Owen; Wenchen
> Fen; Xiangrui Meng; dev; user
> *Subject:* Re: Should python-2 be supported in Spark 3.0?
>
> +1 on Xiangrui’s plan.
>
> On Thu, May 30, 2019 at 7:55 AM shane knapp 
> wrote:
>
>> I don't have a good sense of the overhead of continuing to support
>>> Python 2; is it large enough to consider dropping it in Spark 3.0?
>>>
>>> from the build/test side, it will actually be pretty easy to
>> continue support for python2.7 for spark 2.x as the feature sets won't be
>> expanding.
>>
>
>> that being said, i will be cracking a bottle of champagne when i can
>> delete all of the ansible and anaconda configs for python2.x.  :)
>>
>
 On the development side, in a future release that drops Python 2
 support we can remove code that maintains python 2/3 compatibility and
 start using python 3 only features, which is also quite exciting.


>
>> shane
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Re: Should python-2 be supported in Spark 3.0?

2019-05-31 Thread Holden Karau
+1

On Fri, May 31, 2019 at 5:41 PM Bryan Cutler  wrote:

> +1 and the draft sounds good
>
> On Thu, May 30, 2019, 11:32 AM Xiangrui Meng  wrote:
>
>> Here is the draft announcement:
>>
>> ===
>> Plan for dropping Python 2 support
>>
>> As many of you already know, the Python core development team and many
>> widely used Python packages like Pandas and NumPy will drop Python 2 support
>> on or before 2020/01/01. Apache Spark has supported both Python 2 and 3
>> since the Spark 1.4 release in 2015. However, maintaining Python 2/3
>> compatibility is an increasing burden, and it essentially limits the use of
>> Python 3 features in Spark. Given that the end of life (EOL) of Python 2 is
>> approaching, we plan to eventually drop Python 2 support as well. The current
>> plan is as follows:
>>
>> * In the next major release in 2019, we will deprecate Python 2 support.
>> PySpark users will see a deprecation warning if Python 2 is used. We will
>> publish a migration guide for PySpark users to migrate to Python 3.
>> * We will drop Python 2 support in a future release in 2020, after Python
>> 2 EOL on 2020/01/01. PySpark users will see an error if Python 2 is used.
>> * For releases that support Python 2, e.g., Spark 2.4, their patch
>> releases will continue supporting Python 2. However, after Python 2 EOL, we
>> might not take patches that are specific to Python 2.
>> ===
>>
>> Sean helped make a pass. If it looks good, I'm going to upload it to
>> Spark website and announce it here. Let me know if you think we should do a
>> VOTE instead.
>>
>> On Thu, May 30, 2019 at 9:21 AM Xiangrui Meng  wrote:
>>
>>> I created https://issues.apache.org/jira/browse/SPARK-27884 to track
>>> the work.
>>>
>>> On Thu, May 30, 2019 at 2:18 AM Felix Cheung 
>>> wrote:
>>>
 We don’t usually reference a future release on website

 > Spark website and state that Python 2 is deprecated in Spark 3.0

 I suspect people will then ask when is Spark 3.0 coming out then. Might
 need to provide some clarity on that.

>>>
>>> We can say the "next major release in 2019" instead of Spark 3.0. Spark
>>> 3.0 timeline certainly requires a new thread to discuss.
>>>
>>>


 --
 *From:* Reynold Xin 
 *Sent:* Thursday, May 30, 2019 12:59:14 AM
 *To:* shane knapp
 *Cc:* Erik Erlandson; Mark Hamstra; Matei Zaharia; Sean Owen; Wenchen
 Fen; Xiangrui Meng; dev; user
 *Subject:* Re: Should python-2 be supported in Spark 3.0?

 +1 on Xiangrui’s plan.

 On Thu, May 30, 2019 at 7:55 AM shane knapp 
 wrote:

> I don't have a good sense of the overhead of continuing to support
>> Python 2; is it large enough to consider dropping it in Spark 3.0?
>>
>> from the build/test side, it will actually be pretty easy to continue
> support for python2.7 for spark 2.x as the feature sets won't be 
> expanding.
>

> that being said, i will be cracking a bottle of champagne when i can
> delete all of the ansible and anaconda configs for python2.x.  :)
>

>>> On the development side, in a future release that drops Python 2 support
>>> we can remove code that maintains python 2/3 compatibility and start using
>>> python 3 only features, which is also quite exciting.
>>>
>>>

> shane
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Should python-2 be supported in Spark 3.0?

2019-05-31 Thread Bryan Cutler
+1 and the draft sounds good

On Thu, May 30, 2019, 11:32 AM Xiangrui Meng  wrote:

> Here is the draft announcement:
>
> ===
> Plan for dropping Python 2 support
>
> As many of you already know, the Python core development team and many
> widely used Python packages like Pandas and NumPy will drop Python 2 support
> on or before 2020/01/01. Apache Spark has supported both Python 2 and 3
> since the Spark 1.4 release in 2015. However, maintaining Python 2/3
> compatibility is an increasing burden, and it essentially limits the use of
> Python 3 features in Spark. Given that the end of life (EOL) of Python 2 is
> approaching, we plan to eventually drop Python 2 support as well. The current
> plan is as follows:
>
> * In the next major release in 2019, we will deprecate Python 2 support.
> PySpark users will see a deprecation warning if Python 2 is used. We will
> publish a migration guide for PySpark users to migrate to Python 3.
> * We will drop Python 2 support in a future release in 2020, after Python
> 2 EOL on 2020/01/01. PySpark users will see an error if Python 2 is used.
> * For releases that support Python 2, e.g., Spark 2.4, their patch
> releases will continue supporting Python 2. However, after Python 2 EOL, we
> might not take patches that are specific to Python 2.
> ===
>
> Sean helped make a pass. If it looks good, I'm going to upload it to Spark
> website and announce it here. Let me know if you think we should do a VOTE
> instead.
>
> On Thu, May 30, 2019 at 9:21 AM Xiangrui Meng  wrote:
>
>> I created https://issues.apache.org/jira/browse/SPARK-27884 to track the
>> work.
>>
>> On Thu, May 30, 2019 at 2:18 AM Felix Cheung 
>> wrote:
>>
>>> We don’t usually reference a future release on website
>>>
>>> > Spark website and state that Python 2 is deprecated in Spark 3.0
>>>
>>> I suspect people will then ask when is Spark 3.0 coming out then. Might
>>> need to provide some clarity on that.
>>>
>>
>> We can say the "next major release in 2019" instead of Spark 3.0. Spark
>> 3.0 timeline certainly requires a new thread to discuss.
>>
>>
>>>
>>>
>>> --
>>> *From:* Reynold Xin 
>>> *Sent:* Thursday, May 30, 2019 12:59:14 AM
>>> *To:* shane knapp
>>> *Cc:* Erik Erlandson; Mark Hamstra; Matei Zaharia; Sean Owen; Wenchen
>>> Fen; Xiangrui Meng; dev; user
>>> *Subject:* Re: Should python-2 be supported in Spark 3.0?
>>>
>>> +1 on Xiangrui’s plan.
>>>
>>> On Thu, May 30, 2019 at 7:55 AM shane knapp  wrote:
>>>
 I don't have a good sense of the overhead of continuing to support
> Python 2; is it large enough to consider dropping it in Spark 3.0?
>
> from the build/test side, it will actually be pretty easy to continue
 support for python2.7 for spark 2.x as the feature sets won't be expanding.

>>>
 that being said, i will be cracking a bottle of champagne when i can
 delete all of the ansible and anaconda configs for python2.x.  :)

>>>
>> On the development side, in a future release that drops Python 2 support
>> we can remove code that maintains python 2/3 compatibility and start using
>> python 3 only features, which is also quite exciting.
>>
>>
>>>
 shane
 --
 Shane Knapp
 UC Berkeley EECS Research / RISELab Staff Technical Lead
 https://rise.cs.berkeley.edu

>>>


Re: [pyspark 2.3+] Bucketing with sort - incremental data load?

2019-05-31 Thread Silvio Fiorito
Spark does allow appending new files to bucketed tables. When the data is read 
in, Spark will combine the multiple files belonging to the same buckets into 
the same partitions.

Having said that, you need to be very careful with bucketing especially as 
you’re appending to avoid generating lots of small files. So, you may need to 
consider periodically running a compaction job.

If you’re simply appending daily snapshots, then you could just consider using 
date partitions, instead?
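
For reference, a rough PySpark sketch of that pattern (the table name, bucket 
column, and bucket count below are made up; as noted in the original question, 
bucketing in Spark only works through saveAsTable):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
daily = spark.read.parquet("/data/daily_snapshot")   # hypothetical daily increment

(daily.write
      .bucketBy(64, "customer_id")    # hash rows into 64 buckets on the join/filter key
      .sortBy("customer_id")          # sort rows within each bucket file
      .mode("append")                 # append today's files to the existing table
      .format("parquet")
      .saveAsTable("analytics.events_bucketed"))

Each append adds a new set of files per bucket, which is why the periodic 
compaction mentioned above matters.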

From: Rishi Shah 
Date: Thursday, May 30, 2019 at 10:43 PM
To: "user @spark" 
Subject: [pyspark 2.3+] Bucketing with sort - incremental data load?

Hi All,

Can we use bucketing with sorting functionality to save data incrementally (say, 
daily)? I understand bucketing is supported in Spark only with saveAsTable; 
however, can this be used with mode "append" instead of "overwrite"?

My understanding around bucketing was that you need to rewrite the entire table 
every time; can someone advise?

--
Regards,

Rishi Shah


java.util.NoSuchElementException: Columns not found

2019-05-31 Thread Shyam P
Trying to save some sample data into a C* (Cassandra) table, I am getting the 
error below:

java.util.NoSuchElementException: Columns not found in table
abc.company_vals: companyId, companyName

Though I have all the columns and have re-checked them again and again, I 
don't see any issue with the columns.

I am using spark-cassandra-connector-2_11.jar and spark-sql version 2.4.1.
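
For context, a minimal sketch of the kind of write involved (keyspace and table 
are taken from the error message; the lowercase column names are an assumption, 
since Cassandra folds unquoted identifiers to lowercase and the connector matches 
DataFrame column names against the table's columns):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, "Acme")],
    ["companyid", "companyname"])    # assumed lowercase to match the C* schema

(df.write
   .format("org.apache.spark.sql.cassandra")
   .options(keyspace="abc", table="company_vals")
   .mode("append")
   .save())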

https://stackoverflow.com/questions/56393254/java-util-nosuchelementexception-columns-not-found-in-table-abc-company-vals-c


Any suggestions?

Thank you,
Shyam


Re: dynamic allocation in spark-shell

2019-05-31 Thread Deepak Sharma
You can start spark-shell with these properties:
--conf spark.dynamicAllocation.enabled=true
--conf spark.dynamicAllocation.initialExecutors=2
--conf spark.dynamicAllocation.minExecutors=2
--conf spark.dynamicAllocation.maxExecutors=5
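
Put together, an invocation along these lines should work (queue name and 
memory figures are placeholders taken from the original question; on YARN, 
dynamic allocation in Spark 2.x also generally needs the external shuffle 
service, and --num-executors can be dropped):

./spark/bin/spark-shell --master yarn --deploy-mode client \
  --driver-memory 10g --executor-memory 15g --executor-cores 4 \
  --queue myqueue \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.initialExecutors=2 \
  --conf spark.dynamicAllocation.minExecutors=2 \
  --conf spark.dynamicAllocation.maxExecutors=100

Idle executors are then released after spark.dynamicAllocation.executorIdleTimeout 
(60s by default) instead of being held until the shell exits.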

On Fri, May 31, 2019 at 5:30 AM Qian He  wrote:

> Sometimes it's convenient to start a spark-shell on cluster, like
> ./spark/bin/spark-shell --master yarn --deploy-mode client --num-executors
> 100 --executor-memory 15g --executor-cores 4 --driver-memory 10g --queue
> myqueue
> However, with command like this, those allocated resources will be
> occupied until the console exits.
>
> Just wondering if it is possible to start a spark-shell with
> dynamicAllocation enabled? If it is, how to specify the configs? Can anyone
> give a quick example? Thanks!
>


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Re: [pyspark 2.3+] Bucketing with sort - incremental data load?

2019-05-31 Thread Gourav Sengupta
Hi Rishi,

I think that if you are using sorting and then appending data locally, there
will be no need to bucket the data, and you are good with external tables that
way.

Regards,
Gourav

On Fri, May 31, 2019 at 3:43 AM Rishi Shah  wrote:

> Hi All,
>
> Can we use bucketing with sorting functionality to save data incrementally
> (say, daily)? I understand bucketing is supported in Spark only with
> saveAsTable; however, can this be used with mode "append" instead of
> "overwrite"?
>
> My understanding around bucketing was that you need to rewrite the entire
> table every time; can someone advise?
>
> --
> Regards,
>
> Rishi Shah
>