Would "alter table add column" be supported in the future?

2016-11-09 Thread
Hi,

I noticed that the “alter table add column” command is banned in Spark 2.0.

Are there any plans to support it in the future? (After all, it was supported in 
Spark 1.6.x.)
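
For concreteness, this is the kind of statement I mean (a rough sketch only; the 
table and column names are made up, run in a spark-shell with Hive support where 
sqlContext is predefined):

  sqlContext.sql("ALTER TABLE my_table ADD COLUMNS (new_col STRING)")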

Thanks.



Re: shuffle files not deleted after executor restarted

2016-09-02 Thread
Yeah, using an external shuffle service is a reasonable choice, but I think we 
would still face the same problem. We store shuffle files on SSDs for performance 
reasons, and once the shuffle files are no longer needed we want them deleted 
instead of taking up valuable SSD space.
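
Concretely, the kind of cleanup we have in mind is something like the rough 
sketch below, which keeps only the newest blockmgr-* directory (the path is made 
up, and it assumes the newest directory is the one still in use by the live 
executor):

  import java.io.File

  // Hypothetical executor work directory on the SSD.
  val localDir = new File("/data/ssd/spark/worker/app-XXXX/0")

  def deleteRecursively(f: File): Unit = {
    if (f.isDirectory) Option(f.listFiles()).getOrElse(Array.empty[File]).foreach(deleteRecursively)
    f.delete()
  }

  val blockMgrDirs = Option(localDir.listFiles())
    .getOrElse(Array.empty[File])
    .filter(d => d.isDirectory && d.getName.startsWith("blockmgr-"))
    .sortBy(_.lastModified())

  // Keep the most recently modified directory; delete the older ones.
  blockMgrDirs.dropRight(1).foreach(deleteRecursively)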

> On Sep 2, 2016, at 5:40 PM, Artur Sukhenko <artur.sukhe...@gmail.com> wrote:
> 
> Hi Yang,
> 
> Isn't the external shuffle service better for long-running applications? 
> "It runs as a standalone application and manages shuffle output files so they 
> are available for executors at all time"
> 
> It is described here:
> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-ExternalShuffleService.html
> 
> ---
> Artur
> 
> On Fri, Sep 2, 2016 at 12:30 PM 汪洋 <tiandiwo...@icloud.com> wrote:
> Thank you for your response. 
> 
> We are using Spark 1.6.2 in standalone deploy mode with dynamic allocation 
> disabled.
> 
> I have traced the code. IMHO, it seems this cleanup is not handled by the 
> shutdown hooks directly. The shutdown hooks only send an 
> “ExecutorStateChanged” message to the worker, and when the worker sees the 
> message, it cleans up the directory only if the application has finished. In 
> our case, the application has not finished (it is long running). The executor 
> exits due to some unknown error and is restarted by the worker right away. In 
> this scenario, those old directories are never deleted. 
> 
> If the application is still running, is it safe to delete the old “blockmgr” 
> directories, leaving only the newest one?
> 
> Our temporary workaround is to restart our application regularly, but we are 
> looking for a more elegant solution. 
> 
> Thanks.
> 
> Yang
> 
> 
>> On Sep 2, 2016, at 4:11 PM, Sun Rui <sunrise_...@163.com> wrote:
>> 
>> Hi,
>> Could you give more information about your Spark environment: cluster 
>> manager, Spark version, whether dynamic allocation is used, etc.?
>> 
>> Generally, executors delete the temporary directories for shuffle files on 
>> exit because JVM shutdown hooks are registered, unless they are killed 
>> brutally.
>> 
>> You can safely delete the directories once you are sure that the Spark 
>> applications related to them have finished. A crontab task can be used for 
>> automatic cleanup.
>> 
>>> On Sep 2, 2016, at 12:18, 汪洋 <tiandiwo...@icloud.com> wrote:
>>> 
>>> Hi all,
>>> 
>>> I discovered that sometimes an executor exits unexpectedly and, when it is 
>>> restarted, it creates another blockmgr directory without deleting the old 
>>> ones. Thus, for a long-running application, some shuffle files are never 
>>> cleaned up, and sometimes those files can take up the whole disk. 
>>> 
>>> Is there a way to clean up those unused files automatically? Or is it safe 
>>> to delete the old directories manually, leaving only the newest one?
>>> 
>>> Here is the executor’s local directory.
>>> 
>>> 
>>> Any advice on this?
>>> 
>>> Thanks.
>>> 
>>> Yang
>> 
>> 
>> 
>> 
> 
> -- 
> --
> Artur Sukhenko



Re: rdd.distinct with Partitioner

2016-06-08 Thread
Frankly speaking, I think reduceByKey with a Partitioner has the same problem and 
should not be exposed to public users either, because it is a little hard to 
fully understand how the partitioner behaves without looking at the actual 
code.  

And if there exists a basic contract for a Partitioner, maybe it should be stated 
explicitly in the documentation, if not enforced by code.
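
For example, here is a rough spark-shell sketch (toy data, sc predefined) of a 
Partitioner that violates that contract (equal keys may land in different 
partitions), so the reduceByKey-based distinct can return duplicates:

  import scala.util.Random
  import org.apache.spark.Partitioner

  // Breaks the usual contract: the same key can map to different partitions.
  class RandomPartitioner(parts: Int) extends Partitioner {
    def numPartitions: Int = parts
    def getPartition(key: Any): Int = Random.nextInt(parts)
  }

  val data = sc.parallelize(Seq(1, 1, 2, 2, 3, 3), 3)

  // Same shape as the distinct implementation quoted below, but with the broken
  // partitioner: equal keys from different map partitions may end up in
  // different reduce partitions and are never merged, so duplicates can survive.
  val result = data.map(x => (x, null))
    .reduceByKey(new RandomPartitioner(4), (x, _) => x)
    .map(_._1)
    .collect()

  println(result.mkString(", ")) // may print some values more than once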

However, I don’t feel strongly enough to argue about this issue beyond stating my 
concern. It will not cause too much trouble once users learn the semantics; it 
is just a judgment call for the API designer.


> On Jun 9, 2016, at 12:51 PM, Alexander Pivovarov <apivova...@gmail.com> wrote:
> 
> reduceByKey(randomPartitioner, (a, b) => a + b) also gives an incorrect result. 
> 
> Why does reduceByKey with a Partitioner exist then?
> 
> On Wed, Jun 8, 2016 at 9:22 PM, 汪洋 <tiandiwo...@icloud.com> wrote:
> Hi Alexander,
> 
> I don’t think it is guaranteed to be correct if an arbitrary Partitioner is 
> passed in.
> 
> I have created a notebook that you can check out: 
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/7973071962862063/2110745399505739/58107563000366/latest.html
> 
> Best regards,
> 
> Yang
> 
> 
>> On Jun 9, 2016, at 11:42 AM, Alexander Pivovarov <apivova...@gmail.com> wrote:
>> 
>> Most of the RDD methods that shuffle data take a Partitioner as a parameter.
>> 
>> But rdd.distinct does not have such a signature.
>> 
>> Should I open a PR for that?
>> 
>> /**
>>  * Return a new RDD containing the distinct elements in this RDD.
>>  */
>> def distinct(partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
>>   map(x => (x, null)).reduceByKey(partitioner, (x, y) => x).map(_._1)
>> }
> 
> 



Re: rdd.distinct with Partitioner

2016-06-08 Thread
Hi Alexander,

I don’t think it is guaranteed to be correct if an arbitrary Partitioner is 
passed in.

I have created a notebook that you can check out: 
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/7973071962862063/2110745399505739/58107563000366/latest.html

Best regards,

Yang


> On Jun 9, 2016, at 11:42 AM, Alexander Pivovarov wrote:
> 
> Most of the RDD methods that shuffle data take a Partitioner as a parameter.
> 
> But rdd.distinct does not have such a signature.
> 
> Should I open a PR for that?
> 
> /**
>  * Return a new RDD containing the distinct elements in this RDD.
>  */
> def distinct(partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
>   map(x => (x, null)).reduceByKey(partitioner, (x, y) => x).map(_._1)
> }



HiveContext.refreshTable() missing in spark 2.0

2016-05-13 Thread
Hi all,

I noticed that HiveContext used to have a refreshTable() method, but it is gone 
in branch-2.0. 

Was it dropped intentionally? If so, how do we achieve similar functionality?
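
For reference, the closest Spark 2.0 equivalent I have found so far seems to be 
the new Catalog API on SparkSession (a sketch only; the table name is made up, 
and spark is the SparkSession predefined in a 2.0 spark-shell):

  spark.catalog.refreshTable("my_table")

  // The SQL form should be equivalent:
  spark.sql("REFRESH TABLE my_table")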

Thanks.

Yang

TakeOrderedAndProject operator may cause an OOM

2016-02-03 Thread
Hi,

Currently, the TakeOrderedAndProject operator in Spark SQL uses RDD’s 
takeOrdered method. When a large limit is passed to the operator, however, it 
returns up to partitionNum * limit records to the driver, which may cause an 
OOM.
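
Here is a rough spark-shell sketch of the pattern (toy data and numbers; sc and 
sqlContext are the shell defaults):

  import sqlContext.implicits._

  sc.parallelize(1 to 1000000, 200).toDF("a").registerTempTable("t")

  // An ORDER BY with a large LIMIT is planned as TakeOrderedAndProject, which is
  // backed by RDD.takeOrdered: every partition keeps up to `limit` candidates,
  // so roughly numPartitions * limit rows can be shipped to the driver before
  // the final top-N is produced.
  val rows = sqlContext.sql("SELECT a FROM t ORDER BY a LIMIT 500000").collect()
  println(rows.length)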

Are there any plans in the community to address this problem? 


Thanks.


Yang

Using distinct count in over clause

2016-01-22 Thread
Hi,

Do we support distinct count in the OVER clause in Spark SQL? 

I ran a SQL statement like this:

select a, count(distinct b) over ( order by a rows between unbounded preceding 
and current row) from table limit 10

Currently, it returns an error saying: expression 'a' is neither present in the 
group by, nor is it an aggregate function. Add to group by or wrap in first() 
if you don't care which value you get.;

Yang



Re: Using distinct count in over clause

2016-01-22 Thread
I think it cannot be right.

> On Jan 22, 2016, at 4:53 PM, 汪洋 <tiandiwo...@icloud.com> wrote:
> 
> Hi,
> 
> Do we support distinct count in the OVER clause in Spark SQL? 
> 
> I ran a SQL statement like this:
> 
> select a, count(distinct b) over ( order by a rows between unbounded 
> preceding and current row) from table limit 10
> 
> Currently, it returns an error saying: expression 'a' is neither present in the 
> group by, nor is it an aggregate function. Add to group by or wrap in first() 
> if you don't care which value you get.;
> 
> Yang
> 



Re: problem with reading source code - pull out nondeterministic expressions

2015-12-31 Thread
I get it, thanks!

> On Dec 31, 2015, at 3:00 AM, Michael Armbrust <mich...@databricks.com> wrote:
> 
> The goal here is to ensure that the non-deterministic value is evaluated only 
> once, so the result won't change for a given row (i.e. when sorting).
> 
> On Tue, Dec 29, 2015 at 10:57 PM, 汪洋 <tiandiwo...@icloud.com> wrote:
> Hi fellas,
> I am new to Spark and have a newbie question. I am currently reading the 
> source code of the Spark SQL Catalyst analyzer, and I do not quite understand 
> the partial function in PullOutNondeterministic. What does “pull out” mean 
> here? Why do we have to do the “pulling out”?
> I would really appreciate it if somebody could explain it to me. 
> Thanks. 
> 
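
To make sure I understand, here is a rough sketch of what the rule does for a 
sort on rand() (the table name and the exact rewritten form are my guesses; run 
in spark-shell with sc and sqlContext predefined):

  import sqlContext.implicits._

  sc.parallelize(1 to 10).toDF("a").registerTempTable("t")

  // PullOutNondeterministic first "pulls out" rand() into its own projected
  // column (named something like _nondeterministic), sorts by that attribute,
  // and then projects the extra column away again. So rand() is evaluated
  // exactly once per row and the sort sees a stable value for that row.
  val df = sqlContext.sql("SELECT a FROM t ORDER BY rand()")
  df.explain(true) // the analyzed plan shows the extra Project around the Sort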



problem with reading source code - pull out nondeterministic expressions

2015-12-29 Thread
Hi fellas,
I am new to Spark and have a newbie question. I am currently reading the source 
code of the Spark SQL Catalyst analyzer, and I do not quite understand the 
partial function in PullOutNondeterministic. What does “pull out” mean here? Why 
do we have to do the “pulling out”?
I would really appreciate it if somebody could explain it to me. 
Thanks.