Re: welcome a new batch of committers

2018-10-03 Thread Ted Yu
Congratulations to all !
 Original message From: Jungtaek Lim  Date: 
10/3/18  2:41 AM  (GMT-08:00) To: Marco Gaido  Cc: dev 
 Subject: Re: welcome a new batch of committers 
Congrats all! You all deserved it.
On Wed, 3 Oct 2018 at 6:35 PM Marco Gaido  wrote:
Congrats you all!

On Wed, 3 Oct 2018 at 11:29, Liang-Chi Hsieh  wrote:


Congratulations to all new committers!

rxin wrote

> Hi all,

> 

> The Apache Spark PMC has recently voted to add several new committers to

> the project, for their contributions:

> 

> - Shane Knapp (contributor to infra)

> - Dongjoon Hyun (contributor to ORC support and other parts of Spark)

> - Kazuaki Ishizaki (contributor to Spark SQL)

> - Xingbo Jiang (contributor to Spark Core and SQL)

> - Yinan Li (contributor to Spark on Kubernetes)

> - Takeshi Yamamuro (contributor to Spark SQL)

> 

> Please join me in welcoming them!

--

Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/



-

To unsubscribe e-mail: dev-unsubscr...@spark.apache.org







Re: [VOTE] SPARK 2.4.0 (RC2)

2018-10-01 Thread Ted Yu
+1
 Original message From: Denny Lee  Date: 
9/30/18  10:30 PM  (GMT-08:00) To: Stavros Kontopoulos 
 Cc: Sean Owen , Wenchen 
Fan , dev  Subject: Re: [VOTE] SPARK 
2.4.0 (RC2) 
+1 (non-binding)


On Sat, Sep 29, 2018 at 10:24 AM Stavros Kontopoulos 
 wrote:
+1
Stavros
On Sat, Sep 29, 2018 at 5:59 AM, Sean Owen  wrote:
+1, with comments:



There are 5 critical issues for 2.4, and no blockers:

SPARK-25378 ArrayData.toArray(StringType) assume UTF8String in 2.4

SPARK-25325 ML, Graph 2.4 QA: Update user guide for new features & APIs

SPARK-25319 Spark MLlib, GraphX 2.4 QA umbrella

SPARK-25326 ML, Graph 2.4 QA: Programming guide update and migration guide

SPARK-25323 ML 2.4 QA: API: Python API coverage



Xiangrui, is SPARK-25378 important enough we need to get it into 2.4?



I found two issues resolved for 2.4.1 that got into this RC, so marked

them as resolved in 2.4.0.



I checked the licenses and notice and they look correct now in source

and binary builds.



The 2.12 artifacts are as I'd expect.



I ran all tests for 2.11 and 2.12 and they pass with -Pyarn

-Pkubernetes -Pmesos -Phive -Phadoop-2.7 -Pscala-2.12.

On Thu, Sep 27, 2018 at 10:00 PM Wenchen Fan  wrote:

>

> Please vote on releasing the following candidate as Apache Spark version 
> 2.4.0.

>

> The vote is open until October 1 PST and passes if a majority +1 PMC votes 
> are cast, with

> a minimum of 3 +1 votes.

>

> [ ] +1 Release this package as Apache Spark 2.4.0

> [ ] -1 Do not release this package because ...

>

> To learn more about Apache Spark, please see http://spark.apache.org/

>

> The tag to be voted on is v2.4.0-rc2 (commit 
> 42f25f309e91c8cde1814e3720099ac1e64783da):

> https://github.com/apache/spark/tree/v2.4.0-rc2

>

> The release files, including signatures, digests, etc. can be found at:

> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc2-bin/

>

> Signatures used for Spark RCs can be found in this file:

> https://dist.apache.org/repos/dist/dev/spark/KEYS

>

> The staging repository for this release can be found at:

> https://repository.apache.org/content/repositories/orgapachespark-1287

>

> The documentation corresponding to this release can be found at:

> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc2-docs/

>

> The list of bug fixes going into 2.4.0 can be found at the following URL:

> https://issues.apache.org/jira/projects/SPARK/versions/2.4.0

>

> FAQ

>

> =

> How can I help test this release?

> =

>

> If you are a Spark user, you can help us test this release by taking

> an existing Spark workload and running on this release candidate, then

> reporting any regressions.

>

> If you're working in PySpark, you can set up a virtual env, install the

> current RC, and see if anything important breaks. In Java/Scala, you can

> add the staging repository to your project's resolvers and test with the

> RC (make sure to clean up the artifact cache before/after so you don't

> end up building with an out-of-date RC going forward).

>
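For the Java/Scala path above, a minimal sbt sketch of pointing a build at this RC's staging repository (the dependency shown is just an example; the repository URL is the one listed in this vote thread):

    // build.sbt -- sketch only: resolve the staged 2.4.0 RC2 artifacts
    resolvers += "Apache Spark 2.4.0 RC2 staging" at
      "https://repository.apache.org/content/repositories/orgapachespark-1287"

    // then depend on the staged version and run your existing test suite
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.0"

Remember to clear the local artifact cache (e.g. ~/.ivy2/cache/org.apache.spark) before and after, as noted above.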

> ===

> What should happen to JIRA tickets still targeting 2.4.0?

> ===

>

> The current list of open tickets targeted at 2.4.0 can be found at:

> https://issues.apache.org/jira/projects/SPARK and search for "Target 
> Version/s" = 2.4.0

>

> Committers should look at those and triage. Extremely important bug

> fixes, documentation, and API tweaks that impact compatibility should

> be worked on immediately. Everything else please retarget to an

> appropriate release.

>

> ==

> But my bug isn't fixed?

> ==

>

> In order to make timely releases, we will typically not hold the

> release unless the bug in question is a regression from the previous

> release. That being said, if there is something which is a regression

> that has not been correctly targeted please ping me or a committer to

> help target the issue.



-

To unsubscribe e-mail: dev-unsubscr...@spark.apache.org


Re: from_csv

2018-09-19 Thread Ted Yu
+1
 Original message From: Dongjin Lee  Date: 
9/19/18  7:20 AM  (GMT-08:00) To: dev  Subject: Re: 
from_csv 
Another +1.
I already experienced this case several times.

On Mon, Sep 17, 2018 at 11:03 AM Hyukjin Kwon  wrote:
+1 for this idea since text parsing in CSV/JSON is quite common.
One thing to consider is schema inference, as with the JSON functionality. In the case
of JSON we added schema_of_json for it, and the same approach should apply
to CSV too.
If we see more need for it, we can consider a function like schema_of_csv
as well.

On Sun, Sep 16, 2018 at 4:41 PM, Maxim Gekk wrote:
Hi Reynold,
> i'd make this as consistent as to_json / from_json as possible
Sure, new function from_csv() has the same signature as from_json().
> how would this work in sql? i.e. how would passing options in work?
The options are passed to the function via a map, for example:
select from_csv('26/08/2015', 'time Timestamp', map('timestampFormat', 'dd/MM/'))
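For illustration, a minimal Scala sketch of the same call through the DataFrame API, assuming the from_csv(column, schema, options) shape proposed in the PR; the full timestamp pattern is filled in here as an assumption, since the archived message truncated it:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, from_csv}
    import org.apache.spark.sql.types.{StructField, StructType, TimestampType}

    object FromCsvSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[*]").appName("from_csv sketch").getOrCreate()
        import spark.implicits._

        // A CSV payload column sitting next to other columns (ids, timestamps, sources, ...)
        val df = Seq(("id-1", "26/08/2015")).toDF("id", "payload")

        // Parse the CSV column in place instead of extracting a Dataset[String] for DataFrameReader.csv()
        val schema = StructType(Seq(StructField("time", TimestampType)))
        val parsed = df.select(
          col("id"),
          from_csv(col("payload"), schema, Map("timestampFormat" -> "dd/MM/yyyy")).as("event"))

        parsed.show(false)
        spark.stop()
      }
    }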

On Sun, Sep 16, 2018 at 7:01 AM Reynold Xin  wrote:
makes sense - i'd make this as consistent as to_json / from_json as possible. 
how would this work in sql? i.e. how would passing options in work?
--excuse the brevity and lower case due to wrist injury

On Sat, Sep 15, 2018 at 2:58 AM Maxim Gekk  wrote:
Hi All,
I would like to propose a new function, from_csv(), for parsing columns containing
strings in CSV format. Here is my PR: https://github.com/apache/spark/pull/22379
A use case is loading a dataset from external storage, a DBMS, or a system like
Kafka, where CSV content was dumped as one of the columns/fields. Other columns
could contain related information like timestamps, ids, sources of the data,
etc. The column with CSV strings can be parsed by the existing csv() method of
DataFrameReader, but in that case we have to "clean up" the dataset and remove other
columns, since the csv() method requires a Dataset[String]. Joining the result of
parsing back to the original dataset by position is expensive and not convenient.
Instead, users parse CSV columns with string functions. That approach is usually
error-prone, especially for quoted values and other special cases.
The methods proposed in the PR should provide a better user experience for parsing
CSV-like columns. Please share your thoughts.
-- 

Maxim Gekk
Technical Solutions Lead, Databricks Inc. | maxim.g...@databricks.com | databricks.com






-- 
Dongjin Lee
A hitchhiker in the mathematical world.
github: github.com/dongjinleekr | linkedin: kr.linkedin.com/in/dongjinleekr | slideshare: www.slideshare.net/dongjinleekr


Re: Upgrade SBT to the latest

2018-08-31 Thread Ted Yu
+1
 Original message From: Sean Owen  Date: 
8/31/18  6:40 AM  (GMT-08:00) To: Darcy Shen  Cc: 
dev@spark.apache.org Subject: Re: Upgrade SBT to the latest 
Certainly worthwhile. I think this should target Spark 3, which should come 
after 2.4, which is itself already just about ready to test and release.
On Fri, Aug 31, 2018 at 8:16 AM Darcy Shen  wrote:

SBT 1.x has been ready for a long time.

We could spare some time to upgrade sbt for Spark.

An umbrella JIRA, like the one for Scala 2.12, should be created.





Re: SPIP: Executor Plugin (SPARK-24918)

2018-08-31 Thread Ted Yu
+1
 Original message From: Reynold Xin  Date: 
8/30/18  11:11 PM  (GMT-08:00) To: Felix Cheung  Cc: 
dev  Subject: Re: SPIP: Executor Plugin (SPARK-24918) 
I actually had a similar use case a while ago, but not entirely the same. In my 
use case, Spark is already up, but I want to make sure all existing (and new) 
executors run some specific code. Can we update the API to support that? I 
think that's doable if we split the design into two: one is the ability to do 
what I just mentioned, and second is the ability to register via config class 
when Spark starts to run the code.

On Thu, Aug 30, 2018 at 11:01 PM Felix Cheung  wrote:

+1



From: Mridul Muralidharan 

Sent: Wednesday, August 29, 2018 1:27:27 PM

To: dev@spark.apache.org

Subject: Re: SPIP: Executor Plugin (SPARK-24918)
 



+1

I left a couple of comments in NiharS's PR, but this is very useful to

have in Spark!



Regards,

Mridul

On Fri, Aug 3, 2018 at 10:00 AM Imran Rashid

 wrote:

>

> I'd like to propose adding a plugin API for executors, primarily for
> instrumentation and debugging
> (https://issues.apache.org/jira/browse/SPARK-24918). The changes are small,
> but as it's adding a new API, it might be SPIP-worthy. I mentioned it as well
> in a recent email I sent about memory monitoring.

>

> The SPIP proposal is here (and attached to the JIRA as well):
> https://docs.google.com/document/d/1a20gHGMyRbCM8aicvq4LhWfQmoA5cbHBQtyqIA2hgtc/edit?usp=sharing

> 

> There are already some comments on the JIRA and PR, and I hope to get more
> thoughts and opinions on it.

>

> thanks,

> Imran
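For concreteness, a hypothetical sketch of the kind of executor-side plugin the SPIP describes. The ExecutorPlugin interface and the spark.executor.plugins setting below follow the proposal and are assumptions here, not a confirmed API:

    // Hypothetical instrumentation plugin: polls executor heap usage in a daemon thread.
    class MemoryPollingPlugin extends org.apache.spark.ExecutorPlugin {
      private val poller = new java.util.Timer("executor-memory-poller", /* isDaemon = */ true)

      override def init(): Unit = {
        // Runs once per executor JVM when the executor starts.
        poller.scheduleAtFixedRate(new java.util.TimerTask {
          override def run(): Unit = {
            val used = Runtime.getRuntime.totalMemory() - Runtime.getRuntime.freeMemory()
            println(s"[plugin] executor heap in use: $used bytes")
          }
        }, 0L, 10000L)
      }

      override def shutdown(): Unit = poller.cancel()
    }

    // Enabled, per the proposal, with something like:
    //   spark-submit --conf spark.executor.plugins=com.example.MemoryPollingPlugin ...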



-

To unsubscribe e-mail: dev-unsubscr...@spark.apache.org


Re: Spark Kafka adapter questions

2018-08-20 Thread Ted Yu
n$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:189)
>
> at
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:172)
>
> at
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:172)
>
> at
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:379)
>
> at
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:60)
>
> at
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:172)
>
> at
> org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
>
> at
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:166)
>
> at
> org.apache.spark.sql.execution.streaming.StreamExecution.org
> $apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:293)
>
> at
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:203)
>
> 18/08/20 22:29:33 INFO AbstractCoordinator: Marking the coordinator
> :9093 (id: 2147483647 rack: null) dead for group
> spark-kafka-source-1aa50598-99d1-4c53-a73c-fa6637a219b2--1338794993-driver-0
>
> 18/08/20 22:29:34 INFO AbstractCoordinator: Discovered coordinator  FQDN>:9093 (id: 2147483647 rack: null) for group
> spark-kafka-source-1aa50598-99d1-4c53-a73c-fa6637a219b2--1338794993-driver-0.
>
> 
>
>
>
> Also, I’m not sure if it’s relevant but I am running on Databricks
> (currently working on running it on a local cluster to verify that it isn’t
> a Databricks issue). The only jars I’m using are the Spark-Kafka connector
> from github master on 8/8/18 and Kafka v2.0. Thanks so much for your help,
> let me know if there’s anything else I can provide
>
>
>
> Sincerely,
>
> Basil
>
>
>
> *From:* Ted Yu 
> *Sent:* Friday, August 17, 2018 4:20 PM
> *To:* basil.har...@microsoft.com.invalid
> *Cc:* dev 
> *Subject:* Re: Spark Kafka adapter questions
>
>
>
> If you have picked up all the changes for SPARK-18057, the Kafka “broker”
> supporting v1.0+ should be compatible with Spark's Kafka adapter.
>
>
>
> Can you post more details about the “failed to send SSL close message”
> errors ?
>
>
>
> (The default Kafka version is 2.0.0 in Spark Kafka adapter
> after SPARK-18057)
>
>
>
> Thanks
>
>
>
> On Fri, Aug 17, 2018 at 3:53 PM Basil Hariri <
> basil.har...@microsoft.com.invalid> wrote:
>
> Hi all,
>
>
>
> I work on Azure Event Hubs (Microsoft’s PaaS offering similar to Apache
> Kafka) and am trying to get our new Kafka head
> <https://azure.microsoft.com/en-us/blog/azure-event-hubs-for-kafka-ecosystems-in-public-preview/>
> to play nice with Spark’s Kafka adapter. The goal is for our Kafka endpoint
> to be completely compatible with Spark’s Kafka adapter, but I’m running
> into some issues that I think are related to versioning. I’ve been trying
> to tinker with the kafka-0-10-sql
> <https://github.com/apache/spark/tree/master/external/kafka-0-10-sql>
> and kafka-0-10
> <https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fspark%2Ftree%2Fmaster%2Fexternal%2Fkafka-0-10-sql=02%7C01%7CBasil.Hariri%40microsoft.com%7C9c2387763e53418d4b4e08d6049813a9%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636701448547303703=5H9%2FFGxz1VsL0OfWx7mrsQU2cGIR7zB3VuMADZop9RE%3D=0>
> adapters on Github and was wondering if someone could take a second to
> point me in the right direction with:
>
>
>
>1. What is the difference between those two adapters? My hunch is that
>kafka-0-10-sql supports structured streaming while kafka-0-10 still uses
>Spark streaming, but I haven’t found anything to verify that.
>2. Event Hubs’ Kafka endpoint only supports Kafka 1.0 and later, and
>the errors I get when trying to connect to Spark (“failed to send SSL close
>message” / broken pipe errors) have usually shown up when using Kafka v0.10
>applications with our endpoint. I built from source after I saw that both
>libraries were updated for Kafka 2.0 support (late last week), but I’m
>still running into the same issues. Do Spark’s Kafka adapters generally
>downgrade to Kafka v0.10 protocols? If not, is there any other reason to
>believe that a Kafka “broker” that doesn’t support v0.10 protocols but
>supports v1.0+ would be incompatible with Spark’s Kafka adapter?
>
>
>
> Thanks in advance, please let me know if there’s a different place I
> should be posting this
>
>
>
> Sincerely,
>
> Basil
>
>
>
>


Re: Spark Kafka adapter questions

2018-08-17 Thread Ted Yu
If you have picked up all the changes for SPARK-18057, the Kafka “broker”
supporting v1.0+ should be compatible with Spark's Kafka adapter.

Can you post more details about the “failed to send SSL close message”
errors ?

(The default Kafka version is 2.0.0 in Spark Kafka adapter after SPARK-18057)

Thanks

On Fri, Aug 17, 2018 at 3:53 PM Basil Hariri
 wrote:

> Hi all,
>
>
>
> I work on Azure Event Hubs (Microsoft’s PaaS offering similar to Apache
> Kafka) and am trying to get our new Kafka head
> 
> to play nice with Spark’s Kafka adapter. The goal is for our Kafka endpoint
> to be completely compatible with Spark’s Kafka adapter, but I’m running
> into some issues that I think are related to versioning. I’ve been trying
> to tinker with the kafka-0-10-sql
>  and
> kafka-0-10
> 
> adapters on Github and was wondering if someone could take a second to
> point me in the right direction with:
>
>
>
>1. What is the difference between those two adapters? My hunch is that
>kafka-0-10-sql supports structured streaming while kafka-0-10 still uses
>Spark streaming, but I haven’t found anything to verify that.
>2. Event Hubs’ Kafka endpoint only supports Kafka 1.0 and later, and
>the errors I get when trying to connect to Spark (“failed to send SSL close
>message” / broken pipe errors) have usually shown up when using Kafka v0.10
>applications with our endpoint. I built from source after I saw that both
>libraries were updated for Kafka 2.0 support (late last week), but I’m
>still running into the same issues. Do Spark’s Kafka adapters generally
>downgrade to Kafka v0.10 protocols? If not, is there any other reason to
>believe that a Kafka “broker” that doesn’t support v0.10 protocols but
>supports v1.0+ would be incompatible with Spark’s Kafka adapter?
>
>
>
> Thanks in advance, please let me know if there’s a different place I
> should be posting this
>
>
>
> Sincerely,
>
> Basil
>
>
>
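For reference on question (1): kafka-0-10-sql backs the Structured Streaming / SQL source, while kafka-0-10 provides the DStream API. A minimal Scala sketch of each, with broker addresses, topic and group names as placeholders:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

    val spark = SparkSession.builder().appName("kafka-adapters-sketch").getOrCreate()

    // kafka-0-10-sql: Structured Streaming source
    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9093")   // placeholder
      .option("subscribe", "events")                      // placeholder
      .load()

    // kafka-0-10: DStream-based direct stream
    val ssc = new StreamingContext(spark.sparkContext, Seconds(10))
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker:9093",               // placeholder
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "example-group")                      // placeholder
    val dstream = KafkaUtils.createDirectStream[String, String](
      ssc, LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams))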


Re: 回复: Welcome Zhenhua Wang as a Spark committer

2018-04-02 Thread Ted Yu
Congratulations, Zhenhua 
 Original message From: 雨中漫步 <601450...@qq.com> Date: 4/1/18  
11:30 PM  (GMT-08:00) To: Yuanjian Li , Wenchen Fan 
 Cc: dev  Subject: 回复: Welcome 
Zhenhua Wang as a Spark committer 
Congratulations Zhenhua Wang

-- Original message -- From: "Yuanjian
Li"; Sent: Monday, April 2, 2018, 2:26 PM; To: "Wenchen
Fan"; Cc: "Spark dev list"; Subject: Re:
Welcome Zhenhua Wang as a Spark committer
Congratulations Zhenhua!!

2018-04-02 13:28 GMT+08:00 Wenchen Fan :
Hi all,
The Spark PMC recently added Zhenhua Wang as a committer on the project.
Zhenhua is the major contributor to the CBO project, and has been contributing
across several areas of Spark for a while, focusing especially on the analyzer and
optimizer in Spark SQL. Please join me in welcoming Zhenhua!
Wenchen



Re: DataSourceV2 write input requirements

2018-03-30 Thread Ted Yu
+1
 Original message From: Ryan Blue <rb...@netflix.com> Date: 
3/30/18  2:28 PM  (GMT-08:00) To: Patrick Woody <patrick.woo...@gmail.com> Cc: 
Russell Spitzer <russell.spit...@gmail.com>, Wenchen Fan <cloud0...@gmail.com>, 
Ted Yu <yuzhih...@gmail.com>, Spark Dev List <dev@spark.apache.org> Subject: 
Re: DataSourceV2 write input requirements 
You're right. A global sort would change the clustering if it had more fields 
than the clustering.
Then what about this: if there is no RequiredClustering, then the sort is a 
global sort. If RequiredClustering is present, then the clustering is applied 
and the sort is a partition-level sort.
That rule would mean that within a partition you always get the sort, but an 
explicit clustering overrides the partitioning a sort might try to introduce. 
Does that sound reasonable?
rb
On Fri, Mar 30, 2018 at 12:39 PM, Patrick Woody <patrick.woo...@gmail.com> 
wrote:
Does that methodology work in this specific case? The ordering must be a subset 
of the clustering to guarantee they exist in the same partition when doing a 
global sort I thought. Though I get the gist that if it does satisfy, then 
there is no reason to not choose the global sort.

On Fri, Mar 30, 2018 at 1:31 PM, Ryan Blue <rb...@netflix.com> wrote:
> Can you expand on how the ordering containing the clustering expressions 
>would ensure the global sort?
The idea was to basically assume that if the clustering can be satisfied by a 
global sort, then do the global sort. For example, if the clustering is 
Set("b", "a") and the sort is Seq("a", "b", "c") then do a global sort by 
columns a, b, and c.
Technically, you could do this with a hash partitioner instead of a range 
partitioner and sort within each partition, but that doesn't make much sense 
because the partitioning would ensure that each partition has just one 
combination of the required clustering columns. Using a hash partitioner would 
make it so that the in-partition sort basically ignores the first few values, 
so it must be that the intent was a global sort.
On Fri, Mar 30, 2018 at 6:51 AM, Patrick Woody <patrick.woo...@gmail.com> wrote:
Right, you could use this to store a global ordering if there is only 
one write (e.g., CTAS). I don’t think anything needs to change in that 
case, you would still have a clustering and an ordering, but the 
ordering would need to include all fields of the clustering. A way to 
pass in the partition ordinal for the source to store would be required.
Can you expand on how the ordering containing the clustering expressions would 
ensure the global sort? Having an RangePartitioning would certainly satisfy, 
but it isn't required - is the suggestion that if Spark sees this overlap, then 
it plans a global sort?

On Thu, Mar 29, 2018 at 12:16 PM, Russell Spitzer <russell.spit...@gmail.com> 
wrote:
@RyanBlue I'm hoping that through the CBO effort we will continue to get more 
detailed statistics. Like on read we could be using sketch data structures to 
get estimates on unique values and density for each column. You may be right 
that the real way for this to be handled would be giving a "cost" back to a 
higher order optimizer which can decide which method to use rather than having 
the data source itself do it. This is probably in a far future version of the 
api.

On Thu, Mar 29, 2018 at 9:10 AM Ryan Blue <rb...@netflix.com> wrote:

Cassandra can insert records with the same partition-key faster if they arrive 
in the same payload. But this is only beneficial if the incoming dataset has 
multiple entries for the same partition key.

Thanks for the example, the recommended partitioning use case makes more sense 
now. I think we could have two interfaces, a RequiresClustering and a 
RecommendsClustering if we want to support this. But I’m skeptical it will be 
useful for two reasons:

Do we want to optimize the low cardinality case? Shuffles are usually much 
cheaper at smaller sizes, so I’m not sure it is necessary to optimize this away.
How do we know there isn’t just a few partition keys for all the records? It 
may look like a shuffle wouldn’t help, but we don’t know the partition keys 
until it is too late.

Then there’s also the logic for avoiding the shuffle and how to calculate the 
cost, which sounds like something that needs some details from CBO.

I would assume that given the estimated data size from Spark and options passed 
in from the user, the data source could make a more intelligent requirement on 
the write format than Spark independently. 

This is a good point.
What would an implementation actually do here and how would information be 
passed? For my use cases, the store would produce the number of tasks based on 
the estimated incoming rows, because the source has the best idea of how the 
rows will compress. But, that’s just applying a multipli

Re: DataSourceV2 write input requirements

2018-03-28 Thread Ted Yu
Wenchen Fan <cloud0...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Ryan, yea you are right that SupportsReportPartitioning doesn't
>>>>>>>>> expose hash function, so Join can't benefit from this interface, as 
>>>>>>>>> Join
>>>>>>>>> doesn't require a general ClusteredDistribution, but a more specific 
>>>>>>>>> one
>>>>>>>>> called HashClusteredDistribution.
>>>>>>>>>
>>>>>>>>> So currently only Aggregate can benefit from
>>>>>>>>> SupportsReportPartitioning and save shuffle. We can add a new 
>>>>>>>>> interface to
>>>>>>>>> expose the hash function to make it work for Join.
>>>>>>>>>
>>>>>>>>> On Tue, Mar 27, 2018 at 9:33 AM, Ryan Blue <rb...@netflix.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> I just took a look at SupportsReportPartitioning and I'm not sure
>>>>>>>>>> that it will work for real use cases. It doesn't specify, as far as 
>>>>>>>>>> I can
>>>>>>>>>> tell, a hash function for combining clusters into tasks or a way to 
>>>>>>>>>> provide
>>>>>>>>>> Spark a hash function for the other side of a join. It seems 
>>>>>>>>>> unlikely to me
>>>>>>>>>> that many data sources would have partitioning that happens to match 
>>>>>>>>>> the
>>>>>>>>>> other side of a join. And, it looks like task order matters? Maybe 
>>>>>>>>>> I'm
>>>>>>>>>> missing something?
>>>>>>>>>>
>>>>>>>>>> I think that we should design the write side independently based
>>>>>>>>>> on what data stores actually need, and take a look at the read side 
>>>>>>>>>> based
>>>>>>>>>> on what data stores can actually provide. Wenchen, was there a 
>>>>>>>>>> design doc
>>>>>>>>>> for partitioning on the read path?
>>>>>>>>>>
>>>>>>>>>> I completely agree with your point about a global sort. We
>>>>>>>>>> recommend to all of our data engineers to add a sort to most tables 
>>>>>>>>>> because
>>>>>>>>>> it introduces the range partitioner and does a skew calculation, in
>>>>>>>>>> addition to making data filtering much better when it is read. It's 
>>>>>>>>>> really
>>>>>>>>>> common for tables to be skewed by partition values.
>>>>>>>>>>
>>>>>>>>>> rb
>>>>>>>>>>
>>>>>>>>>> On Mon, Mar 26, 2018 at 7:59 PM, Patrick Woody <
>>>>>>>>>> patrick.woo...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hey Ryan, Ted, Wenchen
>>>>>>>>>>>
>>>>>>>>>>> Thanks for the quick replies.
>>>>>>>>>>>
>>>>>>>>>>> @Ryan - the sorting portion makes sense, but I think we'd have
>>>>>>>>>>> to ensure something similar to requiredChildDistribution in 
>>>>>>>>>>> SparkPlan where
>>>>>>>>>>> we have the number of partitions as well if we'd want to further 
>>>>>>>>>>> report to
>>>>>>>>>>> SupportsReportPartitioning, yeah?
>>>>>>>>>>>
>>>>>>>>>>> Specifying an explicit global sort can also be useful for
>>>>>>>>>>> filtering purposes on Parquet row group stats if we have a time 
>>>>>>>>>>> based/high
>>>>>>>>>>> cardinality ID field. If my datasource or catalog knows about 
>>>>>>>>>>> previous
>>>>>>>>>>> queries on a table, it could be really useful to recommend more 
>

Re: DataSourceV2 write input requirements

2018-03-26 Thread Ted Yu
Hmm. Ryan seems to be right.

Looking at
sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsReportPartitioning.java:

import org.apache.spark.sql.sources.v2.reader.partitioning.Partitioning;
...
  Partitioning outputPartitioning();
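For context, a small Scala sketch of the read-side interface mentioned above: a source reports its output partitioning, and Spark can skip a shuffle when that partitioning satisfies the required distribution (based on the Spark 2.3-era v2 reader API; treat the exact names as approximate):

    import org.apache.spark.sql.sources.v2.reader.partitioning.{ClusteredDistribution, Distribution, Partitioning}

    // Reports that the source's output is already clustered by a single key column,
    // so e.g. an aggregate on that column does not need another shuffle.
    class KeyClusteredPartitioning(column: String, partitions: Int) extends Partitioning {
      override def numPartitions(): Int = partitions

      override def satisfy(distribution: Distribution): Boolean = distribution match {
        case c: ClusteredDistribution => c.clusteredColumns.sameElements(Array(column))
        case _ => false
      }
    }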

On Mon, Mar 26, 2018 at 6:58 PM, Wenchen Fan <cloud0...@gmail.com> wrote:

> Actually clustering is already supported, please take a look at
> SupportsReportPartitioning
>
> Ordering is not proposed yet, might be similar to what Ryan proposed.
>
> On Mon, Mar 26, 2018 at 6:11 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> Interesting.
>>
>> Should requiredClustering return a Set of Expression's ?
>> This way, we can determine the order of Expression's by looking at what 
>> requiredOrdering()
>> returns.
>>
>> On Mon, Mar 26, 2018 at 5:45 PM, Ryan Blue <rb...@netflix.com.invalid>
>> wrote:
>>
>>> Hi Pat,
>>>
>>> Thanks for starting the discussion on this, we’re really interested in
>>> it as well. I don’t think there is a proposed API yet, but I was thinking
>>> something like this:
>>>
>>> interface RequiresClustering {
>>>   List requiredClustering();
>>> }
>>>
>>> interface RequiresSort {
>>>   List requiredOrdering();
>>> }
>>>
>>> The reason why RequiresClustering should provide Expression is that it
>>> needs to be able to customize the implementation. For example, writing to
>>> HTable would require building a key (or the data for a key) and that might
>>> use a hash function that differs from Spark’s built-ins. RequiresSort
>>> is fairly straightforward, but the interaction between the two requirements
>>> deserves some consideration. To make the two compatible, I think that
>>> RequiresSort must be interpreted as a sort within each partition of the
>>> clustering, but could possibly be used for a global sort when the two
>>> overlap.
>>>
>>> For example, if I have a table partitioned by “day” and “category” then
>>> the RequiredClustering would be by day, category. A required sort might
>>> be day ASC, category DESC, name ASC. Because that sort satisfies the
>>> required clustering, it could be used for a global ordering. But, is that
>>> useful? How would the global ordering matter beyond a sort within each
>>> partition, i.e., how would the partition’s place in the global ordering be
>>> passed?
>>>
>>> To your other questions, you might want to have a look at the recent
>>> SPIP I’m working on to consolidate and clean up logical plans
>>> <https://docs.google.com/document/d/1gYm5Ji2Mge3QBdOliFV5gSPTKlX4q1DCBXIkiyMv62A/edit?ts=5a987801#heading=h.m45webtwxf2d>.
>>> That proposes more specific uses for the DataSourceV2 API that should help
>>> clarify what validation needs to take place. As for custom catalyst rules,
>>> I’d like to hear about the use cases to see if we can build it into these
>>> improvements.
>>>
>>> rb
>>> ​
>>>
>>> On Mon, Mar 26, 2018 at 8:40 AM, Patrick Woody <patrick.woo...@gmail.com
>>> > wrote:
>>>
>>>> Hey all,
>>>>
>>>> I saw in some of the discussions around DataSourceV2 writes that we
>>>> might have the data source inform Spark of requirements for the input
>>>> data's ordering and partitioning. Has there been a proposed API for that
>>>> yet?
>>>>
>>>> Even one level up it would be helpful to understand how I should be
>>>> thinking about the responsibility of the data source writer, when I should
>>>> be inserting a custom catalyst rule, and how I should handle
>>>> validation/assumptions of the table before attempting the write.
>>>>
>>>> Thanks!
>>>> Pat
>>>>
>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>>
>


Re: DataSourceV2 write input requirements

2018-03-26 Thread Ted Yu
Interesting.

Should requiredClustering return a Set of Expression's ?
This way, we can determine the order of Expression's by looking at
what requiredOrdering()
returns.

On Mon, Mar 26, 2018 at 5:45 PM, Ryan Blue 
wrote:

> Hi Pat,
>
> Thanks for starting the discussion on this, we’re really interested in it
> as well. I don’t think there is a proposed API yet, but I was thinking
> something like this:
>
> interface RequiresClustering {
>   List requiredClustering();
> }
>
> interface RequiresSort {
>   List requiredOrdering();
> }
>
> The reason why RequiresClustering should provide Expression is that it
> needs to be able to customize the implementation. For example, writing to
> HTable would require building a key (or the data for a key) and that might
> use a hash function that differs from Spark’s built-ins. RequiresSort is
> fairly straightforward, but the interaction between the two requirements
> deserves some consideration. To make the two compatible, I think that
> RequiresSort must be interpreted as a sort within each partition of the
> clustering, but could possibly be used for a global sort when the two
> overlap.
>
> For example, if I have a table partitioned by “day” and “category” then
> the RequiredClustering would be by day, category. A required sort might
> be day ASC, category DESC, name ASC. Because that sort satisfies the
> required clustering, it could be used for a global ordering. But, is that
> useful? How would the global ordering matter beyond a sort within each
> partition, i.e., how would the partition’s place in the global ordering be
> passed?
>
> To your other questions, you might want to have a look at the recent SPIP
> I’m working on to consolidate and clean up logical plans
> .
> That proposes more specific uses for the DataSourceV2 API that should help
> clarify what validation needs to take place. As for custom catalyst rules,
> I’d like to hear about the use cases to see if we can build it into these
> improvements.
>
> rb
> ​
>
> On Mon, Mar 26, 2018 at 8:40 AM, Patrick Woody 
> wrote:
>
>> Hey all,
>>
>> I saw in some of the discussions around DataSourceV2 writes that we might
>> have the data source inform Spark of requirements for the input data's
>> ordering and partitioning. Has there been a proposed API for that yet?
>>
>> Even one level up it would be helpful to understand how I should be
>> thinking about the responsibility of the data source writer, when I should
>> be inserting a custom catalyst rule, and how I should handle
>> validation/assumptions of the table before attempting the write.
>>
>> Thanks!
>> Pat
>>
>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
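For concreteness, a hypothetical Scala sketch of how a sink might declare these requirements under the interfaces proposed in this thread. None of these types existed in Spark at the time; the trait shapes and the writer are purely illustrative:

    import org.apache.spark.sql.catalyst.expressions.Expression

    // Illustrative stand-ins for the proposed interfaces (shapes are assumptions).
    trait RequiresClustering { def requiredClustering(): Seq[Expression] }
    trait RequiresSort       { def requiredOrdering(): Seq[Expression] }

    // A hypothetical HTable-style writer: cluster by the row key so each task sees all
    // records for a key, and sort within each partition so puts arrive in key order.
    class HTableLikeWriter(rowKey: Expression, timestamp: Expression)
        extends RequiresClustering with RequiresSort {
      override def requiredClustering(): Seq[Expression] = Seq(rowKey)
      override def requiredOrdering(): Seq[Expression] = Seq(rowKey, timestamp)
    }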


Re: [VOTE] Spark 2.3.0 (RC1)

2018-01-16 Thread Ted Yu
Is there going to be another RC ?

With KafkaContinuousSourceSuite hanging, it is hard to get the rest of the
tests going.

Cheers

On Sat, Jan 13, 2018 at 7:29 AM, Sean Owen  wrote:

> The signatures and licenses look OK. Except for the missing k8s package,
> the contents look OK. Tests look pretty good with "-Phive -Phadoop-2.7
> -Pyarn" on Ubuntu 17.10, except that KafkaContinuousSourceSuite seems to
> hang forever. That was just fixed and needs to get into an RC?
>
> Aside from the Blockers just filed for R docs, etc., we have:
>
> Blocker:
> SPARK-23000 Flaky test suite DataSourceWithHiveMetastoreCatalogSuite in
> Spark 2.3
> SPARK-23020 Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.
> testInProcessLauncher
> SPARK-23051 job description in Spark UI is broken
>
> Critical:
> SPARK-22739 Additional Expression Support for Objects
>
> I actually don't think any of those Blockers should be Blockers; not sure
> if the last one is really critical either.
>
> I think this release will have to be re-rolled so I'd say -1 to RC1.
>
> On Fri, Jan 12, 2018 at 4:42 PM Sameer Agarwal 
> wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.3.0. The vote is open until Thursday January 18, 2018 at 8:00:00 am UTC
>> and passes if a majority of at least 3 PMC +1 votes are cast.
>>
>>
>> [ ] +1 Release this package as Apache Spark 2.3.0
>>
>> [ ] -1 Do not release this package because ...
>>
>>
>> To learn more about Apache Spark, please see https://spark.apache.org/
>>
>> The tag to be voted on is v2.3.0-rc1: https://github.com/apache/
>> spark/tree/v2.3.0-rc1 (964cc2e31b2862bca0bd968b3e9e2cbf8d3ba5ea)
>>
>> List of JIRA tickets resolved in this release can be found here:
>> https://issues.apache.org/jira/projects/SPARK/versions/12339551
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc1-bin/
>>
>> Release artifacts are signed with the following key:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1261/
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc1-
>> docs/_site/index.html
>>
>>
>> FAQ
>>
>> =
>> How can I help test this release?
>> =
>>
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark, you can set up a virtual env, install the
>> current RC, and see if anything important breaks. In Java/Scala, you can
>> add the staging repository to your project's resolvers and test with the RC
>> (make sure to clean up the artifact cache before/after so you don't end up
>> building with an out-of-date RC going forward).
>>
>> ===
>> What should happen to JIRA tickets still targeting 2.3.0?
>> ===
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should be
>> worked on immediately. Everything else please retarget to 2.3.1 or 2.3.0 as
>> appropriate.
>>
>> ==
>> But my bug isn't fixed?
>> ==
>>
>> In order to make timely releases, we will typically not hold the release
>> unless the bug in question is a regression from 2.2.0. That being said, if
>> there is something which is a regression from 2.2.0 that has not been
>> correctly targeted please ping me or a committer to help target the issue
>> (you can see the open issues listed as impacting Spark 2.3.0 at
>> https://s.apache.org/WmoI).
>>
>> ===
>> What are the unresolved issues targeted for 2.3.0?
>> ===
>>
>> Please see https://s.apache.org/oXKi. At the time of the writing, there
>> are 19 JIRA issues targeting 2.3.0 tracking various QA/audit tasks, test
>> failures and other feature/bugs. In particular, we've currently marked 3
>> JIRAs as release blockers that are being actively worked on:
>>
>> 1. SPARK-23051 that tracks a regression in the Spark UI
>> 2. SPARK-23020 and SPARK-23000 that track a couple of flaky tests that
>> are responsible for build failures. Additionally,
>> https://github.com/apache/spark/pull/20242 fixes a few Java linter
>> errors in RC1.
>>
>> Given that these blockers are fairly isolated, in the spirit of starting a
>> thorough QA early, this RC1 aims to serve as a good approximation of the
>> functionality of the final release.
>>
>> Regards,
>> Sameer
>>
>


Re: Broken SQL Visualization?

2018-01-15 Thread Ted Yu
Did you include any picture?
It looks like the picture didn't go through.
Please use a third-party site.
Thanks
 Original message From: Tomasz Gawęda 
 Date: 1/15/18  2:07 PM  (GMT-08:00) To: 
dev@spark.apache.org, u...@spark.apache.org Subject: Broken SQL Visualization? 

Hi,
today I updated my test cluster to the current Spark master; after that my SQL
Visualization page started to crash with the following error in JS:

Screenshot was cut for readability and to hide internal server names ;)


It may be caused by the upgrade or by some code changes, but - to be honest - I did
not use any new operators nor any new Spark function, so it should render
correctly, like a few days ago. Some visualizations work fine, some crash; I
have no idea why it may not work. Can anyone help me? It is probably a bug in Spark, but it's
hard for me to say in which place.



Thanks in advance!
Pozdrawiam / Best regards,
Tomek


Re: Welcoming Saisai (Jerry) Shao as a committer

2017-08-28 Thread Ted Yu
Congratulations, Jerry !

On Mon, Aug 28, 2017 at 6:28 PM, Matei Zaharia 
wrote:

> Hi everyone,
>
> The PMC recently voted to add Saisai (Jerry) Shao as a committer. Saisai
> has been contributing to many areas of the project for a long time, so it’s
> great to see him join. Join me in thanking and congratulating him!
>
> Matei
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Performance Benchmark Hbase vs Cassandra

2017-06-29 Thread Ted Yu
For Cassandra, I found:

https://www.instaclustr.com/multi-data-center-sparkcassandra-benchmark-round-2/

My coworker (on vacation at the moment) was doing a benchmark with HBase.
When he comes back, the results can be published.

Note: it is hard to find comparison results with same setup (hardware,
number of nodes, etc).

On Thu, Jun 29, 2017 at 7:33 PM, Raj, Deepu 
wrote:

> Hi Team,
>
>
>
> I want to do a performance benchmark with some specific use case with
>   Spark -> HBase   and Spark -> Cassandra.
>
>
>
> Can anyone provide inputs:-
>
> 1.   Scenarios / Parameters to monitor?
>
> 2.   Any automation tool to make this work?
>
> 3.   Any previous Learnings/ Blogs/environment setup?
>
>
>
> Thanks,
>
> Deepu
>


Re: how to mention others in JIRA comment please?

2017-06-26 Thread Ted Yu
You can find the JIRA handle of the person you want to mention by going to
a JIRA where that person has commented.

e.g. you want to find the handle for Joseph.
You can go to:
https://issues.apache.org/jira/browse/SPARK-6635

and click on his name in comment:
https://issues.apache.org/jira/secure/ViewProfile.jspa?name=josephkb

The following constitutes a mention for him:
[~josephkb]

FYI

On Mon, Jun 26, 2017 at 6:56 PM, 萝卜丝炒饭 <1427357...@qq.com> wrote:

> Hi all,
>
> How do I mention others in a JIRA comment, please?
> I added @ before another member's name, but it didn't work.
>
> Would you please help me?
>
> thanks
> Fei Shao
>


Re: the compile of Spark stopped without any hints, would you like to help me please?

2017-06-25 Thread Ted Yu
Does adding -X to the mvn command give you more information?

Cheers

On Sun, Jun 25, 2017 at 5:29 AM, 萝卜丝炒饭 <1427357...@qq.com> wrote:

> Hi all,
>
> Today I used a new PC to compile Spark.
> At the beginning, it worked well.
> But it stopped at some point.
> The content in the console is:
> 
> [INFO]
> [INFO] --- maven-jar-plugin:2.6:test-jar (prepare-test-jar) @
> spark-parent_2.11 ---
> [INFO]
> [INFO] --- maven-site-plugin:3.3:attach-descriptor (attach-descriptor) @
> spark-parent_2.11 ---
> [INFO]
> [INFO] --- maven-shade-plugin:2.4.3:shade (default) @ spark-parent_2.11 ---
> [INFO] Including org.spark-project.spark:unused:jar:1.0.0 in the shaded
> jar.
> [INFO] Replacing original artifact with shaded artifact.
> [INFO]
> [INFO] --- maven-source-plugin:2.4:jar-no-fork (create-source-jar) @
> spark-parent_2.11 ---
> [INFO]
> [INFO] --- maven-source-plugin:2.4:test-jar-no-fork (create-source-jar) @
> spark-parent_2.11 ---
> [INFO]
> [INFO] 
> 
> [INFO] Building Spark Project Tags 2.1.2-SNAPSHOT
> [INFO] 
> 
> [INFO]
> [INFO] --- maven-enforcer-plugin:1.4.1:enforce (enforce-versions) @
> spark-tags_2.11 ---
> [INFO]
> [INFO] --- scala-maven-plugin:3.2.2:add-source (eclipse-add-source) @
> spark-tags_2.11 ---
> [INFO] Add Source directory: E:\spark\fromweb\spark-branch-
> 2.1\common\tags\src\main\scala
> [INFO] Add Test Source directory: E:\spark\fromweb\spark-branch-
> 2.1\common\tags\src\test\scala
> [INFO]
> [INFO] --- maven-dependency-plugin:2.10:build-classpath (default-cli) @
> spark-tags_2.11 ---
> [INFO] Dependencies classpath:
> C:\Users\shaof\.m2\repository\org\spark-project\spark\
> unused\1.0.0\unused-1.0.0.jar;C:\Users\shaof\.m2\repository\
> org\scala-lang\scala-library\2.11.8\scala-library-2.11.8.jar
> [INFO]
> [INFO] --- maven-remote-resources-plugin:1.5:process (default) @
> spark-tags_2.11 ---
> [INFO]
> [INFO] --- maven-resources-plugin:2.6:resources (default-resources) @
> spark-tags_2.11 ---
> [INFO] Using 'UTF-8' encoding to copy filtered resources.
> [INFO] skip non existing resourceDirectory E:\spark\fromweb\spark-branch-
> 2.1\common\tags\src\main\resources
> [INFO] Copying 3 resources
> [INFO]
> [INFO] --- scala-maven-plugin:3.2.2:compile (scala-compile-first) @
> spark-tags_2.11 ---
> [WARNING] Zinc server is not available at port 3030 - reverting to normal
> incremental compile
> [INFO] Using incremental compilation
> [INFO] Compiling 2 Scala sources and 6 Java sources to
> E:\spark\fromweb\spark-branch-2.1\common\tags\target\scala-2.11\classes...
> [WARNING] warning: [options] bootstrap class path not set in conjunction with -source 1.7
> [WARNING] 1 warning
> [INFO]
> [INFO] --- maven-compiler-plugin:3.5.1:compile (default-compile) @
> spark-tags_2.11 ---
> [INFO] Changes detected - recompiling the module!
> [INFO] Compiling 6 source files to E:\spark\fromweb\spark-branch-
> 2.1\common\tags\target\scala-2.11\classes
> [INFO]
> [INFO] --- maven-antrun-plugin:1.8:run (create-tmp-dir) @ spark-tags_2.11
> ---
> [INFO] Executing tasks
>
> main:
> [INFO] Executed tasks
> [INFO]
> [INFO] --- maven-resources-plugin:2.6:testResources
> (default-testResources) @ spark-tags_2.11 ---
> [INFO] Using 'UTF-8' encoding to copy filtered resources.
> [INFO] skip non existing resourceDirectory E:\spark\fromweb\spark-branch-
> 2.1\common\tags\src\test\resources
> [INFO] Copying 3 resources
> [INFO]
> [INFO] --- scala-maven-plugin:3.2.2:testCompile
> (scala-test-compile-first) @ spark-tags_2.11 ---
> [WARNING] Zinc server is not available at port 3030 - reverting to normal
> incremental compile
> [INFO] Using incremental compilation
> [INFO] Compiling 3 Java sources to E:\spark\fromweb\spark-branch-
> 2.1\common\tags\target\scala-2.11\test-classes...
> [WARNING] warning: [options] bootstrap class path not set in conjunction with -source 1.7
> [WARNING] 1 warning
> [INFO]
> [INFO] --- maven-compiler-plugin:3.5.1:testCompile (default-testCompile)
> @ spark-tags_2.11 ---
> [INFO] Nothing to compile - all classes are up to date
> [INFO]
> [INFO] --- maven-dependency-plugin:2.10:build-classpath
> (generate-test-classpath) @ spark-tags_2.11 ---
> [INFO]
> [INFO] --- maven-surefire-plugin:2.19.1:test (default-test) @
> spark-tags_2.11 ---
> [INFO] Tests are skipped.
> [INFO]
> [INFO] --- maven-surefire-plugin:2.19.1:test (test) @ spark-tags_2.11 ---
> [INFO] Tests are skipped.
> [INFO]
> [INFO] --- scalatest-maven-plugin:1.0:test (test) @ spark-tags_2.11 ---
> [INFO] Tests are skipped.
> [INFO]
> [INFO] --- maven-jar-plugin:2.6:test-jar (prepare-test-jar) @
> spark-tags_2.11 ---
> [INFO] Building jar: E:\spark\fromweb\spark-branch-
> 2.1\common\tags\target\spark-tags_2.11-2.1.2-SNAPSHOT-tests.jar
> [INFO]
> [INFO] --- maven-jar-plugin:2.6:jar (default-jar) @ spark-tags_2.11 ---
> [INFO] Building jar: E:\spark\fromweb\spark-branch-
> 2.1\common\tags\target\spark-tags_2.11-2.1.2-SNAPSHOT.jar
> [INFO]
> [INFO] --- 

Re: Should we consider a Spark 2.1.1 release?

2017-03-20 Thread Ted Yu
Timur:
Mind starting a new thread ?

I have the same question as you have. 

> On Mar 20, 2017, at 11:34 AM, Timur Shenkao  wrote:
> 
> Hello guys,
> 
> Spark benefits from stable versions, not frequent ones.
> A lot of people still have 1.6.x in production. Those who want the freshest 
> (like me) can always deploy nightly builds.
> My question is: how long will version 1.6 be supported? 
> 
> On Sunday, March 19, 2017, Holden Karau  wrote:
>> This discussions seems like it might benefit from its own thread as we've 
>> previously decided to lengthen release cycles but if their are different 
>> opinions about this it seems unrelated to the specific 2.1.1 release.
>> 
>>> On Sun, Mar 19, 2017 at 2:57 PM Jacek Laskowski  wrote:
>>> Hi Mark,
>>> 
>>> I appreciate your comment.
>>> 
>>> My thinking is that the more frequent the minor and patch releases, the
>>> more often end users can give them a shot and be part of the bigger
>>> release cycle for major releases. Spark's an OSS project and we all
>>> can make mistakes, and my thinking is that the more eyeballs, the
>>> fewer the mistakes. If we make very fine/minor releases
>>> often, we should be able to attract more people who spend their time on
>>> testing/verification and eventually contribute to a higher quality of
>>> Spark.
>>> 
>>> Pozdrawiam,
>>> Jacek Laskowski
>>> 
>>> https://medium.com/@jaceklaskowski/
>>> Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
>>> Follow me at https://twitter.com/jaceklaskowski
>>> 
>>> 
>>> On Sun, Mar 19, 2017 at 10:50 PM, Mark Hamstra  
>>> wrote:
>>> > That doesn't necessarily follow, Jacek. There is a point where too 
>>> > frequent
>>> > releases decrease quality. That is because releases don't come for free --
>>> > each one demands a considerable amount of time from release managers,
>>> > testers, etc. -- time that would otherwise typically be devoted to 
>>> > improving
>>> > (or at least adding to) the code. And that doesn't even begin to consider
>>> > the time that needs to be spent putting a new version into a larger 
>>> > software
>>> > distribution or that users need to put in to deploy and use a new version.
>>> > If you have an extremely lightweight deployment cycle, then small, quick
>>> > releases can make sense; but "lightweight" doesn't really describe a Spark
>>> > release. The concern for excessive overhead is a large part of the 
>>> > thinking
>>> > behind why we stretched out the roadmap to allow longer intervals between
>>> > scheduled releases. A similar concern does come into play for unscheduled
>>> > maintenance releases -- but I don't think that that is the forcing 
>>> > function
>>> > at this point: A 2.1.1 release is a good idea.
>>> >
>>> > On Sun, Mar 19, 2017 at 6:24 AM, Jacek Laskowski  wrote:
>>> >>
>>> >> +1
>>> >>
>>> >> More smaller and more frequent releases (so major releases get even more
>>> >> quality).
>>> >>
>>> >> Jacek
>>> >>
>>> >> On 13 Mar 2017 8:07 p.m., "Holden Karau"  wrote:
>>> >>>
>>> >>> Hi Spark Devs,
>>> >>>
>>> >>> Spark 2.1 has been out since end of December and we've got quite a few
>>> >>> fixes merged for 2.1.1.
>>> >>>
>>> >>> On the Python side one of the things I'd like to see us get out into a
>>> >>> patch release is a packaging fix (now merged) before we upload to PyPI &
>>> >>> Conda, and we also have the normal batch of fixes like toLocalIterator 
>>> >>> for
>>> >>> large DataFrames in PySpark.
>>> >>>
>>> >>> I've chatted with Felix & Shivaram who seem to think the R side is
>>> >>> looking close to in good shape for a 2.1.1 release to submit to CRAN (if
>>> >>> I've miss-spoken my apologies). The two outstanding issues that are 
>>> >>> being
>>> >>> tracked for R are SPARK-18817, SPARK-19237.
>>> >>>
>>> >>> Looking at the other components quickly it seems like structured
>>> >>> streaming could also benefit from a patch release.
>>> >>>
>>> >>> What do others think - are there any issues people are actively 
>>> >>> targeting
>>> >>> for 2.1.1? Is this too early to be considering a patch release?
>>> >>>
>>> >>> Cheers,
>>> >>>
>>> >>> Holden
>>> >>> --
>>> >>> Cell : 425-233-8271
>>> >>> Twitter: https://twitter.com/holdenkarau
>>> >
>>> >
>> 
>> -- 
>> Cell : 425-233-8271
>> Twitter: https://twitter.com/holdenkarau


Re: HBaseContext with Spark

2017-01-25 Thread Ted Yu
The references are vendor-specific.

I suggest contacting the vendor's mailing list for your PR.

My initial interpretation of "HBase repository" was the Apache one.

Cheers

On Wed, Jan 25, 2017 at 7:38 AM, Chetan Khatri <chetan.opensou...@gmail.com>
wrote:

> @Ted Yu, Correct, but the HBase-Spark module available in the HBase repository
> seems too old and the code is not optimized yet; I have already
> submitted a PR for the same. I don't know whether it is clearly mentioned that it is now
> part of HBase itself, since people are still committing to the older repo where
> the original code is still old. [1]
>
> Other sources have updated info [2]
>
> Ref.
> [1] http://blog.cloudera.com/blog/2015/08/apache-spark-
> comes-to-apache-hbase-with-hbase-spark-module/
> [2] https://github.com/cloudera-labs/SparkOnHBase ,
> https://github.com/esamson/SparkOnHBase
>
> On Wed, Jan 25, 2017 at 8:13 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> Though no hbase release has the hbase-spark module, you can find the
>> backport patch on HBASE-14160 (for Spark 1.6)
>>
>> You can build the hbase-spark module yourself.
>>
>> Cheers
>>
>> On Wed, Jan 25, 2017 at 3:32 AM, Chetan Khatri <
>> chetan.opensou...@gmail.com> wrote:
>>
>>> Hello Spark Community Folks,
>>>
>>> Currently I am using HBase 1.2.4 and Hive 1.2.1, I am looking for Bulk
>>> Load from Hbase to Hive.
>>>
>>> I have seen couple of good example at HBase Github Repo:
>>> https://github.com/apache/hbase/tree/master/hbase-spark
>>>
>>> If I would like to use HBaseContext with HBase 1.2.4, how it can be done
>>> ? Or which version of HBase has more stability with HBaseContext ?
>>>
>>> Thanks.
>>>
>>
>>
>
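For reference, a minimal sketch of what using HBaseContext from the hbase-spark module looks like. Method names follow the module in the HBase repo; exact signatures can differ between the backport and newer versions, and the table, column family and qualifier below are placeholders:

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.Put
    import org.apache.hadoop.hbase.spark.HBaseContext
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("hbase-context-sketch"))
    val hbaseConf = HBaseConfiguration.create()   // picks up hbase-site.xml from the classpath
    val hbaseContext = new HBaseContext(sc, hbaseConf)

    // Bulk-put an RDD of (rowKey, value) pairs into a table.
    val rdd = sc.parallelize(Seq(("row1", "v1"), ("row2", "v2")))
    hbaseContext.bulkPut[(String, String)](
      rdd,
      TableName.valueOf("my_table"),              // placeholder table name
      { case (key, value) =>
          new Put(Bytes.toBytes(key))
            .addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
      })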


Re: HBaseContext with Spark

2017-01-25 Thread Ted Yu
Though no hbase release has the hbase-spark module, you can find the
backport patch on HBASE-14160 (for Spark 1.6)

You can build the hbase-spark module yourself.

Cheers

On Wed, Jan 25, 2017 at 3:32 AM, Chetan Khatri 
wrote:

> Hello Spark Community Folks,
>
> Currently I am using HBase 1.2.4 and Hive 1.2.1, and I am looking to bulk
> load from HBase to Hive.
>
> I have seen a couple of good examples at the HBase GitHub repo:
> https://github.com/apache/hbase/tree/master/hbase-spark
>
> If I would like to use HBaseContext with HBase 1.2.4, how can it be done?
> Or which version of HBase is more stable with HBaseContext?
>
> Thanks.
>


Re: Approach: Incremental data load from HBASE

2016-12-21 Thread Ted Yu
Incremental load traditionally means generating HFiles and
using org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles to load the
data into HBase.

For your use case, the producer needs to find rows where the flag is 0 or 1.
After such rows are obtained, it is up to you how the result of the processing
is delivered to HBase.

Cheers

On Wed, Dec 21, 2016 at 8:00 AM, Chetan Khatri <chetan.opensou...@gmail.com>
wrote:

> OK, sure, will ask.
>
> But what would be a generic best-practice solution for incremental load from
> HBase?
>
> On Wed, Dec 21, 2016 at 8:42 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> I haven't used Gobblin.
>> You can consider asking Gobblin mailing list of the first option.
>>
>> The second option would work.
>>
>>
>> On Wed, Dec 21, 2016 at 2:28 AM, Chetan Khatri <
>> chetan.opensou...@gmail.com> wrote:
>>
>>> Hello Guys,
>>>
>>> I would like to understand different approach for Distributed
>>> Incremental load from HBase, Is there any *tool / incubactor tool* which
>>> satisfy requirement ?
>>>
>>> *Approach 1:*
>>>
>>> Write Kafka Producer and maintain manually column flag for events and
>>> ingest it with Linkedin Gobblin to HDFS / S3.
>>>
>>> *Approach 2:*
>>>
>>> Run Scheduled Spark Job - Read from HBase and do transformations and
>>> maintain flag column at HBase Level.
>>>
>>> In above both approach, I need to maintain column level flags. such as 0
>>> - by default, 1-sent,2-sent and acknowledged. So next time Producer will
>>> take another 1000 rows of batch where flag is 0 or 1.
>>>
>>> I am looking for best practice approach with any distributed tool.
>>>
>>> Thanks.
>>>
>>> - Chetan Khatri
>>>
>>
>>
>
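For the second approach, a minimal Scala sketch of a scheduled job that scans only rows whose flag column is 0 or 1, using the standard TableInputFormat path. The table, column family, qualifier, and the string encoding of the flag are assumptions here:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.{Result, Scan}
    import org.apache.hadoop.hbase.filter.{CompareFilter, FilterList, SingleColumnValueFilter}
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableMapReduceUtil}
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("incremental-hbase-scan"))

    // Only rows whose cf:flag column equals "0" or "1" (the flag encoding is assumed).
    val scan = new Scan()
    val flagIsZero = new SingleColumnValueFilter(Bytes.toBytes("cf"), Bytes.toBytes("flag"),
      CompareFilter.CompareOp.EQUAL, Bytes.toBytes("0"))
    val flagIsOne = new SingleColumnValueFilter(Bytes.toBytes("cf"), Bytes.toBytes("flag"),
      CompareFilter.CompareOp.EQUAL, Bytes.toBytes("1"))
    scan.setFilter(new FilterList(FilterList.Operator.MUST_PASS_ONE, flagIsZero, flagIsOne))

    val conf = HBaseConfiguration.create()
    conf.set(TableInputFormat.INPUT_TABLE, "events")          // placeholder table
    conf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan))

    val rows = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])
    // ... transform `rows`, deliver downstream, then flip the flag column back in HBase.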


Re: Approach: Incremental data load from HBASE

2016-12-21 Thread Ted Yu
I haven't used Gobblin.
You can consider asking the Gobblin mailing list about the first option.

The second option would work.


On Wed, Dec 21, 2016 at 2:28 AM, Chetan Khatri 
wrote:

> Hello Guys,
>
> I would like to understand different approaches for distributed incremental
> load from HBase. Is there any *tool / incubator tool* which satisfies the
> requirement?
>
> *Approach 1:*
>
> Write Kafka Producer and maintain manually column flag for events and
> ingest it with Linkedin Gobblin to HDFS / S3.
>
> *Approach 2:*
>
> Run Scheduled Spark Job - Read from HBase and do transformations and
> maintain flag column at HBase Level.
>
> In both approaches above, I need to maintain column-level flags, such as 0
> by default, 1 = sent, 2 = sent and acknowledged. So next time the producer will take
> another batch of 1000 rows where the flag is 0 or 1.
>
> I am looking for best practice approach with any distributed tool.
>
> Thanks.
>
> - Chetan Khatri
>


Re: Difference between netty and netty-all

2016-12-05 Thread Ted Yu
This should be in netty-all :

$ jar tvf
/home/x/.m2/repository/io/netty/netty-all/4.0.29.Final/netty-all-4.0.29.Final.jar
| grep ThreadLocalRandom
   967 Tue Jun 23 11:10:30 UTC 2015
io/netty/util/internal/ThreadLocalRandom$1.class
  1079 Tue Jun 23 11:10:30 UTC 2015
io/netty/util/internal/ThreadLocalRandom$2.class
  5973 Tue Jun 23 11:10:30 UTC 2015
io/netty/util/internal/ThreadLocalRandom.class

On Mon, Dec 5, 2016 at 8:53 AM, Nicholas Chammas  wrote:

> I’m looking at the list of dependencies here:
>
> https://github.com/apache/spark/search?l=Groff=netty;
> type=Code=%E2%9C%93
>
> What’s the difference between netty and netty-all?
>
> The reason I ask is because I’m looking at a Netty PR
>  and trying to figure out if
> Spark 2.0.2 is using a version of Netty that includes that PR or not.
>
> Nick
> ​
>


Re: PSA: JIRA resolutions and meanings

2016-10-08 Thread Ted Yu
Makes sense. 

I trust Hyukjin, Holden and Cody's judgement in their respective areas. 

I just wish to see more participation from the committers. 

Thanks 

> On Oct 8, 2016, at 8:27 AM, Sean Owen  wrote:
> 
> Hyukjin

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: PSA: JIRA resolutions and meanings

2016-10-08 Thread Ted Yu
I think only committers should resolve JIRAs that were not created by
themselves.

> On Oct 8, 2016, at 6:53 AM, Hyukjin Kwon  wrote:
> 
> I am uncertain too. It'd be great if these are documented too.
> 
> FWIW, in my case, I privately asked and told Sean first that I am going to 
> look through the JIRAs
> and resolve some via the suggested conventions from Sean.
> (Definitely all blames should be on me if I have done something terribly 
> wrong). 
> 
> 
> 
> 2016-10-08 22:37 GMT+09:00 Cody Koeninger :
>> That makes sense, thanks.
>> 
>> One thing I've never been clear on is who should be allowed to resolve 
>> Jiras.  Can I go clean up the backlog of Kafka Jiras that weren't created by 
>> me?
>> 
>> If there's an informal policy here, can we update the wiki to reflect it?  
>> Maybe it's there already, but I didn't see it last time I looked.
>> 
>> 
>> On Oct 8, 2016 4:10 AM, "Sean Owen"  wrote:
>> That flood of emails means several people (Xiao, Holden mostly AFAICT) have 
>> been updating the status of old JIRAs. Thank you, I think that really does 
>> help. 
>> 
>> I have a suggested set of conventions I've been using, just to bring some 
>> order to the resolutions. It helps because JIRA functions as a huge archive 
>> of decisions and the more accurately we can record that the better. What do 
>> people think of this?
>> 
>> - Resolve as Fixed if there's a change you can point to that resolved the 
>> issue
>> - If the issue is a proper subset of another issue, mark it a Duplicate of 
>> that issue (rather than the other way around)
>> - If it's probably resolved, but not obvious what fixed it or when, then 
>> Cannot Reproduce or Not a Problem
>> - Obsolete issue? Not a Problem
>> - If it's a coherent issue but does not seem like there is support or 
>> interest in acting on it, then Won't Fix
>> - If the issue doesn't make sense (non-Spark issue, etc) then Invalid
>> - I tend to mark Umbrellas as "Done" when done if they're just containers
>> - Try to set Fix version
>> - Try to set Assignee to the person who most contributed to the resolution. 
>> Usually the person who opened the PR. Strong preference for ties going to 
>> the more 'junior' contributor
>> 
>> The only ones I think are sort of important are getting the Duplicate 
>> pointers right, and possibly making sure that Fixed issues have a clear path 
>> to finding what change fixed it and when. The rest doesn't matter much.
> 


Re: Spark 1.x/2.x qualifiers in downstream artifact names

2016-08-24 Thread Ted Yu
'Spark 1.x and Scala 2.10 & 2.11' was repeated.

I guess your second line should read:

org.bdgenomics.adam:adam-{core,apis,cli}-spark2_2.1[0,1]  for Spark 2.x and
Scala 2.10 & 2.11

On Wed, Aug 24, 2016 at 9:41 AM, Michael Heuer  wrote:

> Hello,
>
> We're a project downstream of Spark and need to provide separate artifacts
> for Spark 1.x and Spark 2.x.  Has any convention been established or even
> proposed for artifact names and/or qualifiers?
>
> We are currently thinking
>
> org.bdgenomics.adam:adam-{core,apis,cli}_2.1[0,1]  for Spark 1.x and
> Scala 2.10 & 2.11
>
>   and
>
> org.bdgenomics.adam:adam-{core,apis,cli}-spark2_2.1[0,1]  for Spark 1.x
> and Scala 2.10 & 2.11
>
> https://github.com/bigdatagenomics/adam/issues/1093
>
>
> Thanks in advance,
>
>michael
>


Re: Welcoming Felix Cheung as a committer

2016-08-08 Thread Ted Yu
Congratulations, Felix.

On Mon, Aug 8, 2016 at 11:15 AM, Matei Zaharia 
wrote:

> Hi all,
>
> The PMC recently voted to add Felix Cheung as a committer. Felix has been
> a major contributor to SparkR and we're excited to have him join
> officially. Congrats and welcome, Felix!
>
> Matei
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: SQL Based Authorization for SparkSQL

2016-08-02 Thread Ted Yu
There was SPARK-12008, which was closed.

Not sure if there is an active JIRA in this regard.

On Tue, Aug 2, 2016 at 6:40 PM, 马晓宇  wrote:

> Hi guys,
>
> I wonder if anyone working on SQL based authorization already or not.
>
> This is something we need badly right now. We tried embedding a
> Hive frontend in front of SparkSQL to achieve this, but it's not a very
> elegant solution. Does SparkSQL have a way to do this, or is anyone already
> working on it?
>
> If not, we might consider make some contributions here and might need
> guidance during the work.
>
> Thanks.
>
> Shawn
>
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Build speed

2016-07-22 Thread Ted Yu
I assume you have enabled Zinc.

Cheers

On Fri, Jul 22, 2016 at 7:54 AM, Mikael Ståldal 
wrote:

> Is there any way to speed up an incremental build of Spark?
>
> For me it takes 8 minutes to build the project with just a few code
> changes.
>
> --
> [image: MagineTV]
>
> *Mikael Ståldal*
> Senior software developer
>
> *Magine TV*
> mikael.stal...@magine.com
> Grev Turegatan 3  | 114 46 Stockholm, Sweden  |   www.magine.com
>
> Privileged and/or Confidential Information may be contained in this
> message. If you are not the addressee indicated in this message
> (or responsible for delivery of the message to such a person), you may not
> copy or deliver this message to anyone. In such case,
> you should destroy this message and kindly notify the sender by reply
> email.
>


Re: Spark performance regression test suite

2016-07-08 Thread Ted Yu
Found a few issues:

[SPARK-6810] Performance benchmarks for SparkR
[SPARK-2833] performance tests for linear regression
[SPARK-15447] Performance test for ALS in Spark 2.0

Haven't found one for Spark core.

On Fri, Jul 8, 2016 at 8:58 AM, Michael Allman  wrote:

> Hello,
>
> I've seen a few messages on the mailing list regarding Spark performance
> concerns, especially regressions from previous versions. It got me thinking
> that perhaps an automated performance regression suite would be a
> worthwhile contribution? Is anyone working on this? Do we have a Jira issue
> for it?
>
> I cannot commit to taking charge of such a project. I just thought it
> would be a great contribution for someone who does have the time and the
> chops to build it.
>
> Cheers,
>
> Michael
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] Release Apache Spark 2.0.0 (RC2)

2016-07-06 Thread Ted Yu
Running the following command:
build/mvn clean -Phive -Phive-thriftserver -Pyarn -Phadoop-2.6 -Psparkr
-Dhadoop.version=2.7.0 package

The build stopped with this test failure:

- SPARK-9757 Persist Parquet relation with decimal column *** FAILED
***


On Wed, Jul 6, 2016 at 6:25 AM, Sean Owen  wrote:

> Yeah we still have some blockers; I agree SPARK-16379 is a blocker
> which came up yesterday. We also have 5 existing blockers, all doc
> related:
>
> SPARK-14808 Spark MLlib, GraphX, SparkR 2.0 QA umbrella
> SPARK-14812 ML, Graph 2.0 QA: API: Experimental, DeveloperApi, final,
> sealed audit
> SPARK-14816 Update MLlib, GraphX, SparkR websites for 2.0
> SPARK-14817 ML, Graph, R 2.0 QA: Programming guide update and migration
> guide
> SPARK-15124 R 2.0 QA: New R APIs and API docs
>
> While we'll almost surely need another RC, this one is well worth
> testing. It's much closer than even the last one.
>
> The sigs/hashes check out, and I successfully built with Ubuntu 16 /
> Java 8 with -Pyarn -Phadoop-2.7 -Phive. Tests pass except for:
>
> DirectKafkaStreamSuite:
> - offset recovery *** FAILED ***
>   The code passed to eventually never returned normally. Attempted 196
> times over 10.028979855 seconds. Last failure message:
> strings.forall({
> ((x$1: Any) => DirectKafkaStreamSuite.collectedData.contains(x$1))
>   }) was false. (DirectKafkaStreamSuite.scala:250)
> - Direct Kafka stream report input information
>
> I know we've seen this before and tried to fix it but it may need another
> look.
>
> On Wed, Jul 6, 2016 at 6:35 AM, Reynold Xin  wrote:
> > Please vote on releasing the following candidate as Apache Spark version
> > 2.0.0. The vote is open until Friday, July 8, 2016 at 23:00 PDT and
> passes
> > if a majority of at least 3 +1 PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Spark 2.0.0
> > [ ] -1 Do not release this package because ...
> >
> >
> > The tag to be voted on is v2.0.0-rc2
> > (4a55b2326c8cf50f772907a8b73fd5e7b3d1aa06).
> >
> > This release candidate resolves ~2500 issues:
> > https://s.apache.org/spark-2.0.0-jira
> >
> > The release files, including signatures, digests, etc. can be found at:
> > http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc2-bin/
> >
> > Release artifacts are signed with the following key:
> > https://people.apache.org/keys/committer/pwendell.asc
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1189/
> >
> > The documentation corresponding to this release can be found at:
> > http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc2-docs/
> >
> >
> > =
> > How can I help test this release?
> > =
> > If you are a Spark user, you can help us test this release by taking an
> > existing Spark workload and running on this release candidate, then
> > reporting any regressions from 1.x.
> >
> > ==
> > What justifies a -1 vote for this release?
> > ==
> > Critical bugs impacting major functionalities.
> >
> > Bugs already present in 1.x, missing features, or bugs related to new
> > features will not necessarily block this release. Note that historically
> > Spark documentation has been published on the website separately from the
> > main release so we do not need to block the release due to documentation
> > errors either.
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] Release Apache Spark 1.6.2 (RC1)

2016-06-17 Thread Ted Yu
Docker Integration Tests failed on Linux:

http://pastebin.com/Ut51aRV3

Here was the command I used:

mvn clean -Phive -Phive-thriftserver -Pyarn -Phadoop-2.6 -Psparkr
-Dhadoop.version=2.7.0 package

Has anyone seen similar error ?

Thanks

On Thu, Jun 16, 2016 at 9:49 PM, Reynold Xin  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 1.6.2!
>
> The vote is open until Sunday, June 19, 2016 at 22:00 PDT and passes if a
> majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.6.2
> [ ] -1 Do not release this package because ...
>
>
> The tag to be voted on is v1.6.2-rc1
> (4168d9c94a9564f6b3e62f5d669acde13a7c7cf6)
>
> The release files, including signatures, digests, etc. can be found at:
> https://home.apache.org/~pwendell/spark-releases/spark-1.6.2-rc1-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1184
>
> The documentation corresponding to this release can be found at:
> https://home.apache.org/~pwendell/spark-releases/spark-1.6.2-rc1-docs/
>
>
> ===
> == How can I help test this release? ==
> ===
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions from 1.6.1.
>
> 
> == What justifies a -1 vote for this release? ==
> 
> This is a maintenance release in the 1.6.x series.  Bugs already present
> in 1.6.1, missing features, or bugs related to new features will not
> necessarily block this release.
>
>


Re: Kryo registration for Tuples?

2016-06-08 Thread Ted Yu
I think the second group (3 classOf's) should be used.
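
A minimal sketch of what that registration could look like (MyClass standing in
for your own class):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(
    classOf[MyClass],
    classOf[(Long, MyClass)],
    classOf[(String, (Long, MyClass))]))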

Cheers

On Wed, Jun 8, 2016 at 4:53 PM, Alexander Pivovarov 
wrote:

> if my RDD is RDD[(String, (Long, MyClass))]
>
> Do I need to register
>
> classOf[MyClass]
> classOf[(Any, Any)]
>
> or
>
> classOf[MyClass]
> classOf[(Long, MyClass)]
> classOf[(String, (Long, MyClass))]
>
> ?
>
>


Re: Can't use UDFs with Dataframes in spark-2.0-preview scala-2.10

2016-06-07 Thread Ted Yu
Please go ahead.

On Tue, Jun 7, 2016 at 4:45 PM, franklyn 
wrote:

> Thanks for reproducing it, Ted. Should I make a Jira issue?
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Can-t-use-UDFs-with-Dataframes-in-spark-2-0-preview-scala-2-10-tp17845p17852.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Can't use UDFs with Dataframes in spark-2.0-preview scala-2.10

2016-06-07 Thread Ted Yu
I built with Scala 2.10

>>> df.select(add_one(df.a).alias('incremented')).collect()

The above just hung.

On Tue, Jun 7, 2016 at 3:31 PM, franklyn 
wrote:

> Thanks Ted !.
>
> I'm using
>
> https://github.com/apache/spark/commit/8f5a04b6299e3a47aca13cbb40e72344c0114860
> and building with scala-2.10
>
> I can confirm that it works with scala-2.11
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Can-t-use-UDFs-with-Dataframes-in-spark-2-0-preview-scala-2-10-tp17845p17847.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Dataset API agg question

2016-06-07 Thread Ted Yu
Have you tried the following ?

Seq(1->2, 1->5, 3->6).toDF("a", "b")

then you can refer to columns by name.

FYI
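
If you prefer to stay with the typed API, here is an untested sketch against
the 2.0 Dataset API that mirrors reduceByKey and avoids spelling out an
encoder:

val ds = Seq(1 -> 2, 1 -> 5, 3 -> 6).toDS()
val maxPerKey = ds.groupByKey(_._1)
  .reduceGroups((a, b) => if (a._2 >= b._2) a else b)  // keep the pair with the larger value
maxPerKey.map(_._2).take(10)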


On Tue, Jun 7, 2016 at 3:58 PM, Alexander Pivovarov 
wrote:

> I'm trying to switch from RDD API to Dataset API
> My question is about reduceByKey method
>
> e.g. in the following example I'm trying to rewrite
>
> sc.parallelize(Seq(1->2, 1->5, 3->6)).reduceByKey(math.max).take(10)
>
> using DS API. That is what I have so far:
>
> Seq(1->2, 1->5, 
> 3->6).toDS.groupBy(_._1).agg(max($"_2").as(ExpressionEncoder[Int])).take(10)
>
> Questions:
>
> 1. is it possible to avoid typing "as(ExpressionEncoder[Int])" or replace
> it with smth shorter?
>
> 2.  Why I have to use String column name in max function? e.g. $"_2" or
> col("_2").  can I use _._2 instead?
>
>
> Alex
>


Re: Can't use UDFs with Dataframes in spark-2.0-preview scala-2.10

2016-06-07 Thread Ted Yu
With commit 200f01c8fb15680b5630fbd122d44f9b1d096e02 using Scala 2.11:

Using Python version 2.7.9 (default, Apr 29 2016 10:48:06)
SparkSession available as 'spark'.
>>> from pyspark.sql import SparkSession
>>> from pyspark.sql.types import IntegerType, StructField, StructType
>>> from pyspark.sql.functions import udf
>>> from pyspark.sql.types import Row
>>> spark = SparkSession.builder.master('local[4]').appName('2.0
DF').getOrCreate()
>>> add_one = udf(lambda x: x + 1, IntegerType())
>>> schema = StructType([StructField('a', IntegerType(), False)])
>>> df = spark.createDataFrame([Row(a=1),Row(a=2)], schema)
>>> df.select(add_one(df.a).alias('incremented')).collect()
[Row(incremented=2), Row(incremented=3)]

Let me build with Scala 2.10 and try again.

On Tue, Jun 7, 2016 at 2:47 PM, Franklyn D'souza <
franklyn.dso...@shopify.com> wrote:

> I've built spark-2.0-preview (8f5a04b) with scala-2.10 using the following
>>
>>
>> ./dev/change-version-to-2.10.sh
>> ./dev/make-distribution.sh -DskipTests -Dzookeeper.version=3.4.5
>> -Dcurator.version=2.4.0 -Dscala-2.10 -Phadoop-2.6  -Pyarn -Phive
>
>
> and then ran the following code in a pyspark shell
>
> from pyspark.sql import SparkSession
>> from pyspark.sql.types import IntegerType, StructField, StructType
>> from pyspark.sql.functions import udf
>> from pyspark.sql.types import Row
>> spark = SparkSession.builder.master('local[4]').appName('2.0
>> DF').getOrCreate()
>> add_one = udf(lambda x: x + 1, IntegerType())
>> schema = StructType([StructField('a', IntegerType(), False)])
>> df = spark.createDataFrame([Row(a=1),Row(a=2)], schema)
>> df.select(add_one(df.a).alias('incremented')).collect()
>
>
> This never returns with a result.
>
>
>


Re: Can't compile 2.0-preview with scala 2.10

2016-06-06 Thread Ted Yu
See the following from
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/SPARK-master-COMPILE-sbt-SCALA-2.10/1642/consoleFull
:

+ SBT_FLAGS+=('-Dscala-2.10')
+ ./dev/change-scala-version.sh 2.10


FYI


On Mon, Jun 6, 2016 at 10:35 AM, Franklyn D'souza <
franklyn.dso...@shopify.com> wrote:

> Hi,
>
> I've checked out the 2.0-preview and attempted to build it
> with ./dev/make-distribution.sh -Pscala-2.10
>
> However i keep getting
>
> [INFO] --- maven-enforcer-plugin:1.4.1:enforce (enforce-versions) @
> spark-parent_2.11 ---
> [WARNING] Rule 0: org.apache.maven.plugins.enforcer.BannedDependencies
> failed with message:
> Found Banned Dependency: org.scala-lang.modules:scala-xml_2.11:jar:1.0.2
> Found Banned Dependency: org.scalatest:scalatest_2.11:jar:2.2.6
>
> Is Scala 2.10 not being supported going forward ? If so, the profile
> should probably be removed from the master pom.xml
>
>
> Thanks,
>
> Franklyn
>
>
>


Re: Welcoming Yanbo Liang as a committer

2016-06-03 Thread Ted Yu
Congratulations, Yanbo.

On Fri, Jun 3, 2016 at 7:48 PM, Matei Zaharia 
wrote:

> Hi all,
>
> The PMC recently voted to add Yanbo Liang as a committer. Yanbo has been a
> super active contributor in many areas of MLlib. Please join me in
> welcoming Yanbo!
>
> Matei
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: ClassCastException: SomeCaseClass cannot be cast to org.apache.spark.sql.Row

2016-05-24 Thread Ted Yu
Please log a JIRA.
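
In the meantime, one possible workaround (an untested sketch, reusing the
schema and sqlc from your snippet) is to build the nested structs as Rows
rather than case class instances, which is what the Row-based createDataFrame
expects:

val rdd2 = sc.parallelize(List(Row(List(Row(5, "ha"), Row(6, "ba")))))
val df2 = sqlc.createDataFrame(rdd2, schema)
df2.show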

Thanks

On Tue, May 24, 2016 at 8:33 AM, Koert Kuipers  wrote:

> hello,
> as we continue to test spark 2.0 SNAPSHOT in-house we ran into the
> following trying to port an existing application from spark 1.6.1 to spark
> 2.0.0-SNAPSHOT.
>
> given this code:
>
> case class Test(a: Int, b: String)
> val rdd = sc.parallelize(List(Row(List(Test(5, "ha"), Test(6, "ba")))))
> val schema = StructType(Seq(
>   StructField("x", ArrayType(
> StructType(Seq(
>   StructField("a", IntegerType, false),
>   StructField("b", StringType, true)
> )),
> true)
>   , true)
>   ))
> val df = sqlc.createDataFrame(rdd, schema)
> df.show
>
> this works fine in spark 1.6.1 and gives:
>
> ++
> |   x|
> ++
> |[[5,ha], [6,ba]]|
> ++
>
> but in spark 2.0.0-SNAPSHOT i get:
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
> in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage
> 0.0 (TID 0, localhost): java.lang.RuntimeException: Error while encoding:
> java.lang.ClassCastException: Test cannot be cast to
> org.apache.spark.sql.Row
> [info] getexternalrowfield(input[0, org.apache.spark.sql.Row, false], 0,
> x, IntegerType) AS x#0
> [info] +- getexternalrowfield(input[0, org.apache.spark.sql.Row, false],
> 0, x, IntegerType)
> [info]+- input[0, org.apache.spark.sql.Row, false]
>
>


Re: Using Travis for JDK7/8 compilation and lint-java.

2016-05-23 Thread Ted Yu
For #1 below, currently Jenkins uses Java 8:

JAVA_HOME=/usr/java/jdk1.8.0_60


How about switching to Java 7 ?


My two cents.


On Mon, May 23, 2016 at 1:24 PM, Dongjoon Hyun <dongj...@apache.org> wrote:

> Thank you for your opinion!
>
> Sure. I know that history and totally agree with all your concerns.
> I had indeed hesitated about sending this kind of suggestion for a while.
>
> If Travis CI cannot handle those simple jobs this time either,
> we can simply turn it off for the Spark PR queue again.
> We can see the result quickly in one or two days.
> To turn it on/off, Spark has nothing to do; the INFRA team will do that.
>
> In fact, the goal is not about using another CI (like Travis), it is about
> preventing the following issues:
>
> 1. JDK7 compilation errors. (Recently, 2 days ago and 5 days ago)
> 2. Java static analysis errors. (Not critical but more frequent.)
> 3. Maven installation errors. (Reported on this mailing
> list a month ago.)
>
> Scala 2.10 compilation errors are fixed nearly instantly. But, 1~3 were
> not.
> If SparkPullRequestBuilder can do the above 1~3, that's the best for us.
> Do you think it is possible in some ways?
>
> By the way, as of today, Spark has 724 Java files and 96762 lines (without
> comment/blank).
> It's about 1/3 of Scala code. It's not small.
> ---
> Language      files     blank   comment      code
> ---
> Scala          2368     63578    124904    322518
> Java            724     18569     23445     96762
> ---
>
> Dongjoon.
>
>
>
> On Mon, May 23, 2016 at 12:20 PM, Michael Armbrust <mich...@databricks.com
> > wrote:
>
>> We did turn on travis a few years ago, but ended up turning it off
>> because it was failing (I believe because of insufficient resources) which
>> was confusing for developers.  I wouldn't be opposed to turning it on if it
>> provides more/faster signal, but its not obvious to me that it would.  In
>> particular, do we know that given the rate PRs are created if we will hit
>> rate limits?
>>
>> Really my main feedback is, if the java linter is important we should
>> probably have it as part of the canonical build process.  I worry about
>> having more than one set of CI infrastructure to maintain.
>>
>> On Mon, May 23, 2016 at 9:43 AM, Dongjoon Hyun <dongj...@apache.org>
>> wrote:
>>
>>> Thank you, Steve and Hyukjin.
>>>
>>> And, don't worry, Ted.
>>>
>>> Travis launches new VMs for every PR.
>>>
>>> Apache Spark repository uses the following setting.
>>>
>>> VM: Google Compute Engine
>>> OS: Ubuntu 14.04.3 LTS Server Edition 64bit
>>> CPU: ~2 CORE
>>> RAM: 7.5GB
>>>
>>> FYI, you can find more information about this here.
>>>
>>>
>>> https://docs.travis-ci.com/user/ci-environment/#Virtualization-environments
>>>
>>> Dongjoon.
>>>
>>>
>>>
>>> On Mon, May 23, 2016 at 6:32 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>
>>>> Do you know if more than one PR would be verified on the same machine ?
>>>>
>>>> I wonder whether the 'mvn install' from two simultaneous PR builds may
>>>> have conflict.
>>>>
>>>> On Sun, May 22, 2016 at 9:21 PM, Dongjoon Hyun <dongj...@apache.org>
>>>> wrote:
>>>>
>>>>> Thank you for feedback. Sure, correctly, that's the reason why the
>>>>> current SparkPullRequestBuilder do not run `lint-java`. :-)
>>>>>
>>>>> In addition, that's the same reason why contributors are reluctant to
>>>>> run `lint-java` and causes breaking on JDK7 builds.
>>>>>
>>>>> Such a tedious and time-consuming job should be done by CI without
>>>>> human interventions.
>>>>>
>>>>> By the way, why do you think we need to wait for that? We should not
>>>>> wait for any CIs, we should continue our own work.
>>>>>
>>>>> My proposal isn't for making you wait to watch the result. There are
>>>>> two use cases I want for us to focus here.
>>>>>
>>>>> Case 1: When you make a PR to Spark PR queue.
>>>>>
>>>>> Travis CI will finish before SparkPullRequestBuilder.
>>>>> We will run the followings in parallel mode.
>>>>>  1. Current

Re: Running TPCDSQueryBenchmark results in java.lang.OutOfMemoryError

2016-05-23 Thread Ted Yu
Can you tell us the commit hash with which the test was run ?

For #2, if you can give the full stack trace, that would be nice.

Thanks

On Mon, May 23, 2016 at 8:58 AM, Ovidiu-Cristian MARCU <
ovidiu-cristian.ma...@inria.fr> wrote:

> Hi
>
> 1) Using the latest Spark 2.0, I've managed to run the first 9 queries of
> TPCDSQueryBenchmark, and then it ends with the OutOfMemoryError [1].
>
> *What was the configuration used for running this benchmark? Can you
> explain the meaning of 4 shuffle partitions? Thanks!*
>
> On my local system I use:
> ./bin/spark-submit --class
> org.apache.spark.sql.execution.benchmark.TPCDSQueryBenchmark --master
> local[4] jars/spark-sql_2.11-2.0.0-SNAPSHOT-tests.jar
> configured with:
>   .set("spark.sql.parquet.compression.codec", "snappy")
>   .set("spark.sql.shuffle.partitions", "4")
>   .set("spark.driver.memory", "3g")
>   .set("spark.executor.memory", "3g")
>   .set("spark.sql.autoBroadcastJoinThreshold", (20 * 1024 * 1024
> ).toString)
>
> Scale factor of TPCDS is 5, data generated using notes from
> https://github.com/databricks/spark-sql-perf.
>
> 2) Running spark-sql-perf with: val experiment =
> tpcds.runExperiment(tpcds.runnable) on the same dataset reveals some
> exceptions:
>
> Running execution *q9-v1.4* iteration: 1, StandardRun=true
> java.lang.NullPointerException
> at
> org.apache.spark.sql.execution.ScalarSubquery.dataType(subquery.scala:45)
> at
> org.apache.spark.sql.catalyst.expressions.CaseWhenBase.dataType(conditionalExpressions.scala:103)
> at
> org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:165)
> at
> org.apache.spark.sql.execution.ProjectExec$$anonfun$output$1.apply(basicPhysicalOperators.scala:33)
> ... at
> org.apache.spark.sql.execution.ProjectExec.output(basicPhysicalOperators.scala:33)
> at
> org.apache.spark.sql.execution.WholeStageCodegenExec.output(WholeStageCodegenExec.scala:289)
> at
> org.apache.spark.sql.execution.DeserializeToObject$$anonfun$2.apply(objects.scala:61)
> at
> org.apache.spark.sql.execution.DeserializeToObject$$anonfun$2.apply(objects.scala:60)
> at
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:774)
> at
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:774)
>
> or
>
> Running execution q25-v1.4 iteration: 1, StandardRun=true
> java.lang.IllegalStateException: Task -1024 has already locked
> broadcast_755_piece0 for writing
> at
> org.apache.spark.storage.BlockInfoManager.lockForWriting(BlockInfoManager.scala:232)
> at
> org.apache.spark.storage.BlockManager.removeBlock(BlockManager.scala:1296)
>
> Best,
> Ovidiu
>
> [1]
> Exception in thread "broadcast-exchange-164" java.lang.OutOfMemoryError:
> Java heap space
> at
> org.apache.spark.sql.execution.joins.LongToUnsafeRowMap.append(HashedRelation.scala:539)
> at
> org.apache.spark.sql.execution.joins.LongHashedRelation$.apply(HashedRelation.scala:803)
> at
> org.apache.spark.sql.execution.joins.HashedRelation$.apply(HashedRelation.scala:105)
> at
> org.apache.spark.sql.execution.joins.HashedRelationBroadcastMode.transform(HashedRelation.scala:816)
> at
> org.apache.spark.sql.execution.joins.HashedRelationBroadcastMode.transform(HashedRelation.scala:812)
> at
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:89)
> at
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:71)
> at
> org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:94)
> at
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:71)
> at
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:71)
> at
> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
> at
> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
>


Re: Using Travis for JDK7/8 compilation and lint-java.

2016-05-23 Thread Ted Yu
Do you know if more than one PR would be verified on the same machine ?

I wonder whether the 'mvn install' from two simultaneous PR builds may
conflict.

On Sun, May 22, 2016 at 9:21 PM, Dongjoon Hyun <dongj...@apache.org> wrote:

> Thank you for feedback. Sure, correctly, that's the reason why the current
> SparkPullRequestBuilder do not run `lint-java`. :-)
>
> In addition, that's the same reason why contributors are reluctant to run
> `lint-java` and causes breaking on JDK7 builds.
>
> Such a tedious and time-consuming job should be done by CI without human
> interventions.
>
> By the way, why do you think we need to wait for that? We should not wait
> for any CIs, we should continue our own work.
>
> My proposal isn't for making you wait to watch the result. There are two
> use cases I want for us to focus here.
>
> Case 1: When you make a PR to Spark PR queue.
>
> Travis CI will finish before SparkPullRequestBuilder.
> We will run the followings in parallel mode.
>  1. Current SparkPullRequestBuilder: JDK8 + sbt build + (no Java
> Linter)
>  2. Travis: JDK7 + mvn build + Java Linter
>  3. Travis: JDK8 + mvn build + Java Linter
>  As we know, 1 is the longest time-consuming one which have lots of
> works (except maven building or lint-  java). You don't need to wait more
> in many cases. Yes, in many cases, not all the cases.
>
>
> Case 2: When you prepare a PR on your branch.
>
> If you are at the final commit (maybe already-squashed), just go to
> case 1.
>
> However, usually, we makes lots of commits locally while making
> preparing our PR.
> And, finally we squashed them into one and send a PR to Spark.
> I mean you can use Travis CI during preparing your PRs.
> Again, don't wait for Travis CI. Just push it sometime or at every
> commit, and continue your work.
>
> At the final stage when you finish your coding, squash your commits
> into one,
> and amend your commit title or messages, see the Travis CI.
> Or, you can monitor Travis CI result on status menu bar.
> If it shows green icon, you have nothing to do.
>
>https://docs.travis-ci.com/user/apps/
>
> To sum up, I think we don't need to wait for any CIs. It's like an email.
> `Send and back to work.`
>
> Dongjoon.
>
>
> On Sun, May 22, 2016 at 8:32 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> Without Zinc, 'mvn -DskipTests clean install' takes ~30 minutes.
>>
>> Maybe not everyone is willing to wait that long.
>>
>> On Sun, May 22, 2016 at 1:30 PM, Dongjoon Hyun <dongj...@apache.org>
>> wrote:
>>
>>> Oh, Sure. My bad!
>>>
>>> - For Oracle JDK7, mvn -DskipTests install and run `dev/lint-java`.
>>> - For Oracle JDK8, mvn -DskipTests install and run `dev/lint-java`.
>>>
>>> Thank you, Ted.
>>>
>>> Dongjoon.
>>>
>>> On Sun, May 22, 2016 at 1:29 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>
>>>> The following line was repeated twice:
>>>>
>>>> - For Oracle JDK7, mvn -DskipTests install and run `dev/lint-java`.
>>>>
>>>> Did you intend to cover JDK 8 ?
>>>>
>>>> Cheers
>>>>
>>>> On Sun, May 22, 2016 at 1:25 PM, Dongjoon Hyun <dongj...@apache.org>
>>>> wrote:
>>>>
>>>>> Hi, All.
>>>>>
>>>>> I want to propose the followings.
>>>>>
>>>>> - Turn on Travis CI for Apache Spark PR queue.
>>>>> - Recommend this for contributors, too
>>>>>
>>>>> Currently, Spark provides Travis CI configuration file to help
>>>>> contributors check Scala/Java style conformance and JDK7/8 compilation
>>>>> easily during their preparing pull requests. Please note that it's only
>>>>> about static analysis.
>>>>>
>>>>> - For Oracle JDK7, mvn -DskipTests install and run `dev/lint-java`.
>>>>> - For Oracle JDK7, mvn -DskipTests install and run `dev/lint-java`.
>>>>> Scalastyle is included in the step 'mvn install', too.
>>>>>
>>>>> Yep, if you turn on your Travis CI configuration, you can already see
>>>>> the results on your branches before making PR. I wrote this email to
>>>>> prevent more failures proactively and community-widely.
>>>>>
>>>>> For stability, I have been monitoring that for two weeks. It detects
>>>>> the failures or recovery on JDK7 builds or Java linter on Spark master
>>>>> branch correctly. The only exceptional case I observed rarely is `timeout`
>>>>> failure due to hangs of maven. But, as we know, it's happen in our Jenkins
>>>>> SparkPullRequestBuilder, too. I think we can ignore that.
>>>>>
>>>>> I'm sure that this will save much more community's efforts on the
>>>>> static errors by preventing them at the very early stage. But, there might
>>>>> be another reason not to do this. I'm wondering about your thoughts.
>>>>>
>>>>> I can make a Apache INFRA Jira issue for this if there is some
>>>>> consensus.
>>>>>
>>>>> Warmly,
>>>>> Dongjoon.
>>>>>
>>>>
>>>>
>>>
>>
>


Re: Using Travis for JDK7/8 compilation and lint-java.

2016-05-22 Thread Ted Yu
Without Zinc, 'mvn -DskipTests clean install' takes ~30 minutes.

Maybe not everyone is willing to wait that long.

On Sun, May 22, 2016 at 1:30 PM, Dongjoon Hyun <dongj...@apache.org> wrote:

> Oh, Sure. My bad!
>
> - For Oracle JDK7, mvn -DskipTests install and run `dev/lint-java`.
> - For Oracle JDK8, mvn -DskipTests install and run `dev/lint-java`.
>
> Thank you, Ted.
>
> Dongjoon.
>
> On Sun, May 22, 2016 at 1:29 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> The following line was repeated twice:
>>
>> - For Oracle JDK7, mvn -DskipTests install and run `dev/lint-java`.
>>
>> Did you intend to cover JDK 8 ?
>>
>> Cheers
>>
>> On Sun, May 22, 2016 at 1:25 PM, Dongjoon Hyun <dongj...@apache.org>
>> wrote:
>>
>>> Hi, All.
>>>
>>> I want to propose the followings.
>>>
>>> - Turn on Travis CI for Apache Spark PR queue.
>>> - Recommend this for contributors, too
>>>
>>> Currently, Spark provides Travis CI configuration file to help
>>> contributors check Scala/Java style conformance and JDK7/8 compilation
>>> easily during their preparing pull requests. Please note that it's only
>>> about static analysis.
>>>
>>> - For Oracle JDK7, mvn -DskipTests install and run `dev/lint-java`.
>>> - For Oracle JDK7, mvn -DskipTests install and run `dev/lint-java`.
>>> Scalastyle is included in the step 'mvn install', too.
>>>
>>> Yep, if you turn on your Travis CI configuration, you can already see
>>> the results on your branches before making PR. I wrote this email to
>>> prevent more failures proactively and community-widely.
>>>
>>> For stability, I have been monitoring that for two weeks. It detects the
>>> failures or recovery on JDK7 builds or Java linter on Spark master branch
>>> correctly. The only exceptional case I observed rarely is `timeout` failure
>>> due to hangs of maven. But, as we know, it's happen in our Jenkins
>>> SparkPullRequestBuilder, too. I think we can ignore that.
>>>
>>> I'm sure that this will save much more community's efforts on the static
>>> errors by preventing them at the very early stage. But, there might be
>>> another reason not to do this. I'm wondering about your thoughts.
>>>
>>> I can make a Apache INFRA Jira issue for this if there is some consensus.
>>>
>>> Warmly,
>>> Dongjoon.
>>>
>>
>>
>


Re: Using Travis for JDK7/8 compilation and lint-java.

2016-05-22 Thread Ted Yu
The following line was repeated twice:

- For Oracle JDK7, mvn -DskipTests install and run `dev/lint-java`.

Did you intend to cover JDK 8 ?

Cheers

On Sun, May 22, 2016 at 1:25 PM, Dongjoon Hyun  wrote:

> Hi, All.
>
> I want to propose the followings.
>
> - Turn on Travis CI for Apache Spark PR queue.
> - Recommend this for contributors, too
>
> Currently, Spark provides Travis CI configuration file to help
> contributors check Scala/Java style conformance and JDK7/8 compilation
> easily during their preparing pull requests. Please note that it's only
> about static analysis.
>
> - For Oracle JDK7, mvn -DskipTests install and run `dev/lint-java`.
> - For Oracle JDK7, mvn -DskipTests install and run `dev/lint-java`.
> Scalastyle is included in the step 'mvn install', too.
>
> Yep, if you turn on your Travis CI configuration, you can already see the
> results on your branches before making PR. I wrote this email to prevent
> more failures proactively and community-widely.
>
> For stability, I have been monitoring that for two weeks. It detects the
> failures or recovery on JDK7 builds or Java linter on Spark master branch
> correctly. The only exceptional case I rarely observed is a `timeout` failure
> due to hangs of maven. But, as we know, that happens in our Jenkins
> SparkPullRequestBuilder, too. I think we can ignore that.
>
> I'm sure that this will save much more community's efforts on the static
> errors by preventing them at the very early stage. But, there might be
> another reason not to do this. I'm wondering about your thoughts.
>
> I can make a Apache INFRA Jira issue for this if there is some consensus.
>
> Warmly,
> Dongjoon.
>


Re: Quick question on spark performance

2016-05-20 Thread Ted Yu
Yash:
Can you share the JVM parameters you used ?

How many partitions are there in your data set ?
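
Also, to check the GC theory, it may help to turn on GC logging on the
executors, e.g. by adding something like the following to your spark-submit
(adjust to your setup):

--conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \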

Thanks

On Fri, May 20, 2016 at 5:59 PM, Reynold Xin  wrote:

> It's probably due to GC.
>
> On Fri, May 20, 2016 at 5:54 PM, Yash Sharma  wrote:
>
>> Hi All,
>> I am here to get some expert advice on a use case I am working on.
>>
>> Cluster & job details below -
>>
>> Data - 6 Tb
>> Cluster - EMR - 15 Nodes C3-8xLarge (shared by other MR apps)
>>
>> Parameters-
>> --executor-memory 10G \
>> --executor-cores 6 \
>> --conf spark.dynamicAllocation.enabled=true \
>> --conf spark.dynamicAllocation.initialExecutors=15 \
>>
>> Runtime : 3 Hrs
>>
>> On monitoring the metrics I notices 10G for executors is not required
>> (since I don't have lot of groupings)
>>
>> Reducing to --executor-memory 3G, Runtime reduced to: 2 Hrs
>>
>> Question:
>> On adding more nodes now has absolutely no effect on the runtime. Is
>> there anything I can tune/change/experiment with to make the job faster.
>>
>> Workload: Mostly reduceBy's and scans.
>>
>> Would appreciate any insights and thoughts. Best Regards
>>
>>
>>
>


Re: Query parsing error for the join query between different database

2016-05-18 Thread Ted Yu
Which release of Spark / Hive are you using ?

Cheers

> On May 18, 2016, at 6:12 AM, JaeSung Jun  wrote:
> 
> Hi,
> 
> I'm working on custom data source provider, and i'm using fully qualified 
> table name in FROM clause like following :
> 
> SELECT user. uid, dept.name
> FROM userdb.user user, deptdb.dept
> WHERE user.dept_id = dept.id
> 
> and i've got the following error :
> 
> MismatchedTokenException(279!=26)
>   at 
> org.antlr.runtime.BaseRecognizer.recoverFromMismatchedToken(BaseRecognizer.java:617)
>   at org.antlr.runtime.BaseRecognizer.match(BaseRecognizer.java:115)
>   at 
> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.tableSource(HiveParser_FromClauseParser.java:4608)
>   at 
> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromSource(HiveParser_FromClauseParser.java:3729)
>   at 
> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.joinSource(HiveParser_FromClauseParser.java:1873)
>   at 
> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromClause(HiveParser_FromClauseParser.java:1518)
>   at 
> org.apache.hadoop.hive.ql.parse.HiveParser.fromClause(HiveParser.java:45861)
>   at 
> org.apache.hadoop.hive.ql.parse.HiveParser.selectStatement(HiveParser.java:41516)
>   at 
> org.apache.hadoop.hive.ql.parse.HiveParser.regularBody(HiveParser.java:41402)
>   at 
> org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpressionBody(HiveParser.java:40413)
>   at 
> org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpression(HiveParser.java:40283)
>   at 
> org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1590)
>   at 
> org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1109)
>   at 
> org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:202)
>   at 
> org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
>   at org.apache.spark.sql.hive.HiveQl$.getAst(HiveQl.scala:276)
>   at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:303)
>   at 
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41)
>   at 
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40)
>   at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
>   at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
>   at 
> scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
>   at 
> org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:34)
>   at org.apache.spark.sql.hive.HiveQl$.parseSql(HiveQl.scala:295)
> 
> Any idea?
> 
> Thanks
> Jason


Re: dataframe udf functioin will be executed twice when filter on new column created by withColumn

2016-05-11 Thread Ted Yu
In the master branch, the behavior is the same.

Suggest opening a JIRA if you haven't done so.
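
Until that is resolved, one common workaround is to materialize the derived
column before filtering on it, so the udf is evaluated only once per row (a
sketch, untested here):

val newDF = df.withColumn("new", udfFunc(df("old"))).cache()
newDF.count()                       // materializes the cache; the udf runs once per row
newDF.filter("new <> 'a1'").show()  // reads the cached column instead of re-running the udf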

On Wed, May 11, 2016 at 6:55 AM, Tony Jin  wrote:

> Hi guys,
>
> I have a problem with Spark DataFrames. My Spark version is 1.6.1.
> Basically, I used a udf with df.withColumn to create a "new" column, and then
> I filter the values on this new column and call show (an action). I see that the
> udf function (which is used by withColumn to create the new column) is
> called twice (duplicated). If I filter on the "old" column instead, the udf runs
> only once, which is expected. I attached the example code; lines 30~38 show the
> problem.
>
>  Does anyone know the internal reason? Can you give me any advice? Thank you
> very much.
>
>
>
> scala> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.functions._
>
> scala> val df = sc.parallelize(Seq(("a", "b"), ("a1", 
> "b1"))).toDF("old","old1")
> df: org.apache.spark.sql.DataFrame = [old: string, old1: string]
>
> scala> val udfFunc = udf((s: String) => {println(s"running udf($s)"); s })
> udfFunc: org.apache.spark.sql.UserDefinedFunction = 
> UserDefinedFunction(<function1>,StringType,List(StringType))
>
> scala> val newDF = df.withColumn("new", udfFunc(df("old")))
> newDF: org.apache.spark.sql.DataFrame = [old: string, old1: string, new: 
> string]
>
> scala> newDF.show
> running udf(a)
> running udf(a1)
> +---++---+
> |old|old1|new|
> +---++---+
> |  a|   b|  a|
> | a1|  b1| a1|
> +---++---+
>
>
> scala> val filteredOnNewColumnDF = newDF.filter("new <> 'a1'")
> filteredOnNewColumnDF: org.apache.spark.sql.DataFrame = [old: string, old1: 
> string, new: string]
>
> scala> val filteredOnOldColumnDF = newDF.filter("old <> 'a1'")
> filteredOnOldColumnDF: org.apache.spark.sql.DataFrame = [old: string, old1: 
> string, new: string]
>
> scala> filteredOnNewColumnDF.show
> running udf(a)
> running udf(a)
> running udf(a1)
> +---++---+
> |old|old1|new|
> +---++---+
> |  a|   b|  a|
> +---++---+
>
>
> scala> filteredOnOldColumnDF.show
> running udf(a)
> +---++---+
> |old|old1|new|
> +---++---+
> |  a|   b|  a|
> +---++---+
>
>
>
> Best wishes.
> By Linbo
>
>


Re: Structured Streaming with Kafka source/sink

2016-05-11 Thread Ted Yu
Please see this thread:

http://search-hadoop.com/m/q3RTt9XAz651PiG/Adhoc+queries+spark+streaming=Re+Adhoc+queries+on+Spark+2+0+with+Structured+Streaming

> On May 11, 2016, at 1:47 AM, Ofir Manor  wrote:
> 
> Hi,
> I'm trying out Structured Streaming from current 2.0 branch.
> Does the branch currently support Kafka as either source or sink? I couldn't 
> find a specific JIRA or design doc for that in SPARK-8360 or in the 
> examples... Is it still targeted for 2.0?
> 
> Also, I naively assume it will look similar to hdfs or JDBC stream("path"), 
> where the path will be some sort of Kafka URI (maybe protocol + broker list + 
> topic list). Is that the current thinking?
> 
> Thanks,
> Ofir Manor
> 
> Co-Founder & CTO | Equalum
> 
> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io


Re: Cache Shuffle Based Operation Before Sort

2016-05-08 Thread Ted Yu
I assume there were supposed to be images following this line (which I
don't see in the email thread):

bq. Let’s look at details of execution for 10 and 100 scale factor input

Consider using 3rd party image site.

On Sun, May 8, 2016 at 5:17 PM, Ali Tootoonchian  wrote:

> Thanks for your comment.
> Which image or chart are you pointing to?
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Cache-Shuffle-Based-Operation-Before-Sort-tp17331p17438.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Proposal of closing some PRs and maybe some PRs abandoned by its author

2016-05-06 Thread Ted Yu
PR #10572 was listed twice.

In the future, is it possible to include the contributor's handle beside
the PR number so that people can easily recognize their own PR ?

Thanks

On Fri, May 6, 2016 at 8:45 AM, Hyukjin Kwon  wrote:

> Hi all,
>
>
> This is similar to the proposal about closing PRs that I asked about before.
>
> I think the PRs suggested to be closed below are closable, but I am not very
> sure about the PRs apparently abandoned by their authors for at least a month.
>
> I remember the earlier discussion about auto-closing PRs, so I included those
> PRs below anyway.
>
> I looked through every open PR at this time and made the list below:
>
>
> 1. Suggested to be closed.
>
>
> https://github.com/apache/spark/pull/7739  <- not sure
>
> https://github.com/apache/spark/pull/9354
>
> https://github.com/apache/spark/pull/9451
>
> https://github.com/apache/spark/pull/10507
>
> https://github.com/apache/spark/pull/10486
>
> https://github.com/apache/spark/pull/10460
>
> https://github.com/apache/spark/pull/10967
>
> https://github.com/apache/spark/pull/10945
>
> https://github.com/apache/spark/pull/10701
>
> https://github.com/apache/spark/pull/10681
>
> https://github.com/apache/spark/pull/11766
>
>
>
> 2. Author not answering at least for a month.
>
>
> https://github.com/apache/spark/pull/9907
>
> https://github.com/apache/spark/pull/9920
>
> https://github.com/apache/spark/pull/9936
>
> https://github.com/apache/spark/pull/10052
>
> https://github.com/apache/spark/pull/10125
>
> https://github.com/apache/spark/pull/10209
>
> https://github.com/apache/spark/pull/10572 <- not sure
>
> https://github.com/apache/spark/pull/10326
>
> https://github.com/apache/spark/pull/10379
>
> https://github.com/apache/spark/pull/10403
>
> https://github.com/apache/spark/pull/10466
>
> https://github.com/apache/spark/pull/10572 <- not sure
>
> https://github.com/apache/spark/pull/10995
>
> https://github.com/apache/spark/pull/10887
>
> https://github.com/apache/spark/pull/10842
>
> https://github.com/apache/spark/pull/11005
>
> https://github.com/apache/spark/pull/11036
>
> https://github.com/apache/spark/pull/11129
>
> https://github.com/apache/spark/pull/11610
>
> https://github.com/apache/spark/pull/11729
>
> https://github.com/apache/spark/pull/11980
>
> https://github.com/apache/spark/pull/12075
>
>
> Thanks.
>
>
>


Re: SQLContext and "stable identifier required"

2016-05-03 Thread Ted Yu
Have you tried the following ?

scala> import spark.implicits._
import spark.implicits._

scala> spark
res0: org.apache.spark.sql.SparkSession =
org.apache.spark.sql.SparkSession@323d1fa2
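
If you need the implicits of a particular Dataset's SQLContext, assigning it to
a val first gives you a stable identifier again, e.g.:

val sqlc = someDataset.sqlContext  // a val is a stable identifier
import sqlc.implicits._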

Cheers

On Tue, May 3, 2016 at 9:16 AM, Koert Kuipers  wrote:

> with the introduction of SparkSession SQLContext changed from being a lazy
> val to a def.
> however this is troublesome if you want to do:
>
> import someDataset.sqlContext.implicits._
>
> because it is no longer a stable identifier, i think? i get:
> stable identifier required, but someDataset.sqlContext.implicits found.
>
> anyone else seen this?
>
>


Re: spark 2 segfault

2016-05-02 Thread Ted Yu
I plan to.

I am not that familiar with all the parts involved though :-)

On Mon, May 2, 2016 at 9:42 AM, Reynold Xin <r...@databricks.com> wrote:

> Definitely looks like a bug.
>
> Ted - are you looking at this?
>
>
> On Mon, May 2, 2016 at 7:15 AM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> Created issue:
>> https://issues.apache.org/jira/browse/SPARK-15062
>>
>> On Mon, May 2, 2016 at 6:48 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>>> I tried the same statement using Spark 1.6.1
>>> There was no error with default memory setting.
>>>
>>> Suggest logging a bug.
>>>
>>> On May 1, 2016, at 9:22 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>>
>>> Yeah I got that too, then I increased heap for tests to 8G to get error
>>> I showed earlier.
>>> On May 2, 2016 12:09 AM, "Ted Yu" <yuzhih...@gmail.com> wrote:
>>>
>>>> Using commit hash 90787de864b58a1079c23e6581381ca8ffe7685f and
>>>> Java 1.7.0_67 , I got:
>>>>
>>>> scala> val dfComplicated = sc.parallelize(List((Map("1" -> "a"),
>>>> List("b", "c")), (Map("2" -> "b"), List("d", "e".toDF
>>>> ...
>>>> dfComplicated: org.apache.spark.sql.DataFrame = [_1:
>>>> map<string,string>, _2: array<string>]
>>>>
>>>> scala> dfComplicated.show
>>>> java.lang.OutOfMemoryError: Java heap space
>>>>   at
>>>> org.apache.spark.unsafe.types.UTF8String.getBytes(UTF8String.java:229)
>>>>   at
>>>> org.apache.spark.unsafe.types.UTF8String.toString(UTF8String.java:821)
>>>>   at
>>>> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.apply(Unknown
>>>> Source)
>>>>   at
>>>> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.fromRow(ExpressionEncoder.scala:241)
>>>>   at
>>>> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1$$anonfun$apply$13.apply(Dataset.scala:2121)
>>>>   at
>>>> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1$$anonfun$apply$13.apply(Dataset.scala:2121)
>>>>   at
>>>> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>>>>   at
>>>> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>>>>   at
>>>> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>>>>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>>>>   at
>>>> scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>>>>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>>>>   at
>>>> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2121)
>>>>   at
>>>> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:54)
>>>>   at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2408)
>>>>   at org.apache.spark.sql.Dataset.org
>>>> $apache$spark$sql$Dataset$$execute$1(Dataset.scala:2120)
>>>>   at org.apache.spark.sql.Dataset.org
>>>> $apache$spark$sql$Dataset$$collect(Dataset.scala:2127)
>>>>   at
>>>> org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1861)
>>>>   at
>>>> org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1860)
>>>>   at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2438)
>>>>   at org.apache.spark.sql.Dataset.head(Dataset.scala:1860)
>>>>   at org.apache.spark.sql.Dataset.take(Dataset.scala:2077)
>>>>   at org.apache.spark.sql.Dataset.showString(Dataset.scala:238)
>>>>   at org.apache.spark.sql.Dataset.show(Dataset.scala:529)
>>>>   at org.apache.spark.sql.Dataset.show(Dataset.scala:489)
>>>>   at org.apache.spark.sql.Dataset.show(Dataset.scala:498)
>>>>   ... 6 elided
>>>>
>>>> scala>
>>>>
>>>> On Sun, May 1, 2016 at 8:34 PM, Koert Kuipers <ko...@tresata.com>
>>>> wrote:
>>>>
>>>>> by removing dependencies it turns into a different error, see below.
>>>>> the test is simply writing a DataFrame out to file and reading it back
>>>>> in. i see the error for all dat

Re: spark 2 segfault

2016-05-02 Thread Ted Yu
I tried the same statement using Spark 1.6.1
There was no error with the default memory setting.

Suggest logging a bug. 

> On May 1, 2016, at 9:22 PM, Koert Kuipers <ko...@tresata.com> wrote:
> 
> Yeah I got that too, then I increased heap for tests to 8G to get error I 
> showed earlier.
> 
>> On May 2, 2016 12:09 AM, "Ted Yu" <yuzhih...@gmail.com> wrote:
>> Using commit hash 90787de864b58a1079c23e6581381ca8ffe7685f and Java 1.7.0_67 
>> , I got:
>> 
>> scala> val dfComplicated = sc.parallelize(List((Map("1" -> "a"), List("b", 
>> "c")), (Map("2" -> "b"), List("d", "e".toDF
>> ...
>> dfComplicated: org.apache.spark.sql.DataFrame = [_1: map<string,string>, _2: 
>> array<string>]
>> 
>> scala> dfComplicated.show
>> java.lang.OutOfMemoryError: Java heap space
>>   at org.apache.spark.unsafe.types.UTF8String.getBytes(UTF8String.java:229)
>>   at org.apache.spark.unsafe.types.UTF8String.toString(UTF8String.java:821)
>>   at 
>> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.apply(Unknown
>>  Source)
>>   at 
>> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.fromRow(ExpressionEncoder.scala:241)
>>   at 
>> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1$$anonfun$apply$13.apply(Dataset.scala:2121)
>>   at 
>> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1$$anonfun$apply$13.apply(Dataset.scala:2121)
>>   at 
>> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>>   at 
>> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>>   at 
>> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>>   at 
>> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2121)
>>   at 
>> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:54)
>>   at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2408)
>>   at 
>> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2120)
>>   at 
>> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2127)
>>   at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1861)
>>   at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1860)
>>   at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2438)
>>   at org.apache.spark.sql.Dataset.head(Dataset.scala:1860)
>>   at org.apache.spark.sql.Dataset.take(Dataset.scala:2077)
>>   at org.apache.spark.sql.Dataset.showString(Dataset.scala:238)
>>   at org.apache.spark.sql.Dataset.show(Dataset.scala:529)
>>   at org.apache.spark.sql.Dataset.show(Dataset.scala:489)
>>   at org.apache.spark.sql.Dataset.show(Dataset.scala:498)
>>   ... 6 elided
>> 
>> scala>
>> 
>>> On Sun, May 1, 2016 at 8:34 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>> by removing dependencies it turns into a different error, see below.
>>> the test is simply writing a DataFrame out to file and reading it back in. 
>>> i see the error for all data sources (json, parquet, etc.).
>>> 
>>> this is the data that i write out and read back in:
>>> val dfComplicated = sc.parallelize(List((Map("1" -> "a"), List("b", "c")), 
>>> (Map("2" -> "b"), List("d", "e".toDF 
>>> 
>>> 
>>> [info]   java.lang.RuntimeException: Error while decoding: 
>>> java.lang.NegativeArraySizeException
>>> [info] createexternalrow(if (isnull(input[0, map<string,string>])) null 
>>> else staticinvoke(class 
>>> org.apache.spark.sql.catalyst.util.ArrayBasedMapData$, ObjectType(interface 
>>> scala.collection.Map), toScalaMap, staticinvoke(class 
>>> scala.collection.mutable.WrappedArray$, ObjectType(interface 
>>> scala.collection.Seq), make, 
>>> mapobjects(lambdavariable(MapObjects_loopValue16, MapObjects_loopIsNull17, 
>>> StringType), lambdavariable(MapObjects_loopValue16, 
>>> MapObjects_loopIsNull17, StringType).toString, input[0, 
>>> map<string,string>].keyArr

Re: Number of partitions for binaryFiles

2016-04-26 Thread Ted Yu
From what I understand, the Spark code was written this way so that you don't
end up with very small partitions.

In your case, look at the size of the cluster.
If 66 partitions can make good use of your cluster, it should be fine.

On Tue, Apr 26, 2016 at 2:27 PM, Ulanov, Alexander <alexander.ula...@hpe.com
> wrote:

> Hi Ted,
>
>
>
> I have 36 files of size ~600KB and the rest 74 are about 400KB.
>
>
>
> Is there a workaround rather than changing Spark's code?
>
>
>
> Best regards, Alexander
>
>
>
> *From:* Ted Yu [mailto:yuzhih...@gmail.com]
> *Sent:* Tuesday, April 26, 2016 1:22 PM
> *To:* Ulanov, Alexander <alexander.ula...@hpe.com>
> *Cc:* dev@spark.apache.org
> *Subject:* Re: Number of partitions for binaryFiles
>
>
>
> Here is the body of StreamFileInputFormat#setMinPartitions :
>
>
>
>   def setMinPartitions(context: JobContext, minPartitions: Int) {
>
> val totalLen =
> listStatus(context).asScala.filterNot(_.isDirectory).map(_.getLen).sum
>
> val maxSplitSize = math.ceil(totalLen / math.max(minPartitions,
> 1.0)).toLong
>
> super.setMaxSplitSize(maxSplitSize)
>
>
>
> I guess what happened was that among the 100 files you had, there were ~60
> files whose sizes were much bigger than the rest.
>
> According to the way max split size is computed above, you ended up with
> fewer partitions.
>
>
>
> I just performed a test using local directory where 3 files were
> significantly larger than the rest and reproduced what you observed.
>
>
>
> Cheers
>
>
>
> On Tue, Apr 26, 2016 at 11:10 AM, Ulanov, Alexander <
> alexander.ula...@hpe.com> wrote:
>
> Dear Spark developers,
>
>
>
> I have 100 binary files in local file system that I want to load into
> Spark RDD. I need the data from each file to be in a separate partition.
> However, I cannot make it happen:
>
>
>
> scala> sc.binaryFiles("/data/subset").partitions.size
>
> res5: Int = 66
>
>
>
> The “minPartitions” parameter does not seem to help:
>
> scala> sc.binaryFiles("/data/subset", minPartitions = 100).partitions.size
>
> res8: Int = 66
>
>
>
> At the same time, Spark produces the required number of partitions with
> sc.textFile (though I cannot use it because my files are binary):
>
> scala> sc.textFile("/data/subset").partitions.size
>
> res9: Int = 100
>
>
>
> Could you suggest how to force Spark to load binary files each in a
> separate partition?
>
>
>
> Best regards, Alexander
>
>
>


Re: Number of partitions for binaryFiles

2016-04-26 Thread Ted Yu
Here is the body of StreamFileInputFormat#setMinPartitions :

  def setMinPartitions(context: JobContext, minPartitions: Int) {
    val totalLen =
      listStatus(context).asScala.filterNot(_.isDirectory).map(_.getLen).sum
    val maxSplitSize =
      math.ceil(totalLen / math.max(minPartitions, 1.0)).toLong
    super.setMaxSplitSize(maxSplitSize)
  }

I guess what happened was that among the 100 files you had, there were ~60
files whose sizes were much bigger than the rest.
According to the way max split size is computed above, you ended up with
fewer partitions.

I just performed a test using local directory where 3 files were
significantly larger than the rest and reproduced what you observed.

Cheers
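
A small illustration of the arithmetic above, using the approximate file sizes
from the question quoted below (assumed, not measured):

// 36 files of ~600 KB and 74 of ~400 KB, as described below.
val sizes = Seq.fill(36)(600L * 1024) ++ Seq.fill(74)(400L * 1024)
val minPartitions = 100
val totalLen = sizes.sum                                          // 52,428,800 bytes
val maxSplitSize = math.ceil(totalLen / math.max(minPartitions, 1.0)).toLong
println(maxSplitSize)                                             // 524288, i.e. 512 KB
// A single ~400 KB file does not fill a 512 KB split, so CombineFileInputFormat
// packs neighbouring small files together and the split count drops below 100.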

On Tue, Apr 26, 2016 at 11:10 AM, Ulanov, Alexander <
alexander.ula...@hpe.com> wrote:

> Dear Spark developers,
>
>
>
> I have 100 binary files in local file system that I want to load into
> Spark RDD. I need the data from each file to be in a separate partition.
> However, I cannot make it happen:
>
>
>
> scala> sc.binaryFiles("/data/subset").partitions.size
>
> res5: Int = 66
>
>
>
> The “minPartitions” parameter does not seem to help:
>
> scala> sc.binaryFiles("/data/subset", minPartitions = 100).partitions.size
>
> res8: Int = 66
>
>
>
> At the same time, Spark produces the required number of partitions with
> sc.textFile (though I cannot use it because my files are binary):
>
> scala> sc.textFile("/data/subset").partitions.size
>
> res9: Int = 100
>
>
>
> Could you suggest how to force Spark to load binary files each in a
> separate partition?
>
>
>
> Best regards, Alexander
>


Re: Cache Shuffle Based Operation Before Sort

2016-04-25 Thread Ted Yu
Interesting.

bq. details of execution for 10 and 100 scale factor input

Looks like some chart (or image) didn't go through.

FYI

On Mon, Apr 25, 2016 at 12:50 PM, Ali Tootoonchian  wrote:

> Caching the shuffle RDD before the sort process improves system performance.
> The SQL planner could be made intelligent enough to cache the join, aggregate,
> or sort data frame before executing the next sort process.
>
> For any sort process, two jobs are created by Spark: the first one is
> responsible for producing the range boundaries for the shuffle partitions, and
> the second one completes the sort process by creating a new shuffle RDD.
>
> When the input of a sort process is the output of another shuffle process, the
> reduce part of the shuffle RDD is re-evaluated and the intermediate shuffle
> data is read twice. If the input shuffle RDD (exchange-based data frame) is
> saved, the sort process can be completed faster. Remember that Spark saves the
> RDD in Parquet format, which is usually compressed and smaller than the
> original data.
>
> Let’s look at an example.
> The following query is a modified version of q3 of the TPCH test bench.
> tpchQuery =
>"""
> |select *
> |from
> | customer,
> | orders,
> | lineitem
> |where
> | c_mktsegment = 'MACHINERY'
> | and c_custkey = o_custkey
> | and l_orderkey = o_orderkey
> | and o_orderdate < '1995-03-15'
> | and l_shipdate > '1995-03-15'
> |order by
> | o_orderdate
>""".stripMargin
>
> The query can be executed in one step using the current Spark SQL planner.
> The other approach is to execute this query in two steps:
> 1. Compute and cache the output of the join process.
> 2. Execute the order-by command.
> The following commands show how the second approach can be implemented:
>
> tpchQuery =
>   """
> |select *
> |from
> | customer,
> | orders,
> | lineitem
> |where
> | c_mktsegment = 'MACHINERY'
> | and c_custkey = o_custkey
> | and l_orderkey = o_orderkey
> | and o_orderdate < '1995-03-15'
> | and l_shipdate > '1995-03-15'
>   """.stripMargin
> val joinDf = sqlContext.sql(tpchQuery).cache
> val queryRes = joinDf.sort("o_orderdate")
>
> Let’s look at the details of execution for the 10 and 100 scale factor inputs.
>
>
> By comparing stages 4, 9, 10 and 15, 20, 21 of the two approaches, you can see
> that the amount of data read during the sort process can be reduced by a
> factor of 2.
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Cache-Shuffle-Based-Operation-Before-Sort-tp17331.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: RFC: Remote "HBaseTest" from examples?

2016-04-21 Thread Ted Yu
Zhan:
I have mentioned the JIRA numbers in the thread starting with (note the
typo in the subject of this thread):

RFC: Remove ...

On Thu, Apr 21, 2016 at 1:28 PM, Zhan Zhang  wrote:

> FYI: There are several pending patches for DataFrame support on top of
> HBase.
>
> Thanks.
>
> Zhan Zhang
>
> On Apr 20, 2016, at 2:43 AM, Saisai Shao  wrote:
>
> +1, HBaseTest in Spark Example is quite old and obsolete, the HBase
> connector in HBase repo has evolved a lot, it would be better to guide user
> to refer to that not here in Spark example. So good to remove it.
>
> Thanks
> Saisai
>
> On Wed, Apr 20, 2016 at 1:41 AM, Josh Rosen 
> wrote:
>
>> +1; I think that it's preferable for code examples, especially
>> third-party integration examples, to live outside of Spark.
>>
>> On Tue, Apr 19, 2016 at 10:29 AM Reynold Xin  wrote:
>>
>>> Yea in general I feel examples that bring in a large amount of
>>> dependencies should be outside Spark.
>>>
>>>
>>> On Tue, Apr 19, 2016 at 10:15 AM, Marcelo Vanzin 
>>> wrote:
>>>
 Hey all,

 Two reasons why I think we should remove that from the examples:

 - HBase now has Spark integration in its own repo, so that really
 should be the template for how to use HBase from Spark, making that
 example less useful, even misleading.

 - It brings up a lot of extra dependencies that make the size of the
 Spark distribution grow.

 Any reason why we shouldn't drop that example?

 --
 Marcelo

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org


>>>
>
>


Re: [Spark-SQL] Reduce Shuffle Data by pushing filter toward storage

2016-04-21 Thread Ted Yu
Interesting analysis. 

Can you log a JIRA?

> On Apr 21, 2016, at 11:07 AM, atootoonchian  wrote:
> 
> SQL query planner can have intelligence to push down filter commands towards
> the storage layer. If we optimize the query planner such that the IO to the
> storage is reduced at the cost of running multiple filters (i.e., compute),
> this should be desirable when the system is IO bound. An example to prove
> the case in point is below from TPCH test bench:
> 
> Let’s look at query q19 of TPCH test bench.
> select
>sum(l_extendedprice* (1 - l_discount)) as revenue
> from lineitem, part
> where
>  ( p_partkey = l_partkey
>and p_brand = 'Brand#12'
>and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG')
>and l_quantity >= 1 and l_quantity <= 1 + 10
>and p_size between 1 and 5
>and l_shipmode in ('AIR', 'AIR REG')
>and l_shipinstruct = 'DELIVER IN PERSON')
>  or
>  ( p_partkey = l_partkey
>and p_brand = 'Brand#23'
>and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK')
>and l_quantity >= 10 and l_quantity <= 10 + 10
>and p_size between 1 and 10
>and l_shipmode in ('AIR', 'AIR REG')
>and l_shipinstruct = 'DELIVER IN PERSON')
>  or
>  ( p_partkey = l_partkey
>and p_brand = 'Brand#34'
>and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG')
>and l_quantity >= 20 and l_quantity <= 20 + 10
>and p_size between 1 and 15
>and l_shipmode in ('AIR', 'AIR REG')
>and l_shipinstruct = 'DELIVER IN PERSON')
> 
> The latest version of Spark creates the following plan (shown here in a more
> readable form) to execute q19.
> Aggregate [(sum(cast((l_extendedprice * (1.0 - l_discount))
>  Project [l_extendedprice,l_discount]
>Join Inner, Some(((p_partkey = l_partkey) && 
> ((
>   (p_brand = Brand#12) && 
>p_container IN (SM CASE,SM BOX,SM PACK,SM PKG)) && 
>   (l_quantity >= 1.0)) && (l_quantity <= 11.0)) && 
>   (p_size <= 5)) || 
> (p_brand = Brand#23) && 
> p_container IN (MED BAG,MED BOX,MED PKG,MED PACK)) && 
>(l_quantity >= 10.0)) && (l_quantity <= 20.0)) && 
>(p_size <= 10))) || 
> (p_brand = Brand#34) && 
> p_container IN (LG CASE,LG BOX,LG PACK,LG PKG)) && 
>(l_quantity >= 20.0)) && (l_quantity <= 30.0)) && 
>(p_size <= 15)
>  Project [l_partkey, l_quantity, l_extendedprice, l_discount]
>Filter ((isnotnull(l_partkey) && 
>(isnotnull(l_shipinstruct) && 
>(l_shipmode IN (AIR,AIR REG) && 
>(l_shipinstruct = DELIVER IN PERSON
>  LogicalRDD [l_orderkey, l_partkey, l_suppkey, l_linenumber,
> l_quantity, l_extendedprice, l_discount, l_tax, l_returnflag, l_linestatus,
> l_shipdate, l_commitdate, l_receiptdate, l_shipinstruct, l_shipmode,
> l_comment], MapPartitionsRDD[316] 
>  Project [p_partkey, p_brand, p_size, p_container]
>Filter ((isnotnull(p_partkey) && 
>(isnotnull(p_size) && 
>(cast(cast(p_size as decimal(20,0)) as int) >= 1)))
>  LogicalRDD [p_partkey, p_name, p_mfgr, p_brand, p_type, p_size,
> p_container, p_retailprice, p_comment], MapPartitionsRDD[314]
> 
> As you can see, only three filter commands are pushed down before the join is
> executed.
>  l_shipmode IN (AIR,AIR REG)
>  l_shipinstruct = DELIVER IN PERSON
>  (cast(cast(p_size as decimal(20,0)) as int) >= 1)
> 
> And the following filters are applied during the join process
>  p_brand = Brand#12
>  p_container IN (SM CASE,SM BOX,SM PACK,SM PKG) 
>  l_quantity >= 1.0 && l_quantity <= 11.0 
>  p_size <= 5  
>  p_brand = Brand#23 
>  p_container IN (MED BAG,MED BOX,MED PKG,MED PACK) 
>  l_quantity >= 10.0 && l_quantity <= 20.0 
>  p_size <= 10 
>  p_brand = Brand#34 
>  p_container IN (LG CASE,LG BOX,LG PACK,LG PKG) 
>  l_quantity >= 20.0 && l_quantity <= 30.0
>  p_size <= 15
> 
> Let’s look at the following sequence of SQL commands, which produces the same
> result.
> val partDfFilter = sqlContext.sql("""
>|select p_brand, p_partkey from part 
>|where
>| (p_brand = 'Brand#12'
>|   and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG')
>|   and p_size between 1 and 5)
>| or
>| (p_brand = 'Brand#23'
>|   and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK')
>|   and p_size between 1 and 10)
>| or
>| (p_brand = 'Brand#34'
>|   and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG')
>|   and p_size between 1 and 15)
>   """.stripMargin)
> 
> val itemLineDfFilter = sqlContext.sql("""
>|select 
>| l_partkey, l_quantity, l_extendedprice, l_discount from lineitem
>|where
>| (l_quantity >= 1 and l_quantity <= 30
>|   and l_shipmode in ('AIR', 'AIR REG')
>|   and l_shipinstruct = 'DELIVER IN PERSON')
>  """.stripMargin)
> 
> 

Re: Improving system design logging in spark

2016-04-20 Thread Ted Yu
Interesting.

For #3:

bq. reading data from,

I guess you meant reading from disk.
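
(Not part of the original exchange: for context, a sketch of what is already
observable per task through a SparkListener, assuming Spark 2.x where
taskMetrics.shuffleReadMetrics is non-optional; the proposal quoted below asks
for finer-grained timings than these.)

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

val listener = new SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics          // may be null for some failed tasks
    if (m != null) {
      val sr = m.shuffleReadMetrics
      println(s"stage ${taskEnd.stageId}: shuffle localBytes=${sr.localBytesRead}, " +
        s"remoteBytes=${sr.remoteBytesRead}, fetchWait=${sr.fetchWaitTime} ms")
    }
  }
}
sc.addSparkListener(listener)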

On Wed, Apr 20, 2016 at 10:45 AM, atootoonchian  wrote:

> The current Spark logging mechanism can be improved by adding the following
> parameters. They will help in understanding system bottlenecks and provide
> useful guidelines for Spark application developers to design an optimized
> application.
>
> 1. Shuffle Read Local Time: Time for a task to read shuffle data from local
> storage.
> 2. Shuffle Read Remote Time: Time for a  task to read shuffle data from
> remote node.
> 3. Distribution processing time between computation, I/O, network: Show
> distribution of processing time of each task between computation, reading
> data from, and reading data from network.
> 4. Average I/O bandwidth: Average time of I/O throughput for each task when
> it fetches data from disk.
> 5. Average Network bandwidth: Average network throughput for each task when
> it fetches data from remote nodes.
>
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Improving-system-design-logging-in-spark-tp17291.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: RFC: Remove "HBaseTest" from examples?

2016-04-19 Thread Ted Yu
Clarification: in my previous email, I was not talking
about the spark-streaming-flume or spark-streaming-kafka artifacts.

I was talking about the examples for these projects, such
as examples/src/main/python/streaming/flume_wordcount.py

On Tue, Apr 19, 2016 at 11:10 AM, Marcelo Vanzin <van...@cloudera.com>
wrote:

> On Tue, Apr 19, 2016 at 11:07 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> The same question can be asked w.r.t. examples for other projects, such
>> as flume and kafka.
>>
>
> The main difference being that flume and kafka integration are part of
> Spark itself. HBase integration is not.
>
>
>
>> On Tue, Apr 19, 2016 at 11:01 AM, Marcin Tustin <mtus...@handybook.com>
>> wrote:
>>
>>> Let's posit that the spark example is much better than what is available
>>> in HBase. Why is that a reason to keep it within Spark?
>>>
>>> On Tue, Apr 19, 2016 at 1:59 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>
>>>> bq. HBase's current support, even if there are bugs or things that
>>>> still need to be done, is much better than the Spark example
>>>>
>>>> In my opinion, a simple example that works is better than a buggy
>>>> package.
>>>>
>>>> I hope before long the hbase-spark module in HBase can arrive at a
>>>> state which we can advertise as mature - but we're not there yet.
>>>>
>>>> On Tue, Apr 19, 2016 at 10:50 AM, Marcelo Vanzin <van...@cloudera.com>
>>>> wrote:
>>>>
>>>>> You're completely missing my point. I'm saying that HBase's current
>>>>> support, even if there are bugs or things that still need to be done,
>>>>> is much better than the Spark example, which is basically a call to
>>>>> "SparkContext.hadoopRDD".
>>>>>
>>>>> Spark's example is not helpful in learning how to build an HBase
>>>>> application on Spark, and clashes head on with how the HBase
>>>>> developers think it should be done. That, and because it brings too
>>>>> many dependencies for something that is not really useful, is why I'm
>>>>> suggesting removing it.
>>>>>
>>>>>
>>>>> On Tue, Apr 19, 2016 at 10:47 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>> > There is an Open JIRA for fixing the documentation: HBASE-15473
>>>>> >
>>>>> > I would say the refguide link you provided should not be considered
>>>>> as
>>>>> > complete.
>>>>> >
>>>>> > Note it is marked as Blocker by Sean B.
>>>>> >
>>>>> > On Tue, Apr 19, 2016 at 10:43 AM, Marcelo Vanzin <
>>>>> van...@cloudera.com>
>>>>> > wrote:
>>>>> >>
>>>>> >> You're entitled to your own opinions.
>>>>> >>
>>>>> >> While you're at it, here's some much better documentation, from the
>>>>> >> HBase project themselves, than what the Spark example provides:
>>>>> >> http://hbase.apache.org/book.html#spark
>>>>> >>
>>>>> >> On Tue, Apr 19, 2016 at 10:41 AM, Ted Yu <yuzhih...@gmail.com>
>>>>> wrote:
>>>>> >> > bq. it's actually in use right now in spite of not being in any
>>>>> upstream
>>>>> >> > HBase release
>>>>> >> >
>>>>> >> > If it is not in upstream, then it is not relevant for discussion
>>>>> on
>>>>> >> > Apache
>>>>> >> > mailing list.
>>>>> >> >
>>>>> >> > On Tue, Apr 19, 2016 at 10:38 AM, Marcelo Vanzin <
>>>>> van...@cloudera.com>
>>>>> >> > wrote:
>>>>> >> >>
>>>>> >> >> Alright, if you prefer, I'll say "it's actually in use right now
>>>>> in
>>>>> >> >> spite of not being in any upstream HBase release", and it's more
>>>>> >> >> useful than a single example file in the Spark repo for those who
>>>>> >> >> really want to integrate with HBase.
>>>>> >> >>
>>>>> >> >> Spark's example is really very trivial (just uses one of HBase's
>>>>> input
>>>>> >> >> formats), which makes it not v

Re: RFC: Remove "HBaseTest" from examples?

2016-04-19 Thread Ted Yu
The same question can be asked w.r.t. examples for other projects,
such as flume
and kafka.

On Tue, Apr 19, 2016 at 11:01 AM, Marcin Tustin <mtus...@handybook.com>
wrote:

> Let's posit that the spark example is much better than what is available
> in HBase. Why is that a reason to keep it within Spark?
>
> On Tue, Apr 19, 2016 at 1:59 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> bq. HBase's current support, even if there are bugs or things that still
>> need to be done, is much better than the Spark example
>>
>> In my opinion, a simple example that works is better than a buggy package.
>>
>> I hope before long the hbase-spark module in HBase can arrive at a state
>> which we can advertise as mature - but we're not there yet.
>>
>> On Tue, Apr 19, 2016 at 10:50 AM, Marcelo Vanzin <van...@cloudera.com>
>> wrote:
>>
>>> You're completely missing my point. I'm saying that HBase's current
>>> support, even if there are bugs or things that still need to be done,
>>> is much better than the Spark example, which is basically a call to
>>> "SparkContext.hadoopRDD".
>>>
>>> Spark's example is not helpful in learning how to build an HBase
>>> application on Spark, and clashes head on with how the HBase
>>> developers think it should be done. That, and because it brings too
>>> many dependencies for something that is not really useful, is why I'm
>>> suggesting removing it.
>>>
>>>
>>> On Tue, Apr 19, 2016 at 10:47 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>>> > There is an Open JIRA for fixing the documentation: HBASE-15473
>>> >
>>> > I would say the refguide link you provided should not be considered as
>>> > complete.
>>> >
>>> > Note it is marked as Blocker by Sean B.
>>> >
>>> > On Tue, Apr 19, 2016 at 10:43 AM, Marcelo Vanzin <van...@cloudera.com>
>>> > wrote:
>>> >>
>>> >> You're entitled to your own opinions.
>>> >>
>>> >> While you're at it, here's some much better documentation, from the
>>> >> HBase project themselves, than what the Spark example provides:
>>> >> http://hbase.apache.org/book.html#spark
>>> >>
>>> >> On Tue, Apr 19, 2016 at 10:41 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>>> >> > bq. it's actually in use right now in spite of not being in any
>>> upstream
>>> >> > HBase release
>>> >> >
>>> >> > If it is not in upstream, then it is not relevant for discussion on
>>> >> > Apache
>>> >> > mailing list.
>>> >> >
>>> >> > On Tue, Apr 19, 2016 at 10:38 AM, Marcelo Vanzin <
>>> van...@cloudera.com>
>>> >> > wrote:
>>> >> >>
>>> >> >> Alright, if you prefer, I'll say "it's actually in use right now in
>>> >> >> spite of not being in any upstream HBase release", and it's more
>>> >> >> useful than a single example file in the Spark repo for those who
>>> >> >> really want to integrate with HBase.
>>> >> >>
>>> >> >> Spark's example is really very trivial (just uses one of HBase's
>>> input
>>> >> >> formats), which makes it not very useful as a blueprint for
>>> developing
>>> >> >> HBase apps with Spark.
>>> >> >>
>>> >> >> On Tue, Apr 19, 2016 at 10:28 AM, Ted Yu <yuzhih...@gmail.com>
>>> wrote:
>>> >> >> > bq. I wouldn't call it "incomplete".
>>> >> >> >
>>> >> >> > I would call it incomplete.
>>> >> >> >
>>> >> >> > Please see HBASE-15333 'Enhance the filter to handle short,
>>> integer,
>>> >> >> > long,
>>> >> >> > float and double' which is a bug fix.
>>> >> >> >
>>> >> >> > Please exclude presence of related of module in vendor distro
>>> from
>>> >> >> > this
>>> >> >> > discussion.
>>> >> >> >
>>> >> >> > Thanks
>>> >> >> >
>>> >> >> > On Tue, Apr 19, 2016 at 10:23 AM, Marcelo Vanzin
>>> >> >> > <van...@cloudera.com>
>>> >> >> > wrote:
>>> >> >> >>
>>> >> >> >> On Tue, Apr 19, 2016 at 10:20 AM, Ted Yu <yuzhih...@gmail.com>
>>> >> >> >> wrote:
>>> >> >> >> > I want to note that the hbase-spark module in HBase is
>>> incomplete.
>>> >> >> >> > Zhan
>>> >> >> >> > has
>>> >> >> >> > several patches pending review.
>>> >> >> >>
>>> >> >> >> I wouldn't call it "incomplete". Lots of functionality is there,
>>> >> >> >> which
>>> >> >> >> doesn't mean new ones, or more efficient implementations of
>>> existing
>>> >> >> >> ones, can't be added.
>>> >> >> >>
>>> >> >> >> > hbase-spark module is currently only in master branch which
>>> would
>>> >> >> >> > be
>>> >> >> >> > released as 2.0
>>> >> >> >>
>>> >> >> >> Just as a side note, it's part of CDH 5.7.0, not that it matters
>>> >> >> >> much
>>> >> >> >> for upstream HBase.
>>> >> >> >>
>>> >> >> >> --
>>> >> >> >> Marcelo
>>> >> >> >
>>> >> >> >
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >> --
>>> >> >> Marcelo
>>> >> >
>>> >> >
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Marcelo
>>> >
>>> >
>>>
>>>
>>>
>>> --
>>> Marcelo
>>>
>>
>>
>
> Want to work at Handy? Check out our culture deck and open roles
> <http://www.handy.com/careers>
> Latest news <http://www.handy.com/press> at Handy
> Handy just raised $50m
> <http://venturebeat.com/2015/11/02/on-demand-home-service-handy-raises-50m-in-round-led-by-fidelity/>
>  led
> by Fidelity
>
>


Re: RFC: Remove "HBaseTest" from examples?

2016-04-19 Thread Ted Yu
bq. HBase's current support, even if there are bugs or things that still
need to be done, is much better than the Spark example

In my opinion, a simple example that works is better than a buggy package.

I hope before long the hbase-spark module in HBase can arrive at a state
which we can advertise as mature - but we're not there yet.

On Tue, Apr 19, 2016 at 10:50 AM, Marcelo Vanzin <van...@cloudera.com>
wrote:

> You're completely missing my point. I'm saying that HBase's current
> support, even if there are bugs or things that still need to be done,
> is much better than the Spark example, which is basically a call to
> "SparkContext.hadoopRDD".
>
> Spark's example is not helpful in learning how to build an HBase
> application on Spark, and clashes head on with how the HBase
> developers think it should be done. That, and because it brings too
> many dependencies for something that is not really useful, is why I'm
> suggesting removing it.
>
>
> On Tue, Apr 19, 2016 at 10:47 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> > There is an Open JIRA for fixing the documentation: HBASE-15473
> >
> > I would say the refguide link you provided should not be considered as
> > complete.
> >
> > Note it is marked as Blocker by Sean B.
> >
> > On Tue, Apr 19, 2016 at 10:43 AM, Marcelo Vanzin <van...@cloudera.com>
> > wrote:
> >>
> >> You're entitled to your own opinions.
> >>
> >> While you're at it, here's some much better documentation, from the
> >> HBase project themselves, than what the Spark example provides:
> >> http://hbase.apache.org/book.html#spark
> >>
> >> On Tue, Apr 19, 2016 at 10:41 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> >> > bq. it's actually in use right now in spite of not being in any
> upstream
> >> > HBase release
> >> >
> >> > If it is not in upstream, then it is not relevant for discussion on
> >> > Apache
> >> > mailing list.
> >> >
> >> > On Tue, Apr 19, 2016 at 10:38 AM, Marcelo Vanzin <van...@cloudera.com
> >
> >> > wrote:
> >> >>
> >> >> Alright, if you prefer, I'll say "it's actually in use right now in
> >> >> spite of not being in any upstream HBase release", and it's more
> >> >> useful than a single example file in the Spark repo for those who
> >> >> really want to integrate with HBase.
> >> >>
> >> >> Spark's example is really very trivial (just uses one of HBase's
> input
> >> >> formats), which makes it not very useful as a blueprint for
> developing
> >> >> HBase apps with Spark.
> >> >>
> >> >> On Tue, Apr 19, 2016 at 10:28 AM, Ted Yu <yuzhih...@gmail.com>
> wrote:
> >> >> > bq. I wouldn't call it "incomplete".
> >> >> >
> >> >> > I would call it incomplete.
> >> >> >
> >> >> > Please see HBASE-15333 'Enhance the filter to handle short,
> integer,
> >> >> > long,
> >> >> > float and double' which is a bug fix.
> >> >> >
> >> >> > Please exclude presence of related of module in vendor distro from
> >> >> > this
> >> >> > discussion.
> >> >> >
> >> >> > Thanks
> >> >> >
> >> >> > On Tue, Apr 19, 2016 at 10:23 AM, Marcelo Vanzin
> >> >> > <van...@cloudera.com>
> >> >> > wrote:
> >> >> >>
> >> >> >> On Tue, Apr 19, 2016 at 10:20 AM, Ted Yu <yuzhih...@gmail.com>
> >> >> >> wrote:
> >> >> >> > I want to note that the hbase-spark module in HBase is
> incomplete.
> >> >> >> > Zhan
> >> >> >> > has
> >> >> >> > several patches pending review.
> >> >> >>
> >> >> >> I wouldn't call it "incomplete". Lots of functionality is there,
> >> >> >> which
> >> >> >> doesn't mean new ones, or more efficient implementations of
> existing
> >> >> >> ones, can't be added.
> >> >> >>
> >> >> >> > hbase-spark module is currently only in master branch which
> would
> >> >> >> > be
> >> >> >> > released as 2.0
> >> >> >>
> >> >> >> Just as a side note, it's part of CDH 5.7.0, not that it matters
> >> >> >> much
> >> >> >> for upstream HBase.
> >> >> >>
> >> >> >> --
> >> >> >> Marcelo
> >> >> >
> >> >> >
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Marcelo
> >> >
> >> >
> >>
> >>
> >>
> >> --
> >> Marcelo
> >
> >
>
>
>
> --
> Marcelo
>


Re: RFC: Remove "HBaseTest" from examples?

2016-04-19 Thread Ted Yu
'bq.' is used in JIRA to quote what other people have said.

On Tue, Apr 19, 2016 at 10:42 AM, Reynold Xin <r...@databricks.com> wrote:

> Ted - what's the "bq" thing? Are you using some 3rd party (e.g. Atlassian)
> syntax? They are not being rendered in email.
>
>
> On Tue, Apr 19, 2016 at 10:41 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> bq. it's actually in use right now in spite of not being in any upstream
>> HBase release
>>
>> If it is not in upstream, then it is not relevant for discussion on
>> Apache mailing list.
>>
>> On Tue, Apr 19, 2016 at 10:38 AM, Marcelo Vanzin <van...@cloudera.com>
>> wrote:
>>
>>> Alright, if you prefer, I'll say "it's actually in use right now in
>>> spite of not being in any upstream HBase release", and it's more
>>> useful than a single example file in the Spark repo for those who
>>> really want to integrate with HBase.
>>>
>>> Spark's example is really very trivial (just uses one of HBase's input
>>> formats), which makes it not very useful as a blueprint for developing
>>> HBase apps with Spark.
>>>
>>> On Tue, Apr 19, 2016 at 10:28 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>>> > bq. I wouldn't call it "incomplete".
>>> >
>>> > I would call it incomplete.
>>> >
>>> > Please see HBASE-15333 'Enhance the filter to handle short, integer,
>>> long,
>>> > float and double' which is a bug fix.
>>> >
>>> > Please exclude presence of related of module in vendor distro from this
>>> > discussion.
>>> >
>>> > Thanks
>>> >
>>> > On Tue, Apr 19, 2016 at 10:23 AM, Marcelo Vanzin <van...@cloudera.com>
>>> > wrote:
>>> >>
>>> >> On Tue, Apr 19, 2016 at 10:20 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>>> >> > I want to note that the hbase-spark module in HBase is incomplete.
>>> Zhan
>>> >> > has
>>> >> > several patches pending review.
>>> >>
>>> >> I wouldn't call it "incomplete". Lots of functionality is there, which
>>> >> doesn't mean new ones, or more efficient implementations of existing
>>> >> ones, can't be added.
>>> >>
>>> >> > hbase-spark module is currently only in master branch which would be
>>> >> > released as 2.0
>>> >>
>>> >> Just as a side note, it's part of CDH 5.7.0, not that it matters much
>>> >> for upstream HBase.
>>> >>
>>> >> --
>>> >> Marcelo
>>> >
>>> >
>>>
>>>
>>>
>>> --
>>> Marcelo
>>>
>>
>>
>


Re: RFC: Remove "HBaseTest" from examples?

2016-04-19 Thread Ted Yu
bq. it's actually in use right now in spite of not being in any upstream
HBase release

If it is not in upstream, then it is not relevant for discussion on Apache
mailing list.

On Tue, Apr 19, 2016 at 10:38 AM, Marcelo Vanzin <van...@cloudera.com>
wrote:

> Alright, if you prefer, I'll say "it's actually in use right now in
> spite of not being in any upstream HBase release", and it's more
> useful than a single example file in the Spark repo for those who
> really want to integrate with HBase.
>
> Spark's example is really very trivial (just uses one of HBase's input
> formats), which makes it not very useful as a blueprint for developing
> HBase apps with Spark.
>
> On Tue, Apr 19, 2016 at 10:28 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> > bq. I wouldn't call it "incomplete".
> >
> > I would call it incomplete.
> >
> > Please see HBASE-15333 'Enhance the filter to handle short, integer,
> long,
> > float and double' which is a bug fix.
> >
> > Please exclude presence of related of module in vendor distro from this
> > discussion.
> >
> > Thanks
> >
> > On Tue, Apr 19, 2016 at 10:23 AM, Marcelo Vanzin <van...@cloudera.com>
> > wrote:
> >>
> >> On Tue, Apr 19, 2016 at 10:20 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> >> > I want to note that the hbase-spark module in HBase is incomplete.
> Zhan
> >> > has
> >> > several patches pending review.
> >>
> >> I wouldn't call it "incomplete". Lots of functionality is there, which
> >> doesn't mean new ones, or more efficient implementations of existing
> >> ones, can't be added.
> >>
> >> > hbase-spark module is currently only in master branch which would be
> >> > released as 2.0
> >>
> >> Just as a side note, it's part of CDH 5.7.0, not that it matters much
> >> for upstream HBase.
> >>
> >> --
> >> Marcelo
> >
> >
>
>
>
> --
> Marcelo
>


Re: RFC: Remove "HBaseTest" from examples?

2016-04-19 Thread Ted Yu
bq. create a separate tarball for them

Probably another thread can be started for the above.
I am fine with it.

On Tue, Apr 19, 2016 at 10:34 AM, Marcelo Vanzin 
wrote:

> On Tue, Apr 19, 2016 at 10:28 AM, Reynold Xin  wrote:
> > Yea in general I feel examples that bring in a large amount of
> dependencies
> > should be outside Spark.
>
> Another option to avoid the dependency problem is to not ship examples
> in the distribution, and maybe create a separate tarball for them;
> removing HBaseTest only solves one of the dependency problems. Since
> we have examples for flume and kafka, for example, the Spark
> distribution ends up shipping flume and kafka jars (and a bunch of
> other things).
>
> > On Tue, Apr 19, 2016 at 10:15 AM, Marcelo Vanzin 
> > wrote:
> >>
> >> Hey all,
> >>
> >> Two reasons why I think we should remove that from the examples:
> >>
> >> - HBase now has Spark integration in its own repo, so that really
> >> should be the template for how to use HBase from Spark, making that
> >> example less useful, even misleading.
> >>
> >> - It brings up a lot of extra dependencies that make the size of the
> >> Spark distribution grow.
> >>
> >> Any reason why we shouldn't drop that example?
>
>
>
> --
> Marcelo
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: RFC: Remove "HBaseTest" from examples?

2016-04-19 Thread Ted Yu
bq. I wouldn't call it "incomplete".

I would call it incomplete.

Please see HBASE-15333 'Enhance the filter to handle short, integer, long,
float and double' which is a bug fix.

Please exclude the presence of the related module in a vendor distro from this
discussion.

Thanks

On Tue, Apr 19, 2016 at 10:23 AM, Marcelo Vanzin <van...@cloudera.com>
wrote:

> On Tue, Apr 19, 2016 at 10:20 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> > I want to note that the hbase-spark module in HBase is incomplete. Zhan
> has
> > several patches pending review.
>
> I wouldn't call it "incomplete". Lots of functionality is there, which
> doesn't mean new ones, or more efficient implementations of existing
> ones, can't be added.
>
> > hbase-spark module is currently only in master branch which would be
> > released as 2.0
>
> Just as a side note, it's part of CDH 5.7.0, not that it matters much
> for upstream HBase.
>
> --
> Marcelo
>


Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Ted Yu
bq. there should be more committers or they are asked to be more active.

Bingo.

bq. they can't be closed only because it is "expired" with a copy and
pasted message.

+1

On Mon, Apr 18, 2016 at 9:14 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote:

> I don't think asking committers to be more active is impractical. I am
> not too sure if other projects apply the same rules here but
>
> I think if a project is being more popular, then I think it is appropriate
> that there should be more committers or they are asked to be more active.
>
>
> In addition, I believe there are a lot of PRs waiting for committer's
> comments.
>
>
> If committers are too busy to review a PR, then I think they better ask
> authors to provide the evidence to decide, maybe with a message such as
>
> "I am currently too busy to review or decide. Could you please add some 
> evidence/benchmark/performance
> test or survey for demands?"
>
>
> If the evidence is not enough or not easy to see, then they can ask to
> simplify the evidence or make a proper conclusion, maybe with a message
> such as
>
> "I think the evidence is not enough/trustable because  Could you
> please simplify/provide some more evidence?".
>
>
>
> Or, I think they can be manually closed with a explicit message such as
>
> "This is closed for now because we are not sure for this patch because.."
>
>
> I think they can't be closed only because it is "expired" with a copy and
> pasted message.
>
>
>
> 2016-04-19 12:46 GMT+09:00 Nicholas Chammas <nicholas.cham...@gmail.com>:
>
>> Relevant: https://github.com/databricks/spark-pr-dashboard/issues/1
>>
>> A lot of this was discussed a while back when the PR Dashboard was first
>> introduced, and several times before and after that as well. (e.g. August
>> 2014
>> <http://apache-spark-developers-list.1001551.n3.nabble.com/Handling-stale-PRs-td8015.html>
>> )
>>
>> If there is not enough momentum to build the tooling that people are
>> discussing here, then perhaps Reynold's suggestion is the most practical
>> one that is likely to see the light of day.
>>
>> I think asking committers to be more active in commenting on PRs is
>> theoretically the correct thing to do, but impractical. I'm not a
>> committer, but I would guess that most of them are already way
>> overcommitted (ha!) and asking them to do more just won't yield results.
>>
>> We've had several instances in the past where we all tried to rally
>> <https://mail-archives.apache.org/mod_mbox/spark-dev/201412.mbox/%3ccaohmdzer4cg_wxgktoxsg8s34krqezygjfzdoymgu9vhyjb...@mail.gmail.com%3E>
>> and be more proactive about giving feedback, closing PRs, and nudging
>> contributors who have gone silent. My observation is that the level of
>> energy required to "properly" curate PR activity in that way is simply not
>> sustainable. People can do it for a few weeks and then things revert to the
>> way they are now.
>>
>> Perhaps the missing link that would make this sustainable is better
>> tooling. If you think so and can sling some Javascript, you might want
>> to contribute to the PR Dashboard <https://spark-prs.appspot.com/>.
>>
>> Perhaps the missing link is something else: A different PR review
>> process; more committers; a higher barrier to contributing; a combination
>> thereof; etc...
>>
>> Also relevant: http://danluu.com/discourage-oss/
>>
>> By the way, some people noted that closing PRs may discourage
>> contributors. I think our open PR count alone is very discouraging. Under
>> what circumstances would you feel encouraged to open a PR against a project
>> that has hundreds of open PRs, some from many, many months ago
>> <https://github.com/apache/spark/pulls?q=is%3Apr+is%3Aopen+sort%3Aupdated-asc>
>> ?
>>
>> Nick
>>
>>
>> 2016년 4월 18일 (월) 오후 10:30, Ted Yu <yuzhih...@gmail.com>님이 작성:
>>
>>> During the months of November / December, the 30 day period should be
>>> relaxed.
>>>
>>> Some people(at least in US) may take extended vacation during that time.
>>>
>>> For Chinese developers, Spring Festival would bear similar circumstance.
>>>
>>> On Mon, Apr 18, 2016 at 7:25 PM, Hyukjin Kwon <gurwls...@gmail.com>
>>> wrote:
>>>
>>>> I also think this might not have to be closed only because it is
>>>> inactive.
>>>>
>>>>
>>>> How about closing issues after 30 days when a committer's comment is
>>>> added a

Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Ted Yu
During the months of November / December, the 30 day period should be
relaxed.

Some people (at least in the US) may take extended vacations during that time.

For Chinese developers, the Spring Festival would be a similar circumstance.

On Mon, Apr 18, 2016 at 7:25 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote:

> I also think this might not have to be closed only because it is inactive.
>
>
> How about closing issues after 30 days when a committer's comment is added
> at the last without responses from the author?
>
>
> IMHO, If the committers are not sure whether the patch would be useful,
> then I think they should leave some comments why they are not sure, not
> just ignoring.
>
> Or, simply they could ask the author to prove that the patch is useful or
> safe with some references and tests.
>
>
> I think it might be nicer than that users are supposed to keep pinging.
> **Personally**, apparently, I am sometimes a bit worried if pinging
> multiple times can be a bit annoying.
>
>
>
> 2016-04-19 9:56 GMT+09:00 Saisai Shao <sai.sai.s...@gmail.com>:
>
>> It would be better to have a specific technical reason why this PR should
>> be closed, either the implementation is not good or the problem is not
>> valid, or something else. That will actually help the contributor to shape
>> their codes and reopen the PR again. Otherwise reasons like "feel free
>> to reopen for so-and-so reason" is actually discouraging and no difference
>> than directly close the PR.
>>
>> Just my two cents.
>>
>> Thanks
>> Jerry
>>
>>
>> On Tue, Apr 19, 2016 at 4:52 AM, Sean Busbey <bus...@cloudera.com> wrote:
>>
>>> Having a PR closed, especially if due to committers not having hte
>>> bandwidth to check on things, will be very discouraging to new folks.
>>> Doubly so for those inexperienced with opensource. Even if the message
>>> says "feel free to reopen for so-and-so reason", new folks who lack
>>> confidence are going to see reopening as "pestering" and busy folks
>>> are going to see it as a clear indication that their work is not even
>>> valuable enough for a human to give a reason for closing. In either
>>> case, the cost of reopening is substantially higher than that button
>>> press.
>>>
>>> How about we start by keeping a report of "at-risk" PRs that have been
>>> stale for 30 days to make it easier for committers to look at the prs
>>> that have been long inactive?
>>>
>>>
>>> On Mon, Apr 18, 2016 at 2:52 PM, Reynold Xin <r...@databricks.com>
>>> wrote:
>>> > The cost of "reopen" is close to zero, because it is just clicking a
>>> button.
>>> > I think you were referring to the cost of closing the pull request,
>>> and you
>>> > are assuming people look at the pull requests that have been inactive
>>> for a
>>> > long time. That seems equally likely (or unlikely) as committers
>>> looking at
>>> > the recently closed pull requests.
>>> >
>>> > In either case, most pull requests are scanned through by us when they
>>> are
>>> > first open, and if they are important enough, usually they get merged
>>> > quickly or a target version is set in JIRA. We can definitely improve
>>> that
>>> > by making it more explicit.
>>> >
>>> >
>>> >
>>> > On Mon, Apr 18, 2016 at 12:46 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>> >>
>>> >> From committers' perspective, would they look at closed PRs ?
>>> >>
>>> >> If not, the cost is not close to zero.
>>> >> Meaning, some potentially useful PRs would never see the light of day.
>>> >>
>>> >> My two cents.
>>> >>
>>> >> On Mon, Apr 18, 2016 at 12:43 PM, Reynold Xin <r...@databricks.com>
>>> wrote:
>>> >>>
>>> >>> Part of it is how difficult it is to automate this. We can build a
>>> >>> perfect engine with a lot of rules that understand everything. But
>>> the more
>>> >>> complicated rules we need, the more unlikely for any of these to
>>> happen. So
>>> >>> I'd rather do this and create a nice enough message to tell
>>> contributors
>>> >>> sometimes mistake happen but the cost to reopen is approximately
>>> zero (i.e.
>>> >>> click a button on the pull request).
>>> >>>

Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Ted Yu
From committers' perspective, would they look at closed PRs?

If not, the cost is not close to zero.
Meaning, some potentially useful PRs would never see the light of day.

My two cents.

On Mon, Apr 18, 2016 at 12:43 PM, Reynold Xin <r...@databricks.com> wrote:

> Part of it is how difficult it is to automate this. We can build a perfect
> engine with a lot of rules that understand everything. But the more
> complicated rules we need, the more unlikely for any of these to happen. So
> I'd rather do this and create a nice enough message to tell contributors
> sometimes mistake happen but the cost to reopen is approximately zero (i.e.
> click a button on the pull request).
>
>
> On Mon, Apr 18, 2016 at 12:41 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> bq. close the ones where they don't respond for a week
>>
>> Does this imply that the script understands response from human ?
>>
>> Meaning, would the script use some regex which signifies that the
>> contributor is willing to close the PR ?
>>
>> If the contributor is willing to close, why wouldn't he / she do it
>> him/herself ?
>>
>> On Mon, Apr 18, 2016 at 12:33 PM, Holden Karau <hol...@pigscanfly.ca>
>> wrote:
>>
>>> Personally I'd rather err on the side of keeping PRs open, but I
>>> understand wanting to keep the open PRs limited to ones which have a
>>> reasonable chance of being merged.
>>>
>>> What about if we filtered for non-mergeable PRs or instead left a
>>> comment asking the author to respond if they are still available to move
>>> the PR forward - and close the ones where they don't respond for a week?
>>>
>>> Just a suggestion.
>>> On Monday, April 18, 2016, Ted Yu <yuzhih...@gmail.com> wrote:
>>>
>>>> I had one PR which got merged after 3 months.
>>>>
>>>> If the inactivity was due to contributor, I think it can be closed
>>>> after 30 days.
>>>> But if the inactivity was due to lack of review, the PR should be kept
>>>> open.
>>>>
>>>> On Mon, Apr 18, 2016 at 12:17 PM, Cody Koeninger <c...@koeninger.org>
>>>> wrote:
>>>>
>>>>> For what it's worth, I have definitely had PRs that sat inactive for
>>>>> more than 30 days due to committers not having time to look at them,
>>>>> but did eventually end up successfully being merged.
>>>>>
>>>>> I guess if this just ends up being a committer ping and reopening the
>>>>> PR, it's fine, but I don't know if it really addresses the underlying
>>>>> issue.
>>>>>
>>>>> On Mon, Apr 18, 2016 at 2:02 PM, Reynold Xin <r...@databricks.com>
>>>>> wrote:
>>>>> > We have hit a new high in open pull requests: 469 today. While we can
>>>>> > certainly get more review bandwidth, many of these are old and still
>>>>> open
>>>>> > for other reasons. Some are stale because the original authors have
>>>>> become
>>>>> > busy and inactive, and some others are stale because the committers
>>>>> are not
>>>>> > sure whether the patch would be useful, but have not rejected the
>>>>> patch
>>>>> > explicitly. We can cut down the signal to noise ratio by closing pull
>>>>> > requests that have been inactive for greater than 30 days, with a
>>>>> nice
>>>>> > message. I just checked and this would close ~ half of the pull
>>>>> requests.
>>>>> >
>>>>> > For example:
>>>>> >
>>>>> > "Thank you for creating this pull request. Since this pull request
>>>>> has been
>>>>> > inactive for 30 days, we are automatically closing it. Closing the
>>>>> pull
>>>>> > request does not remove it from history and will retain all the diff
>>>>> and
>>>>> > review comments. If you have the bandwidth and would like to continue
>>>>> > pushing this forward, please reopen it. Thanks again!"
>>>>> >
>>>>> >
>>>>>
>>>>> -
>>>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>>>
>>>>>
>>>>
>>>
>>> --
>>> Cell : 425-233-8271
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>>
>>
>


Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Ted Yu
bq. close the ones where they don't respond for a week

Does this imply that the script understands response from human ?

Meaning, would the script use some regex which signifies that the
contributor is willing to close the PR ?

If the contributor is willing to close, why wouldn't he / she do it
him/herself ?

On Mon, Apr 18, 2016 at 12:33 PM, Holden Karau <hol...@pigscanfly.ca> wrote:

> Personally I'd rather err on the side of keeping PRs open, but I
> understand wanting to keep the open PRs limited to ones which have a
> reasonable chance of being merged.
>
> What about if we filtered for non-mergeable PRs or instead left a comment
> asking the author to respond if they are still available to move the PR
> forward - and close the ones where they don't respond for a week?
>
> Just a suggestion.
> On Monday, April 18, 2016, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> I had one PR which got merged after 3 months.
>>
>> If the inactivity was due to contributor, I think it can be closed after
>> 30 days.
>> But if the inactivity was due to lack of review, the PR should be kept
>> open.
>>
>> On Mon, Apr 18, 2016 at 12:17 PM, Cody Koeninger <c...@koeninger.org>
>> wrote:
>>
>>> For what it's worth, I have definitely had PRs that sat inactive for
>>> more than 30 days due to committers not having time to look at them,
>>> but did eventually end up successfully being merged.
>>>
>>> I guess if this just ends up being a committer ping and reopening the
>>> PR, it's fine, but I don't know if it really addresses the underlying
>>> issue.
>>>
>>> On Mon, Apr 18, 2016 at 2:02 PM, Reynold Xin <r...@databricks.com>
>>> wrote:
>>> > We have hit a new high in open pull requests: 469 today. While we can
>>> > certainly get more review bandwidth, many of these are old and still
>>> open
>>> > for other reasons. Some are stale because the original authors have
>>> become
>>> > busy and inactive, and some others are stale because the committers
>>> are not
>>> > sure whether the patch would be useful, but have not rejected the patch
>>> > explicitly. We can cut down the signal to noise ratio by closing pull
>>> > requests that have been inactive for greater than 30 days, with a nice
>>> > message. I just checked and this would close ~ half of the pull
>>> requests.
>>> >
>>> > For example:
>>> >
>>> > "Thank you for creating this pull request. Since this pull request has
>>> been
>>> > inactive for 30 days, we are automatically closing it. Closing the pull
>>> > request does not remove it from history and will retain all the diff
>>> and
>>> > review comments. If you have the bandwidth and would like to continue
>>> > pushing this forward, please reopen it. Thanks again!"
>>> >
>>> >
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>
>>>
>>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>
>


Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Ted Yu
I had one PR which got merged after 3 months.

If the inactivity was due to the contributor, I think it can be closed after 30
days.
But if the inactivity was due to a lack of review, the PR should be kept open.

On Mon, Apr 18, 2016 at 12:17 PM, Cody Koeninger  wrote:

> For what it's worth, I have definitely had PRs that sat inactive for
> more than 30 days due to committers not having time to look at them,
> but did eventually end up successfully being merged.
>
> I guess if this just ends up being a committer ping and reopening the
> PR, it's fine, but I don't know if it really addresses the underlying
> issue.
>
> On Mon, Apr 18, 2016 at 2:02 PM, Reynold Xin  wrote:
> > We have hit a new high in open pull requests: 469 today. While we can
> > certainly get more review bandwidth, many of these are old and still open
> > for other reasons. Some are stale because the original authors have
> become
> > busy and inactive, and some others are stale because the committers are
> not
> > sure whether the patch would be useful, but have not rejected the patch
> > explicitly. We can cut down the signal to noise ratio by closing pull
> > requests that have been inactive for greater than 30 days, with a nice
> > message. I just checked and this would close ~ half of the pull requests.
> >
> > For example:
> >
> > "Thank you for creating this pull request. Since this pull request has
> been
> > inactive for 30 days, we are automatically closing it. Closing the pull
> > request does not remove it from history and will retain all the diff and
> > review comments. If you have the bandwidth and would like to continue
> > pushing this forward, please reopen it. Thanks again!"
> >
> >
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: BytesToBytes and unaligned memory

2016-04-18 Thread Ted Yu
bq. run the tests claiming to require unaligned memory access on a platform
where unaligned memory access is definitely not supported for
shorts/ints/longs.

That would help us understand the interactions on the s390x platform better.
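
(Not part of the thread: a minimal, hypothetical probe of unaligned access
outside Spark, along the lines of the "simple test case" mentioned in the quoted
message below. On platforms that genuinely reject unaligned access this can kill
the JVM with SIGBUS rather than throw, so run it in a disposable process.)

import java.lang.reflect.Field

// Grab sun.misc.Unsafe through its private static field.
val field: Field = classOf[sun.misc.Unsafe].getDeclaredField("theUnsafe")
field.setAccessible(true)
val unsafe = field.get(null).asInstanceOf[sun.misc.Unsafe]

// Write and read a long at a deliberately misaligned (odd) address.
val base = unsafe.allocateMemory(16)
try {
  unsafe.putLong(base + 1, 0x0123456789abcdefL)
  println(f"unaligned read: 0x${unsafe.getLong(base + 1)}%016x")
} finally {
  unsafe.freeMemory(base)
}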

On Mon, Apr 18, 2016 at 6:49 AM, Adam Roberts <arobe...@uk.ibm.com> wrote:

> Ted, yes with the forced true value all tests pass, we use the unaligned
> check in 15 other suites.
>
> Our java.nio.Bits.unaligned() function checks that the detected os.arch
> value matches a list of known implementations (not including s390x).
>
> We could add it to the known architectures in the catch block but this
> won't make a difference here as because we call unaligned() OK (no
> exception is thrown), we don't reach the architecture checking stage anyway.
>
> I see in org.apache.spark.memory.MemoryManager that unaligned support is
> required for off-heap memory in Tungsten (perhaps incorrectly if no code
> ever exercises it in Spark?). Instead of having a requirement should we
> instead log a warning once that this is likely to lead to slow performance?
> What's the rationale for supporting unaligned memory access: it's my
> understanding that it's typically very slow, are there any design docs or
> perhaps a JIRA where I can learn more?
>
> Will run a simple test case exercising unaligned memory access for Linux
> on Z (without using Spark) and can also run the tests claiming to require
> unaligned memory access on a platform where unaligned memory access is
> definitely not supported for shorts/ints/longs.
>
> if these tests continue to pass then I think the Spark tests don't
> exercise unaligned memory access, cheers
>
>
>
>
>
>
>
> From:Ted Yu <yuzhih...@gmail.com>
> To:Adam Roberts/UK/IBM@IBMGB
> Cc:"dev@spark.apache.org" <dev@spark.apache.org>
> Date:15/04/2016 17:35
> Subject:Re: BytesToBytes and unaligned memory
> --
>
>
>
> I am curious if all Spark unit tests pass with the forced true value for
> unaligned.
> If that is the case, it seems we can add s390x to the known architectures.
>
> It would also give us some more background if you can describe
> how java.nio.Bits#unaligned() is implemented on s390x.
>
> Josh / Andrew / Davies / Ryan are more familiar with related code. It
> would be good to hear what they think.
>
> Thanks
>
> On Fri, Apr 15, 2016 at 8:47 AM, Adam Roberts <*arobe...@uk.ibm.com*
> <arobe...@uk.ibm.com>> wrote:
> Ted, yeah with the forced true value the tests in that suite all pass and
> I know they're being executed thanks to prints I've added
>
> Cheers,
>
>
>
>
> From:Ted Yu <*yuzhih...@gmail.com* <yuzhih...@gmail.com>>
> To:Adam Roberts/UK/IBM@IBMGB
> Cc:"*dev@spark.apache.org* <dev@spark.apache.org>" <
> *dev@spark.apache.org* <dev@spark.apache.org>>
> Date:15/04/2016 16:43
> Subject:Re: BytesToBytes and unaligned memory
> --
>
>
>
> Can you clarify whether BytesToBytesMapOffHeapSuite passed or failed with
> the forced true value for unaligned ?
>
> If the test failed, please pastebin the failure(s).
>
> Thanks
>
> On Fri, Apr 15, 2016 at 8:32 AM, Adam Roberts <*arobe...@uk.ibm.com*
> <arobe...@uk.ibm.com>> wrote:
> Ted, yep I'm working from the latest code which includes that unaligned
> check, for experimenting I've modified that code to ignore the unaligned
> check (just go ahead and say we support it anyway, even though our JDK
> returns false: the return value of java.nio.Bits.unaligned()).
>
> My Platform.java for testing contains:
>
> private static final boolean unaligned;
>
> static {
>   boolean _unaligned;
>   // use reflection to access unaligned field
>   try {
> * System.out.println("Checking unaligned support");*
> Class bitsClass =
>   Class.forName("java.nio.Bits", false,
> ClassLoader.getSystemClassLoader());
> Method unalignedMethod = bitsClass.getDeclaredMethod("unaligned");
> unalignedMethod.setAccessible(true);
> _unaligned = Boolean.TRUE.equals(unalignedMethod.invoke(null));
> *System.out.println("Used reflection and _unaligned is: " +
> _unaligned);*
> * System.out.println("Setting to true anyway for experimenting");*
> * _unaligned = true;*
> } catch (Throwable t) {
>   // We at least know x86 and x64 support unaligned access.
>   String arch = System.getProperty("os.arch", "");
>   //noinspection DynamicRegexReplaceableByCompiledPattern
> *   // We don't actual

Re: BytesToBytes and unaligned memory

2016-04-15 Thread Ted Yu
I am curious if all Spark unit tests pass with the forced true value for
unaligned.
If that is the case, it seems we can add s390x to the known architectures.

It would also give us some more background if you can describe how
java.nio.Bits#unaligned()
is implemented on s390x.

Josh / Andrew / Davies / Ryan are more familiar with related code. It would
be good to hear what they think.

Thanks

On Fri, Apr 15, 2016 at 8:47 AM, Adam Roberts <arobe...@uk.ibm.com> wrote:

> Ted, yeah with the forced true value the tests in that suite all pass and
> I know they're being executed thanks to prints I've added
>
> Cheers,
>
>
>
>
> From:Ted Yu <yuzhih...@gmail.com>
> To:Adam Roberts/UK/IBM@IBMGB
> Cc:"dev@spark.apache.org" <dev@spark.apache.org>
> Date:15/04/2016 16:43
> Subject:Re: BytesToBytes and unaligned memory
> --
>
>
>
> Can you clarify whether BytesToBytesMapOffHeapSuite passed or failed with
> the forced true value for unaligned ?
>
> If the test failed, please pastebin the failure(s).
>
> Thanks
>
> On Fri, Apr 15, 2016 at 8:32 AM, Adam Roberts <*arobe...@uk.ibm.com*
> <arobe...@uk.ibm.com>> wrote:
> Ted, yep I'm working from the latest code which includes that unaligned
> check, for experimenting I've modified that code to ignore the unaligned
> check (just go ahead and say we support it anyway, even though our JDK
> returns false: the return value of java.nio.Bits.unaligned()).
>
> My Platform.java for testing contains:
>
> private static final boolean unaligned;
>
> static {
>   boolean _unaligned;
>   // use reflection to access unaligned field
>   try {
>     System.out.println("Checking unaligned support");
>     Class bitsClass =
>         Class.forName("java.nio.Bits", false,
>             ClassLoader.getSystemClassLoader());
>     Method unalignedMethod = bitsClass.getDeclaredMethod("unaligned");
>     unalignedMethod.setAccessible(true);
>     _unaligned = Boolean.TRUE.equals(unalignedMethod.invoke(null));
>     System.out.println("Used reflection and _unaligned is: " + _unaligned);
>     System.out.println("Setting to true anyway for experimenting");
>     _unaligned = true;
>   } catch (Throwable t) {
>     // We at least know x86 and x64 support unaligned access.
>     String arch = System.getProperty("os.arch", "");
>     //noinspection DynamicRegexReplaceableByCompiledPattern
>     // We don't actually get here since we find the unaligned method OK
>     // and it returns false (I override with true anyway),
>     // but add s390x in case we somehow fail anyway.
>     System.out.println("Checking for s390x, os.arch is: " + arch);
>     _unaligned = arch.matches("^(i[3-6]86|x86(_64)?|x64|s390x|amd64)$");
>   }
>   unaligned = _unaligned;
>   System.out.println("returning: " + unaligned);
>   }
> }
>
> Output is, as you'd expect, "used reflection and _unaligned is false,
> setting to true anyway for experimenting", and the tests pass.
>
> No other problems on the platform (pending a different pull request).
>
> Cheers,
>
>
>
>
>
>
>
> From:Ted Yu <*yuzhih...@gmail.com* <yuzhih...@gmail.com>>
> To:Adam Roberts/UK/IBM@IBMGB
> Cc:"*dev@spark.apache.org* <dev@spark.apache.org>" <
> *dev@spark.apache.org* <dev@spark.apache.org>>
> Date:15/04/2016 15:32
> Subject:Re: BytesToBytes and unaligned memory
> --
>
>
>
>
> I assume you tested 2.0 with SPARK-12181 .
>
> Related code from Platform.java if java.nio.Bits#unaligned() throws
> exception:
>
>   // We at least know x86 and x64 support unaligned access.
>   String arch = System.getProperty("os.arch", "");
>   //noinspection DynamicRegexReplaceableByCompiledPattern
>   _unaligned = arch.matches("^(i[3-6]86|x86(_64)?|x64|amd64)$");
>
> Can you give us some detail on how the code runs for JDKs on zSystems ?
>
> Thanks
>
> On Fri, Apr 15, 2016 at 7:01 AM, Adam Roberts <*arobe...@uk.ibm.com*
> <arobe...@uk.ibm.com>> wrote:
> Hi, I'm testing Spark 2.0.0 on various architectures and have a question,
> are we sure if
> *core/src/test/java/org/apache/spark/unsafe/map/AbstractBytesToBytesMapSuite.java*
> <https://github.com/apache/spark/blob/96941b12f8b465df21423275f3cd3ade579b4fa1/core/src/test/java/org/apache/spark/unsafe/map/AbstractBytesToBytesMapSuite.java>
> really is attempting to use unaligned memory access

Re: BytesToBytes and unaligned memory

2016-04-15 Thread Ted Yu
Can you clarify whether BytesToBytesMapOffHeapSuite passed or failed with
the forced true value for unaligned ?

If the test failed, please pastebin the failure(s).

Thanks

On Fri, Apr 15, 2016 at 8:32 AM, Adam Roberts <arobe...@uk.ibm.com> wrote:

> Ted, yep I'm working from the latest code which includes that unaligned
> check, for experimenting I've modified that code to ignore the unaligned
> check (just go ahead and say we support it anyway, even though our JDK
> returns false: the return value of java.nio.Bits.unaligned()).
>
> My Platform.java for testing contains:
>
> private static final boolean unaligned;
>
> static {
>   boolean _unaligned;
>   // use reflection to access unaligned field
>   try {
>     System.out.println("Checking unaligned support");
>     Class bitsClass =
>         Class.forName("java.nio.Bits", false,
>             ClassLoader.getSystemClassLoader());
>     Method unalignedMethod = bitsClass.getDeclaredMethod("unaligned");
>     unalignedMethod.setAccessible(true);
>     _unaligned = Boolean.TRUE.equals(unalignedMethod.invoke(null));
>     System.out.println("Used reflection and _unaligned is: " + _unaligned);
>     System.out.println("Setting to true anyway for experimenting");
>     _unaligned = true;
>   } catch (Throwable t) {
>     // We at least know x86 and x64 support unaligned access.
>     String arch = System.getProperty("os.arch", "");
>     //noinspection DynamicRegexReplaceableByCompiledPattern
>     // We don't actually get here since we find the unaligned method OK
>     // and it returns false (I override with true anyway),
>     // but add s390x in case we somehow fail anyway.
>     System.out.println("Checking for s390x, os.arch is: " + arch);
>     _unaligned = arch.matches("^(i[3-6]86|x86(_64)?|x64|s390x|amd64)$");
>   }
>   unaligned = _unaligned;
>   System.out.println("returning: " + unaligned);
>   }
> }
>
> Output is, as you'd expect, "used reflection and _unaligned is false,
> setting to true anyway for experimenting", and the tests pass.
>
> No other problems on the platform (pending a different pull request).
>
> Cheers,
>
>
>
>
>
>
>
> From:Ted Yu <yuzhih...@gmail.com>
> To:Adam Roberts/UK/IBM@IBMGB
> Cc:"dev@spark.apache.org" <dev@spark.apache.org>
> Date:15/04/2016 15:32
> Subject:Re: BytesToBytes and unaligned memory
> --
>
>
>
> I assume you tested 2.0 with SPARK-12181 .
>
> Related code from Platform.java if java.nio.Bits#unaligned() throws
> exception:
>
>   // We at least know x86 and x64 support unaligned access.
>   String arch = System.getProperty("os.arch", "");
>   //noinspection DynamicRegexReplaceableByCompiledPattern
>   _unaligned = arch.matches("^(i[3-6]86|x86(_64)?|x64|amd64)$");
>
> Can you give us some detail on how the code runs for JDKs on zSystems ?
>
> Thanks
>
> On Fri, Apr 15, 2016 at 7:01 AM, Adam Roberts <*arobe...@uk.ibm.com*
> <arobe...@uk.ibm.com>> wrote:
> Hi, I'm testing Spark 2.0.0 on various architectures and have a question,
> are we sure if
> *core/src/test/java/org/apache/spark/unsafe/map/AbstractBytesToBytesMapSuite.java*
> <https://github.com/apache/spark/blob/96941b12f8b465df21423275f3cd3ade579b4fa1/core/src/test/java/org/apache/spark/unsafe/map/AbstractBytesToBytesMapSuite.java>
> really is attempting to use unaligned memory access (for the
> BytesToBytesMapOffHeapSuite tests specifically)?
>
> Our JDKs on zSystems for example return false for the
> java.nio.Bits.unaligned() method and yet if I skip this check and add s390x
> to the supported architectures (for zSystems), all thirteen tests here
> pass.
>
> The 13 tests here all fail as we do not pass the unaligned requirement
> (but perhaps incorrectly):
>
> *core/src/test/java/org/apache/spark/unsafe/map/BytesToBytesMapOffHeapSuite.java*
> <https://github.com/apache/spark/blob/d6dc12ef0146ae409834c78737c116050961f350/core/src/test/java/org/apache/spark/unsafe/map/BytesToBytesMapOffHeapSuite.java>
> and I know the unaligned checking is at
> *common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java*
> <https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java>
>
> Either our JDK's method is returning false incorrectly or this test isn't
> using unaligned memory access (so the requirement is invalid), there's no
> mention of alignment in the test itself.
>
> Any guidance would be very much appreciated, cheers
>
>
> Unless stated otherwise above:
> IBM United Kingdom Limited - Registered in England and Wales with number
> 741598.
> Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
>
>
> Unless stated otherwise above:
> IBM United Kingdom Limited - Registered in England and Wales with number
> 741598.
> Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
>
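For anyone reproducing the check above on another platform, the same reflection probe can be run standalone from a Scala REPL to see what the JDK reports before any architecture fallback applies. A minimal sketch, assuming a pre-JDK-9 runtime where java.nio.Bits.unaligned() is still reachable via reflection:

// Probe java.nio.Bits.unaligned() the same way Platform.java does, and print
// the raw result together with os.arch (e.g. "s390x" on zSystems).
val bitsClass = Class.forName("java.nio.Bits", false, ClassLoader.getSystemClassLoader)
val unalignedMethod = bitsClass.getDeclaredMethod("unaligned")
unalignedMethod.setAccessible(true)
val reported = java.lang.Boolean.TRUE.equals(unalignedMethod.invoke(null))
println(s"os.arch=${System.getProperty("os.arch")} Bits.unaligned()=$reported")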


Re: BytesToBytes and unaligned memory

2016-04-15 Thread Ted Yu
I assume you tested 2.0 with SPARK-12181 .

Related code from Platform.java if java.nio.Bits#unaligned() throws
exception:

  // We at least know x86 and x64 support unaligned access.
  String arch = System.getProperty("os.arch", "");
  //noinspection DynamicRegexReplaceableByCompiledPattern
  _unaligned = arch.matches("^(i[3-6]86|x86(_64)?|x64|amd64)$");

Can you give us some detail on how the code runs for JDKs on zSystems ?

Thanks

On Fri, Apr 15, 2016 at 7:01 AM, Adam Roberts  wrote:

> Hi, I'm testing Spark 2.0.0 on various architectures and have a question,
> are we sure if
> core/src/test/java/org/apache/spark/unsafe/map/AbstractBytesToBytesMapSuite.java
> 
> really is attempting to use unaligned memory access (for the
> BytesToBytesMapOffHeapSuite tests specifically)?
>
> Our JDKs on zSystems for example return false for the
> java.nio.Bits.unaligned() method and yet if I skip this check and add s390x
> to the supported architectures (for zSystems), all thirteen tests here
> pass.
>
> The 13 tests here all fail as we do not pass the unaligned requirement
> (but perhaps incorrectly):
>
> core/src/test/java/org/apache/spark/unsafe/map/BytesToBytesMapOffHeapSuite.java
> 
> and I know the unaligned checking is at
> common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java
> 
>
> Either our JDK's method is returning false incorrectly or this test isn't
> using unaligned memory access (so the requirement is invalid), there's no
> mention of alignment in the test itself.
>
> Any guidance would be very much appreciated, cheers
>
>
> Unless stated otherwise above:
> IBM United Kingdom Limited - Registered in England and Wales with number
> 741598.
> Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
>


Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-04-11 Thread Ted Yu
Gentle ping: spark-1.6.1-bin-hadoop2.4.tgz from S3 is still corrupt.

On Wed, Apr 6, 2016 at 12:55 PM, Josh Rosen <joshro...@databricks.com>
wrote:

> Sure, I'll take a look. Planning to do full verification in a bit.
>
> On Wed, Apr 6, 2016 at 12:54 PM Ted Yu <yuzhih...@gmail.com> wrote:
>
>> Josh:
>> Can you check spark-1.6.1-bin-hadoop2.4.tgz ?
>>
>> $ tar zxf spark-1.6.1-bin-hadoop2.4.tgz
>>
>> gzip: stdin: not in gzip format
>> tar: Child returned status 1
>> tar: Error is not recoverable: exiting now
>>
>> $ ls -l !$
>> ls -l spark-1.6.1-bin-hadoop2.4.tgz
>> -rw-r--r--. 1 hbase hadoop 323614720 Apr  5 19:25
>> spark-1.6.1-bin-hadoop2.4.tgz
>>
>> Thanks
>>
>> On Wed, Apr 6, 2016 at 12:19 PM, Josh Rosen <joshro...@databricks.com>
>> wrote:
>>
>>> I downloaded the Spark 1.6.1 artifacts from the Apache mirror network
>>> and re-uploaded them to the spark-related-packages S3 bucket, so hopefully
>>> these packages should be fixed now.
>>>
>>> On Mon, Apr 4, 2016 at 3:37 PM Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
>>>> Thanks, that was the command. :thumbsup:
>>>>
>>>> On Mon, Apr 4, 2016 at 6:28 PM Jakob Odersky <ja...@odersky.com> wrote:
>>>>
>>>>> I just found out how the hash is calculated:
>>>>>
>>>>> gpg --print-md sha512 <archive>.tgz
>>>>>
>>>>> you can use that to check if the resulting output matches the contents
>>>>> of <archive>.tgz.sha
>>>>>
>>>>> On Mon, Apr 4, 2016 at 3:19 PM, Jakob Odersky <ja...@odersky.com>
>>>>> wrote:
>>>>> > The published hash is a SHA512.
>>>>> >
>>>>> > You can verify the integrity of the packages by running `sha512sum`
>>>>> on
>>>>> > the archive and comparing the computed hash with the published one.
>>>>> > Unfortunately however, I don't know what tool is used to generate the
>>>>> > hash and I can't reproduce the format, so I ended up manually
>>>>> > comparing the hashes.
>>>>> >
>>>>> > On Mon, Apr 4, 2016 at 2:39 PM, Nicholas Chammas
>>>>> > <nicholas.cham...@gmail.com> wrote:
>>>>> >> An additional note: The Spark packages being served off of
>>>>> CloudFront (i.e.
>>>>> >> the “direct download” option on spark.apache.org) are also corrupt.
>>>>> >>
>>>>> >> Btw what’s the correct way to verify the SHA of a Spark package?
>>>>> I’ve tried
>>>>> >> a few commands on working packages downloaded from Apache mirrors,
>>>>> but I
>>>>> >> can’t seem to reproduce the published SHA for
>>>>> spark-1.6.1-bin-hadoop2.6.tgz.
>>>>> >>
>>>>> >>
>>>>> >> On Mon, Apr 4, 2016 at 11:45 AM Ted Yu <yuzhih...@gmail.com> wrote:
>>>>> >>>
>>>>> >>> Maybe temporarily take out the artifacts on S3 before the root
>>>>> cause is
>>>>> >>> found.
>>>>> >>>
>>>>> >>> On Thu, Mar 24, 2016 at 7:25 AM, Nicholas Chammas
>>>>> >>> <nicholas.cham...@gmail.com> wrote:
>>>>> >>>>
>>>>> >>>> Just checking in on this again as the builds on S3 are still
>>>>> broken. :/
>>>>> >>>>
>>>>> >>>> Could it have something to do with us moving release-build.sh?
>>>>> >>>>
>>>>> >>>>
>>>>> >>>> On Mon, Mar 21, 2016 at 1:43 PM Nicholas Chammas
>>>>> >>>> <nicholas.cham...@gmail.com> wrote:
>>>>> >>>>>
>>>>> >>>>> Is someone going to retry fixing these packages? It's still a
>>>>> problem.
>>>>> >>>>>
>>>>> >>>>> Also, it would be good to understand why this is happening.
>>>>> >>>>>
>>>>> >>>>> On Fri, Mar 18, 2016 at 6:49 PM Jakob Odersky <ja...@odersky.com>
>>>>> wrote:
>>>>> >>>>>>
>>>>> >>>>>> I just realized you're using a different download site. Sorry
>>>&

Re: spark graphx storage RDD memory leak

2016-04-10 Thread Ted Yu
I see the following code toward the end of the method:

  // Unpersist the RDDs hidden by newly-materialized RDDs
  oldMessages.unpersist(blocking = false)
  prevG.unpersistVertices(blocking = false)
  prevG.edges.unpersist(blocking = false)

Wouldn't the above achieve same effect ?

On Sun, Apr 10, 2016 at 9:08 AM, zhang juntao 
wrote:

> hi experts,
>
> I’m reporting a problem with Spark GraphX. I use Zeppelin to submit Spark
> jobs; note that the Scala environment shares the same SparkContext and
> SQLContext instance. I call the connected components algorithm for some
> business logic, and I found that every time a job finished, some graph
> storage RDDs weren’t being released; after several runs there would be a
> lot of storage RDDs still present even though all the jobs had finished.
>
>
> So I checked the code of connectedComponents and found what may be a problem
> in Pregel.scala: when the graph parameter has already been cached, there isn’t
> any way to unpersist it, so I added the two unpersist calls on the input graph
> shown below to solve the problem.
>
>
>
>
>
>
>
>
>
>
> def apply[VD: ClassTag, ED: ClassTag, A: ClassTag]
>     (graph: Graph[VD, ED],
>      initialMsg: A,
>      maxIterations: Int = Int.MaxValue,
>      activeDirection: EdgeDirection = EdgeDirection.Either)
>     (vprog: (VertexId, VD, A) => VD,
>      sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
>      mergeMsg: (A, A) => A)
>   : Graph[VD, ED] = {
>
>   ..
>
>   var g = graph.mapVertices((vid, vdata) => vprog(vid, vdata, initialMsg)).cache()
>   graph.unpersistVertices(blocking = false)
>   graph.edges.unpersist(blocking = false)
>
>   ..
>
> } // end of apply
>
>
> I'm not sure if this is a bug, and thank you for your time, juntao
>
>
>
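For completeness, the caller can also release the input graph itself once the result has been materialized, without patching Pregel. A minimal sketch of that workaround, assuming GraphX is on the classpath (connectedComponentsAndRelease is an illustrative helper name, not Spark API):

import scala.reflect.ClassTag
import org.apache.spark.graphx.Graph

// Run connected components, force the result, then unpersist the input graph's
// vertex and edge RDDs so they no longer linger in the storage tab.
def connectedComponentsAndRelease[VD: ClassTag, ED: ClassTag](
    graph: Graph[VD, ED]) = {
  val cc = graph.connectedComponents()
  cc.vertices.count()                        // materialize before releasing the input
  graph.unpersistVertices(blocking = false)
  graph.edges.unpersist(blocking = false)
  cc
}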


Re: [BUILD FAILURE] Spark Project ML Local Library - me or it's real?

2016-04-09 Thread Ted Yu
Sent PR:
https://github.com/apache/spark/pull/12276

I was able to get build going past mllib-local module.

FYI

On Sat, Apr 9, 2016 at 12:40 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> The broken build was caused by the following:
>
> [SPARK-14462][ML][MLLIB] add the mllib-local build to maven pom
>
> See
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7/607/
>
> FYI
>
> On Sat, Apr 9, 2016 at 12:01 PM, Jacek Laskowski <ja...@japila.pl> wrote:
>
>> Hi,
>>
>> Is it just me, or is the build broken today? I'm looking for help as it
>> looks scary.
>>
>> $ ./build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.7.2 -Phive
>> -Phive-thriftserver -DskipTests clean install
>> [INFO] --- scala-maven-plugin:3.2.2:testCompile
>> (scala-test-compile-first) @ spark-mllib-local_2.11 ---
>> [INFO] Using zinc server for incremental compilation
>> [warn] Pruning sources from previous analysis, due to incompatible
>> CompileSetup.
>> [info] Compiling 1 Scala source to
>> /Users/jacek/dev/oss/spark/mllib-local/target/scala-2.11/test-classes...
>> [error] missing or invalid dependency detected while loading class
>> file 'SparkFunSuite.class'.
>> [error] Could not access type Logging in package
>> org.apache.spark.internal,
>> [error] because it (or its dependencies) are missing. Check your build
>> definition for
>> [error] missing or conflicting dependencies. (Re-run with
>> `-Ylog-classpath` to see the problematic classpath.)
>> [error] A full rebuild may help if 'SparkFunSuite.class' was compiled
>> against an incompatible version of org.apache.spark.internal.
>> [error] one error found
>> [error] Compile failed at Apr 9, 2016 2:27:30 PM [0.475s]
>> [INFO]
>> 
>> [INFO] Reactor Summary:
>> [INFO]
>> [INFO] Spark Project Parent POM ... SUCCESS [
>> 4.338 s]
>> [INFO] Spark Project Test Tags  SUCCESS [
>> 5.238 s]
>> [INFO] Spark Project Sketch ... SUCCESS [
>> 6.158 s]
>> [INFO] Spark Project Networking ... SUCCESS [
>> 10.397 s]
>> [INFO] Spark Project Shuffle Streaming Service  SUCCESS [
>> 7.263 s]
>> [INFO] Spark Project Unsafe ... SUCCESS [
>> 10.448 s]
>> [INFO] Spark Project Launcher . SUCCESS [
>> 11.028 s]
>> [INFO] Spark Project Core . SUCCESS
>> [02:04 min]
>> [INFO] Spark Project GraphX ... SUCCESS [
>> 16.973 s]
>> [INFO] Spark Project Streaming  SUCCESS [
>> 38.458 s]
>> [INFO] Spark Project Catalyst . SUCCESS
>> [01:18 min]
>> [INFO] Spark Project SQL .. SUCCESS
>> [01:24 min]
>> [INFO] Spark Project ML Local Library . FAILURE [
>> 1.083 s]
>> [INFO] Spark Project ML Library ... SKIPPED
>> [INFO] Spark Project Tools  SKIPPED
>> [INFO] Spark Project Hive . SKIPPED
>> [INFO] Spark Project Docker Integration Tests . SKIPPED
>> [INFO] Spark Project REPL . SKIPPED
>> [INFO] Spark Project YARN Shuffle Service . SKIPPED
>> [INFO] Spark Project YARN . SKIPPED
>> [INFO] Spark Project Hive Thrift Server ... SKIPPED
>> [INFO] Spark Project Assembly . SKIPPED
>> [INFO] Spark Project External Flume Sink .. SKIPPED
>> [INFO] Spark Project External Flume ... SKIPPED
>> [INFO] Spark Project External Flume Assembly .. SKIPPED
>> [INFO] Spark Project External Kafka ... SKIPPED
>> [INFO] Spark Project Examples . SKIPPED
>> [INFO] Spark Project External Kafka Assembly .. SKIPPED
>> [INFO] Spark Project Java 8 Tests . SKIPPED
>> [INFO]
>> 
>> [INFO] BUILD FAILURE
>> [INFO]
>> 
>> [INFO] Total time: 06:39 min
>> [INFO] Finished at: 2016-04-09T14:27:30-04:00
>> [INFO] Final Memory: 79M/893M
>> [INFO]
>> 
>> [

Re: [BUILD FAILURE] Spark Project ML Local Library - me or it's real?

2016-04-09 Thread Ted Yu
The broken build was caused by the following:

[SPARK-14462][ML][MLLIB] add the mllib-local build to maven pom

See
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7/607/

FYI

On Sat, Apr 9, 2016 at 12:01 PM, Jacek Laskowski  wrote:

> Hi,
>
> Is it just me, or is the build broken today? I'm looking for help as it
> looks scary.
>
> $ ./build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.7.2 -Phive
> -Phive-thriftserver -DskipTests clean install
> [INFO] --- scala-maven-plugin:3.2.2:testCompile
> (scala-test-compile-first) @ spark-mllib-local_2.11 ---
> [INFO] Using zinc server for incremental compilation
> [warn] Pruning sources from previous analysis, due to incompatible
> CompileSetup.
> [info] Compiling 1 Scala source to
> /Users/jacek/dev/oss/spark/mllib-local/target/scala-2.11/test-classes...
> [error] missing or invalid dependency detected while loading class
> file 'SparkFunSuite.class'.
> [error] Could not access type Logging in package org.apache.spark.internal,
> [error] because it (or its dependencies) are missing. Check your build
> definition for
> [error] missing or conflicting dependencies. (Re-run with
> `-Ylog-classpath` to see the problematic classpath.)
> [error] A full rebuild may help if 'SparkFunSuite.class' was compiled
> against an incompatible version of org.apache.spark.internal.
> [error] one error found
> [error] Compile failed at Apr 9, 2016 2:27:30 PM [0.475s]
> [INFO]
> 
> [INFO] Reactor Summary:
> [INFO]
> [INFO] Spark Project Parent POM ... SUCCESS [
> 4.338 s]
> [INFO] Spark Project Test Tags  SUCCESS [
> 5.238 s]
> [INFO] Spark Project Sketch ... SUCCESS [
> 6.158 s]
> [INFO] Spark Project Networking ... SUCCESS [
> 10.397 s]
> [INFO] Spark Project Shuffle Streaming Service  SUCCESS [
> 7.263 s]
> [INFO] Spark Project Unsafe ... SUCCESS [
> 10.448 s]
> [INFO] Spark Project Launcher . SUCCESS [
> 11.028 s]
> [INFO] Spark Project Core . SUCCESS [02:04
> min]
> [INFO] Spark Project GraphX ... SUCCESS [
> 16.973 s]
> [INFO] Spark Project Streaming  SUCCESS [
> 38.458 s]
> [INFO] Spark Project Catalyst . SUCCESS [01:18
> min]
> [INFO] Spark Project SQL .. SUCCESS [01:24
> min]
> [INFO] Spark Project ML Local Library . FAILURE [
> 1.083 s]
> [INFO] Spark Project ML Library ... SKIPPED
> [INFO] Spark Project Tools  SKIPPED
> [INFO] Spark Project Hive . SKIPPED
> [INFO] Spark Project Docker Integration Tests . SKIPPED
> [INFO] Spark Project REPL . SKIPPED
> [INFO] Spark Project YARN Shuffle Service . SKIPPED
> [INFO] Spark Project YARN . SKIPPED
> [INFO] Spark Project Hive Thrift Server ... SKIPPED
> [INFO] Spark Project Assembly . SKIPPED
> [INFO] Spark Project External Flume Sink .. SKIPPED
> [INFO] Spark Project External Flume ... SKIPPED
> [INFO] Spark Project External Flume Assembly .. SKIPPED
> [INFO] Spark Project External Kafka ... SKIPPED
> [INFO] Spark Project Examples . SKIPPED
> [INFO] Spark Project External Kafka Assembly .. SKIPPED
> [INFO] Spark Project Java 8 Tests . SKIPPED
> [INFO]
> 
> [INFO] BUILD FAILURE
> [INFO]
> 
> [INFO] Total time: 06:39 min
> [INFO] Finished at: 2016-04-09T14:27:30-04:00
> [INFO] Final Memory: 79M/893M
> [INFO]
> 
> [ERROR] Failed to execute goal
> net.alchim31.maven:scala-maven-plugin:3.2.2:testCompile
> (scala-test-compile-first) on project spark-mllib-local_2.11:
> Execution scala-test-compile-first of goal
> net.alchim31.maven:scala-maven-plugin:3.2.2:testCompile failed.
> CompileFailed -> [Help 1]
> [ERROR]
> [ERROR] To see the full stack trace of the errors, re-run Maven with
> the -e switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR]
> [ERROR] For more information about the errors and possible solutions,
> please read the following articles:
> [ERROR] [Help 1]
> http://cwiki.apache.org/confluence/display/MAVEN/PluginExecutionException
> [ERROR]
> [ERROR] After correcting the problems, you can resume the build with the
> command
> [ERROR]   mvn  -rf :spark-mllib-local_2.11
>
> 

Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-04-06 Thread Ted Yu
Josh:
Can you check spark-1.6.1-bin-hadoop2.4.tgz ?

$ tar zxf spark-1.6.1-bin-hadoop2.4.tgz

gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now

$ ls -l !$
ls -l spark-1.6.1-bin-hadoop2.4.tgz
-rw-r--r--. 1 hbase hadoop 323614720 Apr  5 19:25
spark-1.6.1-bin-hadoop2.4.tgz

Thanks

On Wed, Apr 6, 2016 at 12:19 PM, Josh Rosen <joshro...@databricks.com>
wrote:

> I downloaded the Spark 1.6.1 artifacts from the Apache mirror network and
> re-uploaded them to the spark-related-packages S3 bucket, so hopefully
> these packages should be fixed now.
>
> On Mon, Apr 4, 2016 at 3:37 PM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Thanks, that was the command. :thumbsup:
>>
>> On Mon, Apr 4, 2016 at 6:28 PM Jakob Odersky <ja...@odersky.com> wrote:
>>
>>> I just found out how the hash is calculated:
>>>
>>> gpg --print-md sha512 <archive>.tgz
>>>
>>> you can use that to check if the resulting output matches the contents
>>> of <archive>.tgz.sha
>>>
>>> On Mon, Apr 4, 2016 at 3:19 PM, Jakob Odersky <ja...@odersky.com> wrote:
>>> > The published hash is a SHA512.
>>> >
>>> > You can verify the integrity of the packages by running `sha512sum` on
>>> > the archive and comparing the computed hash with the published one.
>>> > Unfortunately however, I don't know what tool is used to generate the
>>> > hash and I can't reproduce the format, so I ended up manually
>>> > comparing the hashes.
>>> >
>>> > On Mon, Apr 4, 2016 at 2:39 PM, Nicholas Chammas
>>> > <nicholas.cham...@gmail.com> wrote:
>>> >> An additional note: The Spark packages being served off of CloudFront
>>> (i.e.
>>> >> the “direct download” option on spark.apache.org) are also corrupt.
>>> >>
>>> >> Btw what’s the correct way to verify the SHA of a Spark package? I’ve
>>> tried
>>> >> a few commands on working packages downloaded from Apache mirrors,
>>> but I
>>> >> can’t seem to reproduce the published SHA for
>>> spark-1.6.1-bin-hadoop2.6.tgz.
>>> >>
>>> >>
>>> >> On Mon, Apr 4, 2016 at 11:45 AM Ted Yu <yuzhih...@gmail.com> wrote:
>>> >>>
>>> >>> Maybe temporarily take out the artifacts on S3 before the root cause
>>> is
>>> >>> found.
>>> >>>
>>> >>> On Thu, Mar 24, 2016 at 7:25 AM, Nicholas Chammas
>>> >>> <nicholas.cham...@gmail.com> wrote:
>>> >>>>
>>> >>>> Just checking in on this again as the builds on S3 are still
>>> broken. :/
>>> >>>>
>>> >>>> Could it have something to do with us moving release-build.sh?
>>> >>>>
>>> >>>>
>>> >>>> On Mon, Mar 21, 2016 at 1:43 PM Nicholas Chammas
>>> >>>> <nicholas.cham...@gmail.com> wrote:
>>> >>>>>
>>> >>>>> Is someone going to retry fixing these packages? It's still a
>>> problem.
>>> >>>>>
>>> >>>>> Also, it would be good to understand why this is happening.
>>> >>>>>
>>> >>>>> On Fri, Mar 18, 2016 at 6:49 PM Jakob Odersky <ja...@odersky.com>
>>> wrote:
>>> >>>>>>
>>> >>>>>> I just realized you're using a different download site. Sorry for
>>> the
>>> >>>>>> confusion, the link I get for a direct download of Spark 1.6.1 /
>>> >>>>>> Hadoop 2.6 is
>>> >>>>>>
>>> http://d3kbcqa49mib13.cloudfront.net/spark-1.6.1-bin-hadoop2.6.tgz
>>> >>>>>>
>>> >>>>>> On Fri, Mar 18, 2016 at 3:20 PM, Nicholas Chammas
>>> >>>>>> <nicholas.cham...@gmail.com> wrote:
>>> >>>>>> > I just retried the Spark 1.6.1 / Hadoop 2.6 download and got a
>>> >>>>>> > corrupt ZIP
>>> >>>>>> > file.
>>> >>>>>> >
>>> >>>>>> > Jakob, are you sure the ZIP unpacks correctly for you? Is it
>>> the same
>>> >>>>>> > Spark
>>> >>>>>> > 1.6.1/Hadoop 2.6 package you had a success with?
>>> >>>>>> >
>>> &
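If the gpg output format is hard to reproduce, the raw digest can also be computed with the JDK alone and compared by eye against the published .sha file. A minimal Scala sketch, assuming the archive has been downloaded to the working directory (the file name below is just an example):

import java.nio.file.{Files, Paths}
import java.security.MessageDigest

// Compute the SHA-512 of the downloaded archive and print it as hex.
val archive = Paths.get("spark-1.6.1-bin-hadoop2.6.tgz")
val digest = MessageDigest.getInstance("SHA-512").digest(Files.readAllBytes(archive))
println(digest.map(b => f"$b%02x").mkString)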

Re: BROKEN BUILD? Is this only me or not?

2016-04-05 Thread Ted Yu
Probably related to Java 8.

I used :

$ java -version
java version "1.7.0_67"
Java(TM) SE Runtime Environment (build 1.7.0_67-b01)
Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)

On Tue, Apr 5, 2016 at 6:32 PM, Jacek Laskowski <ja...@japila.pl> wrote:

> Hi Ted,
>
> This is a similar issue
> https://issues.apache.org/jira/browse/SPARK-12530. I've fixed today's
> one and am sending a pull req.
>
> My build command is as follows:
>
> ./build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.7.2 -Phive
> -Phive-thriftserver -DskipTests clean install
>
> I'm on Java 8 / Mac OS X
>
> ➜  spark git:(master) ✗ java -version
> java version "1.8.0_77"
> Java(TM) SE Runtime Environment (build 1.8.0_77-b03)
> Java HotSpot(TM) 64-Bit Server VM (build 25.77-b03, mixed mode)
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Tue, Apr 5, 2016 at 8:41 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> > Looking at recent
> >
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7
> > builds, there was no such error.
> > I don't see anything wrong with the code:
> >
> >   usage = "_FUNC_(str) - " +
> > "Returns str, with the first letter of each word in uppercase, all
> other
> > letters in " +
> >
> > Mind refreshing and building again?
> >
> > If it still fails, please share the build command.
> >
> > On Tue, Apr 5, 2016 at 4:51 PM, Jacek Laskowski <ja...@japila.pl> wrote:
> >>
> >> Hi,
> >>
> >> Just checked out the latest sources and got this...
> >>
> >>
> >>
> /Users/jacek/dev/oss/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:626:
> >> error: annotation argument needs to be a constant; found: "_FUNC_(str)
> >> - ".+("Returns str, with the first letter of each word in uppercase,
> >> all other letters in ").+("lowercase. Words are delimited by white
> >> space.")
> >> "Returns str, with the first letter of each word in uppercase, all
> >> other letters in " +
> >>
> >>^
> >>
> >> It's in
> >>
> https://github.com/apache/spark/commit/c59abad052b7beec4ef550049413e95578e545be
> .
> >>
> >> Is this a real issue with the build now or is this just me? I may have
> >> seen a similar case before, but can't remember what the fix was.
> >> Looking into it.
> >>
> >> Pozdrawiam,
> >> Jacek Laskowski
> >> 
> >> https://medium.com/@jaceklaskowski/
> >> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> >> Follow me at https://twitter.com/jaceklaskowski
> >>
> >> -
> >> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> >> For additional commands, e-mail: dev-h...@spark.apache.org
> >>
> >
>


Re: BROKEN BUILD? Is this only me or not?

2016-04-05 Thread Ted Yu
Looking at recent
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7
builds, there was no such error.
I don't see anything wrong with the code:

  usage = "_FUNC_(str) - " +
"Returns str, with the first letter of each word in uppercase, all
other letters in " +

Mind refreshing and building again?

If it still fails, please share the build command.

On Tue, Apr 5, 2016 at 4:51 PM, Jacek Laskowski  wrote:

> Hi,
>
> Just checked out the latest sources and got this...
>
>
> /Users/jacek/dev/oss/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:626:
> error: annotation argument needs to be a constant; found: "_FUNC_(str)
> - ".+("Returns str, with the first letter of each word in uppercase,
> all other letters in ").+("lowercase. Words are delimited by white
> space.")
> "Returns str, with the first letter of each word in uppercase, all
> other letters in " +
>
>^
>
> It's in
> https://github.com/apache/spark/commit/c59abad052b7beec4ef550049413e95578e545be
> .
>
> Is this a real issue with the build now or is this just me? I may have
> seen a similar case before, but can't remember what the fix was.
> Looking into it.
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Updating Spark PR builder and 2.x test jobs to use Java 8 JDK

2016-04-05 Thread Ted Yu
Josh:
You may have noticed the following error (
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7/566/console
):

[error] javac: invalid source release: 1.8
[error] Usage: javac <options> <source files>
[error] use -help for a list of possible options


On Tue, Apr 5, 2016 at 2:14 PM, Josh Rosen  wrote:

> In order to be able to run Java 8 API compatibility tests, I'm going to
> push a new set of Jenkins configurations for Spark's test and PR builders
> so that those jobs use a Java 8 JDK. I tried this once in the past and it
> seemed to introduce some rare, transient flakiness in certain tests, so if
> anyone observes new test failures please email me and I'll investigate
> right away.
>
> Note that this change has no impact on Spark's supported JDK versions and
> our build will still target Java 7 and emit Java 7 bytecode; the purpose of
> this change is simply to allow the Java 8 lambda tests to be run as part of
> PR builder runs.
>
> - Josh
>


Re: [STREAMING] DStreamClosureSuite.scala with { return; ssc.sparkContext.emptyRDD[Int] } Why?!

2016-04-05 Thread Ted Yu
The next line should give some clue:
expectCorrectException { ssc.transform(Seq(ds), transformF) }

A closure shouldn't include a return statement.

On Tue, Apr 5, 2016 at 3:40 PM, Jacek Laskowski  wrote:

> Hi,
>
> In
> https://github.com/apache/spark/blob/master/streaming/src/test/scala/org/apache/spark/streaming/DStreamClosureSuite.scala#L190
> :
>
> { return; ssc.sparkContext.emptyRDD[Int] }
>
> What is this return inside for? I don't understand the line and am
> about to propose a change to remove it.
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>
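The reason the test deliberately places return inside the closure: in Scala, return inside an anonymous function is a non-local return that unwinds to the enclosing method via an exception, and that pattern cannot work once the closure runs on an executor, which is why the suite expects Spark to reject it. A minimal sketch of the language behaviour being exercised (plain Scala, no Spark needed):

// The `return` inside the function literal does not return from the lambda;
// it throws a NonLocalReturnControl that unwinds to the enclosing method.
def firstEven(nums: Seq[Int]): Option[Int] = {
  nums.foreach { n => if (n % 2 == 0) return Some(n) }
  None
}

// Works when called directly in one JVM frame; the same closure shipped to a
// remote task has no enclosing frame to return to.
println(firstEven(Seq(1, 3, 4, 5)))  // Some(4)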


Re: Build with Thrift Server & Scala 2.11

2016-04-05 Thread Ted Yu
Raymond:

Did "namenode" appear in any of the Spark config files ?

BTW Scala 2.11 is used by the default build.

On Tue, Apr 5, 2016 at 6:22 AM, Raymond Honderdors <
raymond.honderd...@sizmek.com> wrote:

> I can see that the build is successful
>
> (-Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver
> -Dscala-2.11 -DskipTests clean package)
>
>
>
> the documentation page still says that
>
> “
>
> Building With Hive and JDBC Support
>
> To enable Hive integration for Spark SQL along with its JDBC server and
> CLI, add the -Phive and -Phive-thriftserver profiles to your existing build
> options. By default Spark will build with Hive 0.13.1 bindings.
>
>
>
> # Apache Hadoop 2.4.X with Hive 13 support
>
> mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver
> -DskipTests clean package
>
> Building for Scala 2.11
>
> To produce a Spark package compiled with Scala 2.11, use the -Dscala-2.11
> property:
>
>
>
> ./dev/change-scala-version.sh 2.11
>
> mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package
>
> Spark does not yet support its JDBC component for Scala 2.11.
>
> ”
>
> Source : http://spark.apache.org/docs/latest/building-spark.html
>
>
>
> When I try to start the thrift server I get the following error:
>
> “
>
> 16/04/05 16:09:11 INFO BlockManagerMaster: Registered BlockManager
>
> 16/04/05 16:09:12 ERROR SparkContext: Error initializing SparkContext.
>
> java.lang.IllegalArgumentException: java.net.UnknownHostException: namenode
>
> at
> org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:374)
>
> at
> org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:312)
>
> at
> org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:178)
>
> at
> org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:665)
>
> at
> org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:601)
>
> at
> org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:148)
>
> at
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596)
>
> at
> org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
>
> at
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
>
> at
> org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
>
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
>
> at
> org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1667)
>
> at
> org.apache.spark.scheduler.EventLoggingListener.<init>(EventLoggingListener.scala:67)
>
> at
> org.apache.spark.SparkContext.<init>(SparkContext.scala:517)
>
> at
> org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:57)
>
> at
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:77)
>
> at
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala)
>
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
> Method)
>
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>
> at java.lang.reflect.Method.invoke(Method.java:498)
>
> at
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:726)
>
> at
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:183)
>
> at
> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:208)
>
> at
> org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:122)
>
> at
> org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
> Caused by: java.net.UnknownHostException: namenode
>
> ... 26 more
>
> 16/04/05 16:09:12 INFO SparkUI: Stopped Spark web UI at
> http://10.10.182.195:4040
>
> 16/04/05 16:09:12 INFO SparkDeploySchedulerBackend: Shutting down all
> executors
>
> ”
>
>
>
>
>
>
>
> Raymond Honderdors
>
> Team Lead Analytics BI
>
> Business Intelligence Developer
>
> raymond.honderd...@sizmek.com
>
> T +972.7325.3569
>
> Herzliya
>
>
>
> From: Reynold Xin [mailto:r...@databricks.com]
> Sent: Tuesday, April 05, 2016 3:57 PM
> To: Raymond Honderdors
> Cc: dev@spark.apache.org
> Subject: Re: Build with Thrift Server & Scala 2.11
>
>
>
> What do you mean? The Jenkins build for Spark uses 2.11 and also builds
> the thrift server.
>
> On Tuesday, April 5, 2016, Raymond Honderdors <
> raymond.honderd...@sizmek.com> wrote:
>
> Is anyone looking into this one, Build with Thrift Server & 
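Since the stack trace goes through EventLoggingListener resolving the filesystem of the event-log directory, the unresolvable "namenode" host most likely comes from an event-log (or similar filesystem) setting in spark-defaults.conf or the SparkConf rather than from the build itself. A hypothetical sketch of the kind of setting to look for (the values below are illustrative only, not from this thread):

import org.apache.spark.SparkConf

// Hypothetical configuration that would produce exactly this UnknownHostException
// when the host "namenode" cannot be resolved from the machine starting the server.
val conf = new SparkConf()
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "hdfs://namenode:8020/spark-logs") // host must resolve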

Re: error: reference to sql is ambiguous after import org.apache.spark._ in shell?

2016-04-04 Thread Ted Yu
Looks like the import comes from
repl/scala-2.11/src/main/scala/org/apache/spark/repl/SparkILoop.scala :

  processLine("import sqlContext.sql")

On Mon, Apr 4, 2016 at 5:16 PM, Jacek Laskowski  wrote:

> Hi Spark devs,
>
> I'm unsure if what I'm seeing is correct. I'd appreciate any input
> to...rest my nerves :-) I did `import org.apache.spark._` by mistake,
> but since it's valid, I'm wondering why does Spark shell imports sql
> at all since it's available after the import?!
>
> (it's today's build)
>
> scala> sql("SELECT * FROM dafa").show(false)
> :30: error: reference to sql is ambiguous;
> it is imported twice in the same scope by
> import org.apache.spark._
> and import sqlContext.sql
>sql("SELECT * FROM dafa").show(false)
>^
>
> scala> :imports
>  1) import sqlContext.implicits._  (52 terms, 31 are implicit)
>  2) import sqlContext.sql  (1 terms)
>
> scala> sc.version
> res19: String = 2.0.0-SNAPSHOT
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>
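As a practical workaround while both imports are in scope, the call can be made unambiguous by going through the instance instead of the imported method. A minimal sketch, assuming the usual sqlContext binding in spark-shell (the table name is just the one from the example above):

// Call through the instance so the wildcard import of org.apache.spark._
// (which pulls in the org.apache.spark.sql package) no longer clashes.
sqlContext.sql("SELECT * FROM dafa").show(false)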


Re: explain codegen

2016-04-04 Thread Ted Yu
Thanks to all who have responded.

It turned out that the following goal on the maven command line caused the
error (I forgot to include it in the first email):
eclipse:eclipse

Once I omitted the above, 'explain codegen' works.

On Mon, Apr 4, 2016 at 9:37 AM, Reynold Xin <r...@databricks.com> wrote:

> Why don't you wipe everything out and try again?
>
> On Monday, April 4, 2016, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> The commit you mentioned was made Friday.
>> I refreshed workspace Sunday - so it was included.
>>
>> Maybe this was related:
>>
>> $ bin/spark-shell
>> Failed to find Spark jars directory
>> (/home/hbase/spark/assembly/target/scala-2.10).
>> You need to build Spark before running this program.
>>
>> Then I did:
>>
>> $ ln -s /home/hbase/spark/assembly/target/scala-2.11
>> assembly/target/scala-2.10
>>
>> Cheers
>>
>> On Mon, Apr 4, 2016 at 4:06 AM, Herman van Hövell tot Westerflier <
>> hvanhov...@questtec.nl> wrote:
>>
>>> No, it can''t. You only need implicits when you are using the catalyst
>>> DSL.
>>>
>>> The error you get is due to the fact that the parser does not recognize
>>> the CODEGEN keyword (which was the case before we introduced this in
>>> https://github.com/apache/spark/commit/fa1af0aff7bde9bbf7bfa6a3ac74699734c2fd8a).
>>> That suggests to me that you are not on the latest master.
>>>
>>> Kind regards,
>>>
>>> Herman van Hövell
>>>
>>> 2016-04-04 12:15 GMT+02:00 Ted Yu <yuzhih...@gmail.com>:
>>>
>>>> Could the error I encountered be due to missing import(s) of implicit ?
>>>>
>>>> Thanks
>>>>
>>>> On Sun, Apr 3, 2016 at 9:42 PM, Reynold Xin <r...@databricks.com>
>>>> wrote:
>>>>
>>>>> Works for me on latest master.
>>>>>
>>>>>
>>>>>
>>>>> scala> sql("explain codegen select 'a' as a group by 1").head
>>>>> res3: org.apache.spark.sql.Row =
>>>>> [Found 2 WholeStageCodegen subtrees.
>>>>> == Subtree 1 / 2 ==
>>>>> WholeStageCodegen
>>>>> :  +- TungstenAggregate(key=[], functions=[], output=[a#10])
>>>>> : +- INPUT
>>>>> +- Exchange SinglePartition, None
>>>>>+- WholeStageCodegen
>>>>>   :  +- TungstenAggregate(key=[], functions=[], output=[])
>>>>>   : +- INPUT
>>>>>   +- Scan OneRowRelation[]
>>>>>
>>>>> Generated code:
>>>>> /* 001 */ public Object generate(Object[] references) {
>>>>> /* 002 */   return new GeneratedIterator(references);
>>>>> /* 003 */ }
>>>>> /* 004 */
>>>>> /* 005 */ /** Codegened pipeline for:
>>>>> /* 006 */ * TungstenAggregate(key=[], functions=[], output=[a#10])
>>>>> /* 007 */ +- INPUT
>>>>> /* 008 */ */
>>>>> /* 009 */ final class GeneratedIterator extends
>>>>> org.apache.spark.sql.execution.BufferedRowIterator {
>>>>> /* 010 */   private Object[] references;
>>>>> /* 011 */   ...
>>>>>
>>>>>
>>>>> On Sun, Apr 3, 2016 at 9:38 PM, Jacek Laskowski <ja...@japila.pl>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Looks related to the recent commit...
>>>>>>
>>>>>> Repository: spark
>>>>>> Updated Branches:
>>>>>>   refs/heads/master 2262a9335 -> 1f0c5dceb
>>>>>>
>>>>>> [SPARK-14350][SQL] EXPLAIN output should be in a single cell
>>>>>>
>>>>>> Jacek
>>>>>> 03.04.2016 7:00 PM "Ted Yu" <yuzhih...@gmail.com> napisał(a):
>>>>>>
>>>>>>> Hi,
>>>>>>> Based on master branch refreshed today, I issued 'git clean -fdx'
>>>>>>> first.
>>>>>>>
>>>>>>> Then this command:
>>>>>>> build/mvn clean -Phive -Phive-thriftserver -Pyarn -Phadoop-2.6
>>>>>>> -Dhadoop.version=2.7.0 package -DskipTests
>>>>>>>
>>>>>>> I got the following error:
>>>>>>>
>>>>>>> scala>  sql("explain codegen select 'a' as a group by 1").head
>>>>>>> org.apache.spark.sql.catalyst.parser.ParseException:
>>>>>>> extraneous input 'codegen' expecting {'(', 'SELECT', 'FROM', 'ADD',
>>>>>>> 'DESC', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'INSERT', 'DELETE',
>>>>>>> 'DESCRIBE', 'EXPLAIN', 'LOGICAL', 'SHOW', 'USE', 'DROP', 'ALTER', 'MAP',
>>>>>>> 'SET', 'START', 'COMMIT', 'ROLLBACK', 'REDUCE', 'EXTENDED', 'REFRESH',
>>>>>>> 'CLEAR', 'CACHE', 'UNCACHE', 'FORMATTED', 'DFS', 'TRUNCATE', 'ANALYZE',
>>>>>>> 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'EXPORT', 'IMPORT',
>>>>>>> 'LOAD'}(line 1, pos 8)
>>>>>>>
>>>>>>> == SQL ==
>>>>>>> explain codegen select 'a' as a group by 1
>>>>>>> ^^^
>>>>>>>
>>>>>>> Can someone shed light ?
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>


Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-04-04 Thread Ted Yu
Maybe temporarily take out the artifacts on S3 before the root cause is
found.

On Thu, Mar 24, 2016 at 7:25 AM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> Just checking in on this again as the builds on S3 are still broken. :/
>
> Could it have something to do with us moving release-build.sh
> <https://github.com/apache/spark/commits/master/dev/create-release/release-build.sh>
> ?
> ​
>
> On Mon, Mar 21, 2016 at 1:43 PM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Is someone going to retry fixing these packages? It's still a problem.
>>
>> Also, it would be good to understand why this is happening.
>>
>> On Fri, Mar 18, 2016 at 6:49 PM Jakob Odersky <ja...@odersky.com> wrote:
>>
>>> I just realized you're using a different download site. Sorry for the
>>> confusion, the link I get for a direct download of Spark 1.6.1 /
>>> Hadoop 2.6 is
>>> http://d3kbcqa49mib13.cloudfront.net/spark-1.6.1-bin-hadoop2.6.tgz
>>>
>>> On Fri, Mar 18, 2016 at 3:20 PM, Nicholas Chammas
>>> <nicholas.cham...@gmail.com> wrote:
>>> > I just retried the Spark 1.6.1 / Hadoop 2.6 download and got a corrupt
>>> ZIP
>>> > file.
>>> >
>>> > Jakob, are you sure the ZIP unpacks correctly for you? Is it the same
>>> Spark
>>> > 1.6.1/Hadoop 2.6 package you had a success with?
>>> >
>>> > On Fri, Mar 18, 2016 at 6:11 PM Jakob Odersky <ja...@odersky.com>
>>> wrote:
>>> >>
>>> >> I just experienced the issue, however retrying the download a second
>>> >> time worked. Could it be that there is some load balancer/cache in
>>> >> front of the archive and some nodes still serve the corrupt packages?
>>> >>
>>> >> On Fri, Mar 18, 2016 at 8:00 AM, Nicholas Chammas
>>> >> <nicholas.cham...@gmail.com> wrote:
>>> >> > I'm seeing the same. :(
>>> >> >
>>> >> > On Fri, Mar 18, 2016 at 10:57 AM Ted Yu <yuzhih...@gmail.com>
>>> wrote:
>>> >> >>
>>> >> >> I tried again this morning :
>>> >> >>
>>> >> >> $ wget
>>> >> >>
>>> >> >>
>>> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz
>>> >> >> --2016-03-18 07:55:30--
>>> >> >>
>>> >> >>
>>> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz
>>> >> >> Resolving s3.amazonaws.com... 54.231.19.163
>>> >> >> ...
>>> >> >> $ tar zxf spark-1.6.1-bin-hadoop2.6.tgz
>>> >> >>
>>> >> >> gzip: stdin: unexpected end of file
>>> >> >> tar: Unexpected EOF in archive
>>> >> >> tar: Unexpected EOF in archive
>>> >> >> tar: Error is not recoverable: exiting now
>>> >> >>
>>> >> >> On Thu, Mar 17, 2016 at 8:57 AM, Michael Armbrust
>>> >> >> <mich...@databricks.com>
>>> >> >> wrote:
>>> >> >>>
>>> >> >>> Patrick reuploaded the artifacts, so it should be fixed now.
>>> >> >>>
>>> >> >>> On Mar 16, 2016 5:48 PM, "Nicholas Chammas"
>>> >> >>> <nicholas.cham...@gmail.com>
>>> >> >>> wrote:
>>> >> >>>>
>>> >> >>>> Looks like the other packages may also be corrupt. I’m getting
>>> the
>>> >> >>>> same
>>> >> >>>> error for the Spark 1.6.1 / Hadoop 2.4 package.
>>> >> >>>>
>>> >> >>>>
>>> >> >>>>
>>> >> >>>>
>>> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.4.tgz
>>> >> >>>>
>>> >> >>>> Nick
>>> >> >>>>
>>> >> >>>>
>>> >> >>>> On Wed, Mar 16, 2016 at 8:28 PM Ted Yu <yuzhih...@gmail.com>
>>> wrote:
>>> >> >>>>>
>>> >> >>>>> On Linux, I got:
>>> >> >>>>>
>>> >> >>>>> $ tar zxf spark-1.6.1-bin-hadoop2.6.tgz
>>> >> >>>>>
>>> >> >>>>> gzip: stdin: unexpected end of file
>>> >> >>>>> tar: Unexpected EOF in archive
>>> >> >>>>> tar: Unexpected EOF in archive
>>> >> >>>>> tar: Error is not recoverable: exiting now
>>> >> >>>>>
>>> >> >>>>> On Wed, Mar 16, 2016 at 5:15 PM, Nicholas Chammas
>>> >> >>>>> <nicholas.cham...@gmail.com> wrote:
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>>
>>> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz
>>> >> >>>>>>
>>> >> >>>>>> Does anyone else have trouble unzipping this? How did this
>>> happen?
>>> >> >>>>>>
>>> >> >>>>>> What I get is:
>>> >> >>>>>>
>>> >> >>>>>> $ gzip -t spark-1.6.1-bin-hadoop2.6.tgz
>>> >> >>>>>> gzip: spark-1.6.1-bin-hadoop2.6.tgz: unexpected end of file
>>> >> >>>>>> gzip: spark-1.6.1-bin-hadoop2.6.tgz: uncompress failed
>>> >> >>>>>>
>>> >> >>>>>> Seems like a strange type of problem to come across.
>>> >> >>>>>>
>>> >> >>>>>> Nick
>>> >> >>>>>
>>> >> >>>>>
>>> >> >>
>>> >> >
>>>
>>


Re: RDD Partitions not distributed evenly to executors

2016-04-04 Thread Ted Yu
bq. the modifications do not touch the scheduler

If the changes can be ported over to 1.6.1, do you mind reproducing the
issue there ?

I ask because the master branch changes very fast. It would be good to narrow
down where the behavior you observed started showing up.

On Mon, Apr 4, 2016 at 6:12 AM, Mike Hynes <91m...@gmail.com> wrote:

> [ CC'ing dev list since nearly identical questions have occurred in
> user list recently w/o resolution;
> c.f.:
>
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-work-distribution-among-execs-tt26502.html
>
> http://apache-spark-user-list.1001560.n3.nabble.com/Partitions-are-get-placed-on-the-single-node-tt26597.html
> ]
>
> Hello,
>
> In short, I'm reporting a problem concerning load imbalance of RDD
> partitions across a standalone cluster. Though there are 16 cores
> available per node, certain nodes will have >16 partitions, and some
> will correspondingly have <16 (and even 0).
>
> In more detail: I am running some scalability/performance tests for
> vector-type operations. The RDDs I'm considering are simple block
> vectors of type RDD[(Int,Vector)] for a Breeze vector type. The RDDs
> are generated with a fixed number of elements given by some multiple
> of the available cores, and subsequently hash-partitioned by their
> integer block index.
>
> I have verified that the hash partitioning key distribution, as well
> as the keys themselves, are both correct; the problem is truly that
> the partitions are *not* evenly distributed across the nodes.
>
> For instance, here is a representative output for some stages and
> tasks in an iterative program. This is a very simple test with 2
> nodes, 64 partitions, 32 cores (16 per node), and 2 executors. Two
> examples stages from the stderr log are stages 7 and 9:
> 7,mapPartitions at DummyVector.scala:113,64,1459771364404,1459771365272
> 9,mapPartitions at DummyVector.scala:113,64,1459771364431,1459771365639
>
> When counting the location of the partitions on the compute nodes from
> the stderr logs, however, you can clearly see the imbalance. Examples
> lines are:
> 13627 task 0.0 in stage 7.0 (TID 196,
> himrod-2, partition 0,PROCESS_LOCAL, 3987 bytes)
> 13628 task 1.0 in stage 7.0 (TID 197,
> himrod-2, partition 1,PROCESS_LOCAL, 3987 bytes)
> 13629 task 2.0 in stage 7.0 (TID 198,
> himrod-2, partition 2,PROCESS_LOCAL, 3987 bytes)
>
> Grep'ing the full set of above lines for each hostname, himrod-?,
> shows the problem occurs in each stage. Below is the output, where the
> number of partitions stored on each node is given alongside its
> hostname as in (himrod-?,num_partitions):
> Stage 7: (himrod-1,0) (himrod-2,64)
> Stage 9: (himrod-1,16) (himrod-2,48)
> Stage 12: (himrod-1,0) (himrod-2,64)
> Stage 14: (himrod-1,16) (himrod-2,48)
> The imbalance is also visible when the executor ID is used to count
> the partitions operated on by executors.
>
> I am working off a fairly recent modification of 2.0.0-SNAPSHOT branch
> (but the modifications do not touch the scheduler, and are irrelevant
> for these particular tests). Has something changed radically in 1.6+
> that would make a previously (<=1.5) correct configuration go haywire?
> Have new configuration settings been added of which I'm unaware that
> could lead to this problem?
>
> Please let me know if others in the community have observed this, and
> thank you for your time,
> Mike
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>
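For narrowing this down, the per-host tally Mike did from the stderr logs can also be computed directly from the job, which makes it easy to rerun the same check on 1.6.1 and on master. A minimal sketch, assuming a live spark-shell SparkContext sc (the element count and slice count are arbitrary):

// Count how many partitions of a cached RDD are processed on each host.
// This measures task placement, which is a reasonable proxy for where the
// hash-partitioned blocks end up.
val rdd = sc.parallelize(1 to (1 << 20), numSlices = 64).cache()
rdd.count()  // materialize the blocks first
val perHost = rdd
  .mapPartitions(_ => Iterator((java.net.InetAddress.getLocalHost.getHostName, 1)))
  .reduceByKey(_ + _)
  .collect()
perHost.foreach { case (host, n) => println(s"$host -> $n partitions") }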


Re: explain codegen

2016-04-04 Thread Ted Yu
The commit you mentioned was made Friday.
I refreshed workspace Sunday - so it was included.

Maybe this was related:

$ bin/spark-shell
Failed to find Spark jars directory
(/home/hbase/spark/assembly/target/scala-2.10).
You need to build Spark before running this program.

Then I did:

$ ln -s /home/hbase/spark/assembly/target/scala-2.11
assembly/target/scala-2.10

Cheers

On Mon, Apr 4, 2016 at 4:06 AM, Herman van Hövell tot Westerflier <
hvanhov...@questtec.nl> wrote:

> No, it can''t. You only need implicits when you are using the catalyst DSL.
>
> The error you get is due to the fact that the parser does not recognize
> the CODEGEN keyword (which was the case before we introduced this in
> https://github.com/apache/spark/commit/fa1af0aff7bde9bbf7bfa6a3ac74699734c2fd8a).
> That suggests to me that you are not on the latest master.
>
> Kind regards,
>
> Herman van Hövell
>
> 2016-04-04 12:15 GMT+02:00 Ted Yu <yuzhih...@gmail.com>:
>
>> Could the error I encountered be due to missing import(s) of implicit ?
>>
>> Thanks
>>
>> On Sun, Apr 3, 2016 at 9:42 PM, Reynold Xin <r...@databricks.com> wrote:
>>
>>> Works for me on latest master.
>>>
>>>
>>>
>>> scala> sql("explain codegen select 'a' as a group by 1").head
>>> res3: org.apache.spark.sql.Row =
>>> [Found 2 WholeStageCodegen subtrees.
>>> == Subtree 1 / 2 ==
>>> WholeStageCodegen
>>> :  +- TungstenAggregate(key=[], functions=[], output=[a#10])
>>> : +- INPUT
>>> +- Exchange SinglePartition, None
>>>+- WholeStageCodegen
>>>   :  +- TungstenAggregate(key=[], functions=[], output=[])
>>>   : +- INPUT
>>>   +- Scan OneRowRelation[]
>>>
>>> Generated code:
>>> /* 001 */ public Object generate(Object[] references) {
>>> /* 002 */   return new GeneratedIterator(references);
>>> /* 003 */ }
>>> /* 004 */
>>> /* 005 */ /** Codegened pipeline for:
>>> /* 006 */ * TungstenAggregate(key=[], functions=[], output=[a#10])
>>> /* 007 */ +- INPUT
>>> /* 008 */ */
>>> /* 009 */ final class GeneratedIterator extends
>>> org.apache.spark.sql.execution.BufferedRowIterator {
>>> /* 010 */   private Object[] references;
>>> /* 011 */   ...
>>>
>>>
>>> On Sun, Apr 3, 2016 at 9:38 PM, Jacek Laskowski <ja...@japila.pl> wrote:
>>>
>>>> Hi,
>>>>
>>>> Looks related to the recent commit...
>>>>
>>>> Repository: spark
>>>> Updated Branches:
>>>>   refs/heads/master 2262a9335 -> 1f0c5dceb
>>>>
>>>> [SPARK-14350][SQL] EXPLAIN output should be in a single cell
>>>>
>>>> Jacek
>>>> 03.04.2016 7:00 PM "Ted Yu" <yuzhih...@gmail.com> napisał(a):
>>>>
>>>>> Hi,
>>>>> Based on master branch refreshed today, I issued 'git clean -fdx'
>>>>> first.
>>>>>
>>>>> Then this command:
>>>>> build/mvn clean -Phive -Phive-thriftserver -Pyarn -Phadoop-2.6
>>>>> -Dhadoop.version=2.7.0 package -DskipTests
>>>>>
>>>>> I got the following error:
>>>>>
>>>>> scala>  sql("explain codegen select 'a' as a group by 1").head
>>>>> org.apache.spark.sql.catalyst.parser.ParseException:
>>>>> extraneous input 'codegen' expecting {'(', 'SELECT', 'FROM', 'ADD',
>>>>> 'DESC', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'INSERT', 'DELETE',
>>>>> 'DESCRIBE', 'EXPLAIN', 'LOGICAL', 'SHOW', 'USE', 'DROP', 'ALTER', 'MAP',
>>>>> 'SET', 'START', 'COMMIT', 'ROLLBACK', 'REDUCE', 'EXTENDED', 'REFRESH',
>>>>> 'CLEAR', 'CACHE', 'UNCACHE', 'FORMATTED', 'DFS', 'TRUNCATE', 'ANALYZE',
>>>>> 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'EXPORT', 'IMPORT',
>>>>> 'LOAD'}(line 1, pos 8)
>>>>>
>>>>> == SQL ==
>>>>> explain codegen select 'a' as a group by 1
>>>>> ^^^
>>>>>
>>>>> Can someone shed light ?
>>>>>
>>>>> Thanks
>>>>>
>>>>
>>>
>>
>


Re: explain codegen

2016-04-04 Thread Ted Yu
Could the error I encountered be due to missing import(s) of implicit ?

Thanks

On Sun, Apr 3, 2016 at 9:42 PM, Reynold Xin <r...@databricks.com> wrote:

> Works for me on latest master.
>
>
>
> scala> sql("explain codegen select 'a' as a group by 1").head
> res3: org.apache.spark.sql.Row =
> [Found 2 WholeStageCodegen subtrees.
> == Subtree 1 / 2 ==
> WholeStageCodegen
> :  +- TungstenAggregate(key=[], functions=[], output=[a#10])
> : +- INPUT
> +- Exchange SinglePartition, None
>+- WholeStageCodegen
>   :  +- TungstenAggregate(key=[], functions=[], output=[])
>   : +- INPUT
>   +- Scan OneRowRelation[]
>
> Generated code:
> /* 001 */ public Object generate(Object[] references) {
> /* 002 */   return new GeneratedIterator(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ /** Codegened pipeline for:
> /* 006 */ * TungstenAggregate(key=[], functions=[], output=[a#10])
> /* 007 */ +- INPUT
> /* 008 */ */
> /* 009 */ final class GeneratedIterator extends
> org.apache.spark.sql.execution.BufferedRowIterator {
> /* 010 */   private Object[] references;
> /* 011 */   ...
>
>
> On Sun, Apr 3, 2016 at 9:38 PM, Jacek Laskowski <ja...@japila.pl> wrote:
>
>> Hi,
>>
>> Looks related to the recent commit...
>>
>> Repository: spark
>> Updated Branches:
>>   refs/heads/master 2262a9335 -> 1f0c5dceb
>>
>> [SPARK-14350][SQL] EXPLAIN output should be in a single cell
>>
>> Jacek
>> 03.04.2016 7:00 PM "Ted Yu" <yuzhih...@gmail.com> napisał(a):
>>
>>> Hi,
>>> Based on master branch refreshed today, I issued 'git clean -fdx' first.
>>>
>>> Then this command:
>>> build/mvn clean -Phive -Phive-thriftserver -Pyarn -Phadoop-2.6
>>> -Dhadoop.version=2.7.0 package -DskipTests
>>>
>>> I got the following error:
>>>
>>> scala>  sql("explain codegen select 'a' as a group by 1").head
>>> org.apache.spark.sql.catalyst.parser.ParseException:
>>> extraneous input 'codegen' expecting {'(', 'SELECT', 'FROM', 'ADD',
>>> 'DESC', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'INSERT', 'DELETE',
>>> 'DESCRIBE', 'EXPLAIN', 'LOGICAL', 'SHOW', 'USE', 'DROP', 'ALTER', 'MAP',
>>> 'SET', 'START', 'COMMIT', 'ROLLBACK', 'REDUCE', 'EXTENDED', 'REFRESH',
>>> 'CLEAR', 'CACHE', 'UNCACHE', 'FORMATTED', 'DFS', 'TRUNCATE', 'ANALYZE',
>>> 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'EXPORT', 'IMPORT',
>>> 'LOAD'}(line 1, pos 8)
>>>
>>> == SQL ==
>>> explain codegen select 'a' as a group by 1
>>> ^^^
>>>
>>> Can someone shed light ?
>>>
>>> Thanks
>>>
>>
>


explain codegen

2016-04-03 Thread Ted Yu
Hi,
Based on master branch refreshed today, I issued 'git clean -fdx' first.

Then this command:
build/mvn clean -Phive -Phive-thriftserver -Pyarn -Phadoop-2.6
-Dhadoop.version=2.7.0 package -DskipTests

I got the following error:

scala>  sql("explain codegen select 'a' as a group by 1").head
org.apache.spark.sql.catalyst.parser.ParseException:
extraneous input 'codegen' expecting {'(', 'SELECT', 'FROM', 'ADD', 'DESC',
'WITH', 'VALUES', 'CREATE', 'TABLE', 'INSERT', 'DELETE', 'DESCRIBE',
'EXPLAIN', 'LOGICAL', 'SHOW', 'USE', 'DROP', 'ALTER', 'MAP', 'SET',
'START', 'COMMIT', 'ROLLBACK', 'REDUCE', 'EXTENDED', 'REFRESH', 'CLEAR',
'CACHE', 'UNCACHE', 'FORMATTED', 'DFS', 'TRUNCATE', 'ANALYZE', 'REVOKE',
'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'EXPORT', 'IMPORT', 'LOAD'}(line 1, pos
8)

== SQL ==
explain codegen select 'a' as a group by 1
^^^

Can someone shed light ?

Thanks


Re: OOM and "spark.buffer.pageSize"

2016-03-28 Thread Ted Yu
I guess you have looked at MemoryManager#pageSizeBytes, where
the "spark.buffer.pageSize" config can override the default page size.

FYI
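
A minimal sketch of how the two settings mentioned in that thread could be wired
up, assuming they are set before the SparkContext is created; the values ("8m",
"400") and the app name are placeholders for illustration, not recommendations:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("oom-tuning-sketch")
  // Overrides the page size otherwise computed in MemoryManager#pageSizeBytes.
  .set("spark.buffer.pageSize", "8m")
  // More shuffle partitions -> a smaller per-task working set for SQL shuffles.
  .set("spark.sql.shuffle.partitions", "400")

val sc = new SparkContext(conf)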

On Mon, Mar 28, 2016 at 12:07 PM, Steve Johnston <
sjohns...@algebraixdata.com> wrote:

> I'm attempting to address an OOM issue. In the thread
> "java.lang.OutOfMemoryError: Unable to acquire bytes of memory"
> <
> http://apache-spark-developers-list.1001551.n3.nabble.com/java-lang-OutOfMemoryError-Unable-to-acquire-bytes-of-memory-td16773.html
> >
> I saw a reference to the configuration setting "spark.buffer.pageSize", which
> was used in conjunction with "spark.sql.shuffle.partitions" to solve the OOM
> problem Nezih was having.
>
> What is "spark.buffer.pageSize"? How can it be used? I can find it in the
> code but there doesn't seem to be any other documentation.
>
> Thanks,
> Steve
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/OOM-and-spark-buffer-pageSize-tp16890.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: BlockManager WARNINGS and ERRORS

2016-03-27 Thread Ted Yu
The warning was added by:

SPARK-12757 Add block-level read/write locks to BlockManager

On Sun, Mar 27, 2016 at 12:24 PM, salexln  wrote:

> HI all,
>
> I started testing my code (https://github.com/salexln/FinalProject_FCM)
> with the latest Spark available in GitHub,
> and when I run it I get the following errors:
>
> *scala> val clusters = FuzzyCMeans.train(parsedData, 2, 20, 2.0)*
>
> 16/03/27 22:24:10 WARN BlockManager: Block rdd_8_0 already exists on this
> machine; not re-adding it
> 16/03/27 22:24:10 WARN BlockManager: Block rdd_8_0 already exists on this
> machine; not re-adding it
> 16/03/27 22:24:10 WARN BlockManager: Block rdd_8_0 already exists on this
> machine; not re-adding it
> 16/03/27 22:24:10 WARN BlockManager: Block rdd_35_0 already exists on this
> machine; not re-adding it
> 16/03/27 22:24:10 WARN BlockManager: Block rdd_8_0 already exists on this
> machine; not re-adding it
> 16/03/27 22:24:10 WARN BlockManager: Block rdd_35_0 already exists on this
> machine; not re-adding it
> 16/03/27 22:24:10 ERROR Executor: 2 block locks were not released by TID =
> 32:
> [rdd_8_0, rdd_35_0]
> 16/03/27 22:24:10 WARN BlockManager: Block rdd_8_0 already exists on this
> machine; not re-adding it
> 16/03/27 22:24:10 WARN BlockManager: Block rdd_35_0 already exists on this
> machine; not re-adding it
> 16/03/27 22:24:10 WARN BlockManager: Block rdd_8_0 already exists on this
> machine; not re-adding it
> 16/03/27 22:24:10 WARN BlockManager: Block rdd_35_0 already exists on this
> machine; not re-adding it
> 16/03/27 22:24:10 ERROR Executor: 2 block locks were not released by TID =
> 35:
> [rdd_8_0, rdd_35_0]
> 16/03/27 22:24:10 WARN BlockManager: Block rdd_8_0 already exists on this
> machine; not re-adding it
> 16/03/27 22:24:10 WARN BlockManager: Block rdd_35_0 already exists on this
> machine; not re-adding it
> 16/03/27 22:24:10 WARN BlockManager: Block rdd_8_0 already exists on this
> machine; not re-adding it
> 16/03/27 22:24:10 WARN BlockManager: Block rdd_35_0 already exists on this
> machine; not re-adding it
> 16/03/27 22:24:10 ERROR Executor: 2 block locks were not released by TID =
> 38:
> [rdd_8_0, rdd_35_0]
> 16/03/27 22:24:10 WARN BlockManager: Block rdd_8_0 already exists on this
> machine; not re-adding it
> 16/03/27 22:24:10 WARN BlockManager: Block rdd_35_0 already exists on this
> machine; not re-adding it
> 16/03/27 22:24:10 WARN BlockManager: Block rdd_8_0 already exists on this
> machine; not re-adding it
> 16/03/27 22:24:10 WARN BlockManager: Block rdd_35_0 already exists on this
> machine; not re-adding it
> 16/03/27 22:24:10 ERROR Executor: 2 block locks were not released by TID =
> 41:
> [rdd_8_0, rdd_35_0]
> 16/03/27 22:24:10 WARN BlockManager: Block rdd_8_0 already exists on this
> machine; not re-adding it
> 16/03/27 22:24:10 WARN BlockManager: Block rdd_35_0 already exists on this
> machine; not re-adding it
> 16/03/27 22:24:10 WARN BlockManager: Block rdd_8_0 already exists on this
> machine; not re-adding it
> 16/03/27 22:24:10 WARN BlockManager: Block rdd_35_0 already exists on this
> machine; not re-adding it
> 16/03/27 22:24:10 ERROR Executor: 2 block locks were not released by TID =
> 44:
> [rdd_8_0, rdd_35_0]
> 16/03/27 22:24:10 WARN BlockManager: Block rdd_8_0 already exists on this
> machine; not re-adding it
> 16/03/27 22:24:10 WARN BlockManager: Block rdd_35_0 already exists on this
> machine; not re-adding it
>
> I did not get these previously, is it something new?
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/BlockManager-WARNINGS-and-ERRORS-tp16878.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: error occurs to compile spark 1.6.1 using scala 2.11.8

2016-03-22 Thread Ted Yu
From the error message, it seems some artifacts from Scala 2.10.4 were left
around.

FYI maven 3.3.9 is required for master branch.

On Tue, Mar 22, 2016 at 3:07 AM, Allen  wrote:

> Hi,
>
> I am facing an error when doing compilation from IDEA, please see the
> attached. I fired the build process by clicking "Rebuild Project" in
> "Build" menu in IDEA IDE.
>
> more info here:
> Spark 1.6.1 + scala 2.11.8 + IDEA 15.0.3 + Maven 3.3.3
>
> I can build spark 1.6.1 with scala 2.10.4 successfully in the same way.
>
>
> Help!
> BR,
> Allen Zhang
>
>
>
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>


Re: Performance improvements for sorted RDDs

2016-03-21 Thread Ted Yu
Do you have performance numbers to back up this proposal for the cogroup
operation?

Thanks
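
Not Spark code, just a minimal illustrative sketch of the single-pass merge the
quoted proposal describes: given two iterators that are both sorted by key under
the same Ordering, a cogroup only ever needs to buffer the records that share the
current key. The names (sortedCogroup, drain) are made up for the example.

import scala.collection.BufferedIterator
import scala.collection.mutable.ArrayBuffer

def sortedCogroup[K, V, W](left: Iterator[(K, V)], right: Iterator[(K, W)])
                          (implicit ord: Ordering[K]): Iterator[(K, (Seq[V], Seq[W]))] = {
  val l = left.buffered
  val r = right.buffered
  new Iterator[(K, (Seq[V], Seq[W]))] {
    def hasNext: Boolean = l.hasNext || r.hasNext
    def next(): (K, (Seq[V], Seq[W])) = {
      // Advance on the smaller of the two head keys, then drain every record
      // carrying that key from each side; nothing else is held in memory.
      val key =
        if (!l.hasNext) r.head._1
        else if (!r.hasNext) l.head._1
        else ord.min(l.head._1, r.head._1)
      def drain[T](it: BufferedIterator[(K, T)]): Seq[T] = {
        val buf = ArrayBuffer.empty[T]
        while (it.hasNext && ord.equiv(it.head._1, key)) buf += it.next()._2
        buf
      }
      (key, (drain(l), drain(r)))
    }
  }
}

Per partition, something like this could be driven via zipPartitions on two
co-partitioned, key-sorted RDDs, which is roughly what the proposed
Option[Ordering] field on RDD would let cogroup detect and exploit automatically.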

On Mon, Mar 21, 2016 at 1:06 AM, JOAQUIN GUANTER GONZALBEZ <
joaquin.guantergonzal...@telefonica.com> wrote:

> Hello devs,
>
>
>
> I have found myself in a situation where Spark is doing sub-optimal
> computations for my RDDs, and I was wondering whether a patch to enable
> improved performance for this scenario would be a welcome addition to Spark
> or not.
>
>
>
> The scenario happens when trying to cogroup two RDDs that are sorted by
> key and share the same partitioner. CoGroupedRDD will correctly detect that
> the RDDs have the same partitioner and will therefore create narrow cogroup
> split dependencies, as opposed to shuffle dependencies. This is great
> because it prevents any shuffling from happening. However, the cogroup is
> unable to detect that the RDDs are sorted in the same way, and will still
> insert all elements of the RDD in a map in order to join the elements with
> the same key.
>
>
>
> When both RDDs are sorted using the same order, the cogroup can just join
> by doing a single pass over the data (since the data is ordered by key, you
> can just keep iterating until you find a different key). This would greatly
> reduce the memory requirements for this kind of operation.
>
>
>
> Adding this to spark would require adding an “ordering” member to RDD of
> type Option[Ordering], similarly to how the “partitioner” field works. That
> way, the sorting operations could populate this field and the operations
> that could benefit from this knowledge (cogroup, join, groupbykey, etc.)
> could read it to change their behavior accordingly.
>
>
>
> Do you think this would be a good addition to Spark?
>
>
>
> Thanks,
>
> Ximo
>
> --
>
> The information contained in this transmission is privileged and
> confidential information intended only for the use of the individual or
> entity named above. If the reader of this message is not the intended
> recipient, you are hereby notified that any dissemination, distribution or
> copying of this communication is strictly prohibited. If you have received
> this transmission in error, do not read it. Please immediately reply to the
> sender that you have received this communication in error and then delete
> it.
>


Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-03-19 Thread Ted Yu
I tried again this morning :

$ wget
https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz
--2016-03-18 07:55:30--
https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz
Resolving s3.amazonaws.com... 54.231.19.163
...
$ tar zxf spark-1.6.1-bin-hadoop2.6.tgz

gzip: stdin: unexpected end of file
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now

On Thu, Mar 17, 2016 at 8:57 AM, Michael Armbrust <mich...@databricks.com>
wrote:

> Patrick reuploaded the artifacts, so it should be fixed now.
> On Mar 16, 2016 5:48 PM, "Nicholas Chammas" <nicholas.cham...@gmail.com>
> wrote:
>
>> Looks like the other packages may also be corrupt. I’m getting the same
>> error for the Spark 1.6.1 / Hadoop 2.4 package.
>>
>>
>> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.4.tgz
>>
>> Nick
>>
>> On Wed, Mar 16, 2016 at 8:28 PM Ted Yu <yuzhih...@gmail.com> wrote:
>>
>>> On Linux, I got:
>>>
>>> $ tar zxf spark-1.6.1-bin-hadoop2.6.tgz
>>>
>>> gzip: stdin: unexpected end of file
>>> tar: Unexpected EOF in archive
>>> tar: Unexpected EOF in archive
>>> tar: Error is not recoverable: exiting now
>>>
>>> On Wed, Mar 16, 2016 at 5:15 PM, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
>>>>
>>>> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz
>>>>
>>>> Does anyone else have trouble unzipping this? How did this happen?
>>>>
>>>> What I get is:
>>>>
>>>> $ gzip -t spark-1.6.1-bin-hadoop2.6.tgz
>>>> gzip: spark-1.6.1-bin-hadoop2.6.tgz: unexpected end of file
>>>> gzip: spark-1.6.1-bin-hadoop2.6.tgz: uncompress failed
>>>>
>>>> Seems like a strange type of problem to come across.
>>>>
>>>> Nick
>>>>
>>>
>>>


Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-03-19 Thread Ted Yu
On Linux, I got:

$ tar zxf spark-1.6.1-bin-hadoop2.6.tgz

gzip: stdin: unexpected end of file
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now

On Wed, Mar 16, 2016 at 5:15 PM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

>
> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz
>
> Does anyone else have trouble unzipping this? How did this happen?
>
> What I get is:
>
> $ gzip -t spark-1.6.1-bin-hadoop2.6.tgz
> gzip: spark-1.6.1-bin-hadoop2.6.tgz: unexpected end of file
> gzip: spark-1.6.1-bin-hadoop2.6.tgz: uncompress failed
>
> Seems like a strange type of problem to come across.
>
> Nick
>

