Re: Slack for PySpark users

2023-03-30 Thread Xiao Li
Hi, Dongjoon,

The other communities (e.g., Pinot, Druid, Flink) created their own Slack
workspaces last year, and I did not see an objection from the ASF board. At the
same time, Slack workspaces are very popular and useful in most non-ASF
open source communities. TBH, we are kind of late. I think we can do the
same in our community.

We can follow the guide once the ASF has an official process for ASF
archiving. Since our PMC is the owner of the Slack workspace, we can make
changes based on that policy. WDYT?

Xiao


Dongjoon Hyun wrote on Thu, Mar 30, 2023 at 09:03:

> Hi, Xiao and all.
>
> (cc Matei)
>
> Please hold on the vote.
>
> There is a concern expressed by ASF board because recent Slack activities
> created an isolated silo outside of ASF mailing list archive.
>
> We need to establish a way to embrace it back to ASF archive before
> starting anything official.
>
> Bests,
> Dongjoon.
>
>
>
> On Wed, Mar 29, 2023 at 11:32 PM Xiao Li  wrote:
>
>> +1
>>
>> + @d...@spark.apache.org 
>>
>> This is a good idea. The other Apache projects (e.g., Pinot, Druid,
>> Flink) have created their own dedicated Slack workspaces for faster
>> communication. We can do the same in Apache Spark. The Slack workspace will
>> be maintained by the Apache Spark PMC. I propose to initiate a vote for the
>> creation of a new Apache Spark Slack workspace. Does that sound good?
>>
>> Cheers,
>>
>> Xiao
>>
>>
>>
>>
>>
>>
>>
>> Mich Talebzadeh wrote on Tue, Mar 28, 2023 at 07:07:
>>
>>> I created one on Slack called pyspark
>>>
>>>
>>> Mich Talebzadeh,
>>> Lead Solutions Architect/Engineering Lead
>>> Palantir Technologies Limited
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Tue, 28 Mar 2023 at 03:52, asma zgolli  wrote:
>>>
>>>> +1, good idea. I'd like to join as well.
>>>>
>>>> On Tue, Mar 28, 2023 at 04:09, Winston Lai
>>>> wrote:
>>>>
>>>>> Please let us know when the channel is created. I'd like to join :)
>>>>>
>>>>> Thank You & Best Regards
>>>>> Winston Lai
>>>>> --
>>>>> *From:* Denny Lee 
>>>>> *Sent:* Tuesday, March 28, 2023 9:43:08 AM
>>>>> *To:* Hyukjin Kwon 
>>>>> *Cc:* keen ; user@spark.apache.org <
>>>>> user@spark.apache.org>
>>>>> *Subject:* Re: Slack for PySpark users
>>>>>
>>>>> +1 I think this is a great idea!
>>>>>
>>>>> On Mon, Mar 27, 2023 at 6:24 PM Hyukjin Kwon 
>>>>> wrote:
>>>>>
>>>>> Yeah, actually I think we had better have a Slack channel so we can
>>>>> easily discuss with users and developers.
>>>>>
>>>>> On Tue, 28 Mar 2023 at 03:08, keen  wrote:
>>>>>
>>>>> Hi all,
>>>>> I really like *Slack* as a communication channel for a tech community.
>>>>> There is a Slack workspace for *delta lake users* (
>>>>> https://go.delta.io/slack) that I enjoy a lot.
>>>>> I was wondering if there is something similar for PySpark users.
>>>>>
>>>>> If not, would there be anything wrong with creating a new Slack
>>>>> workspace for PySpark users (while explicitly mentioning that it is
>>>>> *not* officially part of Apache Spark)?
>>>>>
>>>>> Cheers
>>>>> Martin
>>>>>
>>>>>
>>>>
>>>> --
>>>> Asma ZGOLLI
>>>>
>>>> Ph.D. in Big Data - Applied Machine Learning
>>>>
>>>>


Re: Slack for PySpark users

2023-03-30 Thread Xiao Li
+1

+ @d...@spark.apache.org 

This is a good idea. The other Apache projects (e.g., Pinot, Druid, Flink)
have created their own dedicated Slack workspaces for faster communication.
We can do the same in Apache Spark. The Slack workspace will be maintained
by the Apache Spark PMC. I propose to initiate a vote for the creation of a
new Apache Spark Slack workspace. Does that sound good?

Cheers,

Xiao







Mich Talebzadeh wrote on Tue, Mar 28, 2023 at 07:07:

> I created one on Slack called pyspark
>
>
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies Limited
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 28 Mar 2023 at 03:52, asma zgolli  wrote:
>
>> +1, good idea. I'd like to join as well.
>>
>> On Tue, Mar 28, 2023 at 04:09, Winston Lai
>> wrote:
>>
>>> Please let us know when the channel is created. I'd like to join :)
>>>
>>> Thank You & Best Regards
>>> Winston Lai
>>> --
>>> *From:* Denny Lee 
>>> *Sent:* Tuesday, March 28, 2023 9:43:08 AM
>>> *To:* Hyukjin Kwon 
>>> *Cc:* keen ; user@spark.apache.org <
>>> user@spark.apache.org>
>>> *Subject:* Re: Slack for PySpark users
>>>
>>> +1 I think this is a great idea!
>>>
>>> On Mon, Mar 27, 2023 at 6:24 PM Hyukjin Kwon 
>>> wrote:
>>>
>>> Yeah, actually I think we had better have a Slack channel so we can
>>> easily discuss with users and developers.
>>>
>>> On Tue, 28 Mar 2023 at 03:08, keen  wrote:
>>>
>>> Hi all,
>>> I really like *Slack* as a communication channel for a tech community.
>>> There is a Slack workspace for *delta lake users* (
>>> https://go.delta.io/slack) that I enjoy a lot.
>>> I was wondering if there is something similar for PySpark users.
>>>
>>> If not, would there be anything wrong with creating a new Slack
>>> workspace for PySpark users (while explicitly mentioning that it is
>>> *not* officially part of Apache Spark)?
>>>
>>> Cheers
>>> Martin
>>>
>>>
>>
>> --
>> Asma ZGOLLI
>>
>> Ph.D. in Big Data - Applied Machine Learning
>>
>>


Stickers and Swag

2022-06-14 Thread Xiao Li
Hi, all,

The ASF has an official store at RedBubble that Apache Community
Development (ComDev) runs. If you are interested in buying Spark Swag, 70
products featuring the Spark logo are available:
https://www.redbubble.com/shop/ap/113203780

Go Spark!

Xiao


Re: [ANNOUNCE] Apache Spark 3.2.0

2021-10-19 Thread Xiao Li
Thank you, Gengliang!

Congrats to our community and all the contributors!

Xiao

Henrik Peng wrote on Tue, Oct 19, 2021 at 8:26 AM:

> Congrats and thanks!
>
>
> Gengliang Wang wrote on Tue, Oct 19, 2021 at 10:16 PM:
>
>> Hi all,
>>
>> Apache Spark 3.2.0 is the third release of the 3.x line. With tremendous
>> contribution from the open-source community, this release managed to
>> resolve in excess of 1,700 Jira tickets.
>>
>> We'd like to thank our contributors and users for their contributions and
>> early feedback to this release. This release would not have been possible
>> without you.
>>
>> To download Spark 3.2.0, head over to the download page:
>> https://spark.apache.org/downloads.html
>>
>> To view the release notes:
>> https://spark.apache.org/releases/spark-release-3-2-0.html
>>
>


Re: [ANNOUNCE] Apache Spark 3.1.2 released

2021-06-01 Thread Xiao Li
Thank you!

Xiao

On Tue, Jun 1, 2021 at 9:29 PM Hyukjin Kwon  wrote:

> awesome!
>
> On Wed, Jun 2, 2021 at 9:59 AM, Dongjoon Hyun wrote:
>
>> We are happy to announce the availability of Spark 3.1.2!
>>
>> Spark 3.1.2 is a maintenance release containing stability fixes. This
>> release is based on the branch-3.1 maintenance branch of Spark. We
>> strongly
>> recommend all 3.1 users to upgrade to this stable release.
>>
>> To download Spark 3.1.2, head over to the download page:
>> https://spark.apache.org/downloads.html
>>
>> To view the release notes:
>> https://spark.apache.org/releases/spark-release-3-1-2.html
>>
>> We would like to acknowledge all community members for contributing to
>> this
>> release. This release would not have been possible without you.
>>
>> Dongjoon Hyun
>>
>

--


Re: Spark and Bintray's shutdown

2021-04-12 Thread Xiao Li
Not all the Spark packages in https://spark-packages.org/ are eligible for
Maven Central. We are looking for a replacement for Bintray for
spark-packages.org.

Bo Zhang is actively working on this. Bo, can you share your ideas with the
community?

Cheers,

Xiao

On Mon, Apr 12, 2021 at 9:28 AM Sean Owen  wrote:

> Spark itself is distributed via Maven Central primarily, so I don't think
> it will be affected?
>
> On Mon, Apr 12, 2021 at 11:22 AM Florian CASTELAIN <
> florian.castel...@redlab.io> wrote:
>
>> Hello.
>>
>>
>>
>> Bintray will shut down on the first of May.
>>
>>
>>
>> I just saw that packages are hosted on Bintray (which is actually down
>> for maintenance).
>>
>>
>>
>> What will happen after the first of May? Is there any maintenance to do in
>> projects to still be able to download Spark dependencies?
>>
>>
>>
>> Regards !
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> *Florian CASTELAIN *
>> *Software Engineer*
>>
>> 72 Rue de la République, 76140 Le Petit-Quevilly
>> m: +33 616 530 226
>> e: florian.castel...@redlab.io w: www.redlab.io
>>
>>
>>
>>
>>
>

--


Re: [UPDATE] Apache Spark 3.1.0 Release Window

2020-10-12 Thread Xiao Li
Thank you, Dongjoon

Xiao

On Mon, Oct 12, 2020 at 4:19 PM Dongjoon Hyun 
wrote:

> Hi, All.
>
> The Apache Spark 3.1.0 release window is adjusted as follows today.
> Please check the latest information on the official website.
>
> -
> https://github.com/apache/spark-website/commit/0cd0bdc80503882b4737db7e77cc8f9d17ec12ca
> - https://spark.apache.org/versioning-policy.html
>
> Bests,
> Dongjoon.
>


--


Re: Apache Spark 3.1 Preparation Status (Oct. 2020)

2020-10-04 Thread Xiao Li
As pointed out by Dongjoon, the second half of December is the holiday season
in most countries. If we do the code freeze in mid-November and release the
first RC in mid-December, I am afraid the community will not be active enough to
verify the release candidates during the holiday season. Normally, the RC
stage is the most critical period for detecting defects and unexpected
behavior changes. Thus, starting the RC next January might be a good
option IMHO.

Cheers,

Xiao


Igor Dvorzhak wrote on Sun, Oct 4, 2020 at 10:35 PM:

> Why move the code freeze to early December? It seems that even according
> to the changed release cadence, the code freeze should happen in
> mid-November.
>
> On Sun, Oct 4, 2020 at 6:26 PM Xiao Li  wrote:
>
>> Apache Spark 3.1.0 should be compared with Apache Spark 2.1.0.
>>
>>
>> I think we made a change in release cadence since Spark 2.3. See the
>> commit:
>> https://github.com/apache/spark-website/commit/88990968962e5cc47db8bc2c11a50742d2438daa
>> Thus, Spark 3.1 might just follow the release cadence of Spark 2.3/2.4, if
>> we do not want to change the release cadence?
>>
>> How about moving the code freeze of Spark 3.1 to *Early Dec 2020* and
>> the RC1 date to *Early Jan 2021*?
>>
>> Thanks,
>>
>> Xiao
>>
>>
>> Dongjoon Hyun wrote on Sun, Oct 4, 2020 at 12:44 PM:
>>
>>> For Xiao's comment, I want to point out that Apache Spark 3.1.0 is
>>> different from 2.3 or 2.4.
>>>
>>> Apache Spark 3.1.0 should be compared with Apache Spark 2.1.0.
>>>
>>> - Apache Spark 2.0.0 was released on July 26, 2016.
>>> - Apache Spark 2.1.0 was released on December 28, 2016.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>> On Sun, Oct 4, 2020 at 10:53 AM Dongjoon Hyun 
>>> wrote:
>>>
>>>> Thank you all.
>>>>
>>>> BTW, Xiao and Mridul, I'm wondering what date you have in your mind
>>>> specifically.
>>>>
>>>> Usually, `Christmas and New Year season` doesn't give us much
>>>> additional time.
>>>>
>>>> If you think so, could you make a PR for Apache Spark website
>>>> according to your expectation?
>>>>
>>>> https://spark.apache.org/versioning-policy.html
>>>>
>>>> Bests,
>>>> Dongjoon.
>>>>
>>>>
>>>> On Sun, Oct 4, 2020 at 7:18 AM Mridul Muralidharan 
>>>> wrote:
>>>>
>>>>>
>>>>> +1 on pushing the branch cut for increased dev time to match previous
>>>>> releases.
>>>>>
>>>>> Regards,
>>>>> Mridul
>>>>>
>>>>> On Sat, Oct 3, 2020 at 10:22 PM Xiao Li  wrote:
>>>>>
>>>>>> Thank you for your updates.
>>>>>>
>>>>>> Spark 3.0 got released on Jun 18, 2020. If Nov 1st is the target date
>>>>>> of the 3.1 branch cut, the feature development time window is less than 5
>>>>>> months. This is shorter than what we did in Spark 2.3 and 2.4 releases.
>>>>>>
>>>>>> Below are three highly desirable pieces of feature work that I am
>>>>>> watching. Hopefully, we can finish them before the branch cut.
>>>>>>
>>>>>>- Support push-based shuffle to improve shuffle efficiency:
>>>>>>https://issues.apache.org/jira/browse/SPARK-30602
>>>>>>- Unify create table syntax:
>>>>>>https://issues.apache.org/jira/browse/SPARK-31257
>>>>>>- Bloom filter join:
>>>>>>https://issues.apache.org/jira/browse/SPARK-32268
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Xiao
>>>>>>
>>>>>>
>>>>>> Hyukjin Kwon wrote on Sat, Oct 3, 2020 at 5:41 PM:
>>>>>>
>>>>>>> Nice summary. Thanks Dongjoon. One minor correction -> I believe we
>>>>>>> dropped R 3.5 and below at branch 2.4 as well.
>>>>>>>
>>>>>>> On Sun, 4 Oct 2020, 09:17 Dongjoon Hyun, 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi, All.
>>>>>>>>
>>>>>>>> As of today, master branch (Apache Spark 3.1.0) resolved
>>>>>>>> 852+ JIRA issues and 606+ issues are 3.1.0-only patches.
>>>>>>>> According to the 3.1.0 release window, branch-3.1 will be
>>>>>>&

Re: Apache Spark 3.1 Preparation Status (Oct. 2020)

2020-10-04 Thread Xiao Li
>
> Apache Spark 3.1.0 should be compared with Apache Spark 2.1.0.


I think we made a change in release cadence since Spark 2.3. See the
commit:
https://github.com/apache/spark-website/commit/88990968962e5cc47db8bc2c11a50742d2438daa
Thus, Spark 3.1 might just follow the release cadence of Spark 2.3/2.4, if
we do not want to change the release cadence?

How about moving the code freeze of Spark 3.1 to *Early Dec 2020* and the
RC1 date to *Early Jan 2021*?

Thanks,

Xiao


Dongjoon Hyun wrote on Sun, Oct 4, 2020 at 12:44 PM:

> For Xiao's comment, I want to point out that Apache Spark 3.1.0 is
> different from 2.3 or 2.4.
>
> Apache Spark 3.1.0 should be compared with Apache Spark 2.1.0.
>
> - Apache Spark 2.0.0 was released on July 26, 2016.
> - Apache Spark 2.1.0 was released on December 28, 2016.
>
> Bests,
> Dongjoon.
>
>
> On Sun, Oct 4, 2020 at 10:53 AM Dongjoon Hyun 
> wrote:
>
>> Thank you all.
>>
>> BTW, Xiao and Mridul, I'm wondering what date you have in your mind
>> specifically.
>>
>> Usually, `Christmas and New Year season` doesn't give us much additional
>> time.
>>
>> If you think so, could you make a PR for Apache Spark website according
>> to your expectation?
>>
>> https://spark.apache.org/versioning-policy.html
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Sun, Oct 4, 2020 at 7:18 AM Mridul Muralidharan 
>> wrote:
>>
>>>
>>> +1 on pushing the branch cut for increased dev time to match previous
>>> releases.
>>>
>>> Regards,
>>> Mridul
>>>
>>> On Sat, Oct 3, 2020 at 10:22 PM Xiao Li  wrote:
>>>
>>>> Thank you for your updates.
>>>>
>>>> Spark 3.0 got released on Jun 18, 2020. If Nov 1st is the target date
>>>> of the 3.1 branch cut, the feature development time window is less than 5
>>>> months. This is shorter than what we did in Spark 2.3 and 2.4 releases.
>>>>
>>>> Below are three highly desirable pieces of feature work that I am
>>>> watching. Hopefully, we can finish them before the branch cut.
>>>>
>>>>- Support push-based shuffle to improve shuffle efficiency:
>>>>https://issues.apache.org/jira/browse/SPARK-30602
>>>>- Unify create table syntax:
>>>>https://issues.apache.org/jira/browse/SPARK-31257
>>>>- Bloom filter join:
>>>>https://issues.apache.org/jira/browse/SPARK-32268
>>>>
>>>> Thanks,
>>>>
>>>> Xiao
>>>>
>>>>
>>>> Hyukjin Kwon wrote on Sat, Oct 3, 2020 at 5:41 PM:
>>>>
>>>>> Nice summary. Thanks Dongjoon. One minor correction -> I believe we
>>>>> dropped R 3.5 and below at branch 2.4 as well.
>>>>>
>>>>> On Sun, 4 Oct 2020, 09:17 Dongjoon Hyun, 
>>>>> wrote:
>>>>>
>>>>>> Hi, All.
>>>>>>
>>>>>> As of today, master branch (Apache Spark 3.1.0) resolved
>>>>>> 852+ JIRA issues and 606+ issues are 3.1.0-only patches.
>>>>>> According to the 3.1.0 release window, branch-3.1 will be
>>>>>> created on November 1st and enters QA period.
>>>>>>
>>>>>> Here are some notable updates I've been monitoring.
>>>>>>
>>>>>> *Language*
>>>>>> 01. SPARK-25075 Support Scala 2.13
>>>>>>   - Since SPARK-32926, Scala 2.13 build test has
>>>>>> become a part of GitHub Action jobs.
>>>>>>   - After SPARK-33044, Scala 2.13 test will be
>>>>>> a part of Jenkins jobs.
>>>>>> 02. SPARK-29909 Drop Python 2 and Python 3.4 and 3.5
>>>>>> 03. SPARK-32082 Project Zen: Improving Python usability
>>>>>>   - 7 of 16 issues are resolved.
>>>>>> 04. SPARK-32073 Drop R < 3.5 support
>>>>>>   - This is done for Spark 3.0.1 and 3.1.0.
>>>>>>
>>>>>> *Dependency*
>>>>>> 05. SPARK-32058 Use Apache Hadoop 3.2.0 dependency
>>>>>>   - This changes the default dist. for better cloud support
>>>>>> 06. SPARK-32981 Remove hive-1.2 distribution
>>>>>> 07. SPARK-20202 Remove references to org.spark-project.hive
>>>>>>   - This will remove Hive 1.2.1 from source code
>>>>>> 08. SPARK-29250 Upgrade to Hadoop 3.2.1 (WIP)
>>>>>>
>>>>>> *Core*
>>>>>>

Re: Apache Spark 3.1 Preparation Status (Oct. 2020)

2020-10-03 Thread Xiao Li
Thank you for your updates.

Spark 3.0 got released on Jun 18, 2020. If Nov 1st is the target date of
the 3.1 branch cut, the feature development time window is less than 5
months. This is shorter than what we did in Spark 2.3 and 2.4 releases.

Below are three highly desirable pieces of feature work that I am watching.
Hopefully, we can finish them before the branch cut.

   - Support push-based shuffle to improve shuffle efficiency:
   https://issues.apache.org/jira/browse/SPARK-30602
   - Unify create table syntax:
   https://issues.apache.org/jira/browse/SPARK-31257
   - Bloom filter join: https://issues.apache.org/jira/browse/SPARK-32268

Thanks,

Xiao


Hyukjin Kwon wrote on Sat, Oct 3, 2020 at 5:41 PM:

> Nice summary. Thanks Dongjoon. One minor correction -> I believe we
> dropped R 3.5 and below at branch 2.4 as well.
>
> On Sun, 4 Oct 2020, 09:17 Dongjoon Hyun,  wrote:
>
>> Hi, All.
>>
>> As of today, master branch (Apache Spark 3.1.0) resolved
>> 852+ JIRA issues and 606+ issues are 3.1.0-only patches.
>> According to the 3.1.0 release window, branch-3.1 will be
>> created on November 1st and enters QA period.
>>
>> Here are some notable updates I've been monitoring.
>>
>> *Language*
>> 01. SPARK-25075 Support Scala 2.13
>>   - Since SPARK-32926, Scala 2.13 build test has
>> become a part of GitHub Action jobs.
>>   - After SPARK-33044, Scala 2.13 test will be
>> a part of Jenkins jobs.
>> 02. SPARK-29909 Drop Python 2 and Python 3.4 and 3.5
>> 03. SPARK-32082 Project Zen: Improving Python usability
>>   - 7 of 16 issues are resolved.
>> 04. SPARK-32073 Drop R < 3.5 support
>>   - This is done for Spark 3.0.1 and 3.1.0.
>>
>> *Dependency*
>> 05. SPARK-32058 Use Apache Hadoop 3.2.0 dependency
>>   - This changes the default dist. for better cloud support
>> 06. SPARK-32981 Remove hive-1.2 distribution
>> 07. SPARK-20202 Remove references to org.spark-project.hive
>>   - This will remove Hive 1.2.1 from source code
>> 08. SPARK-29250 Upgrade to Hadoop 3.2.1 (WIP)
>>
>> *Core*
>> 09. SPARK-27495 Support Stage level resource conf and scheduling
>>   - 11 of 15 issues are resolved
>> 10. SPARK-25299 Use remote storage for persisting shuffle data
>>   - 8 of 14 issues are resolved
>>
>> *Resource Manager*
>> 11. SPARK-33005 Kubernetes GA preparation
>>   - It is on the way and we are waiting for more feedback.
>>
>> *SQL*
>> 12. SPARK-30648/SPARK-32346 Support filters pushdown
>>   to JSON/Avro
>> 13. SPARK-32948/SPARK-32958 Add Json expression optimizer
>> 14. SPARK-12312 Support JDBC Kerberos w/ keytab
>>   - 11 of 17 issues are resolved
>> 15. SPARK-27589 DSv2 was mostly completed in 3.0
>>   and added more features in 3.1 but still we missed
>>   - All built-in DataSource v2 write paths are disabled
>> and v1 write is used instead.
>>   - Support partition pruning with subqueries
>>   - Support bucketing
>>
>> We still have one month before the feature freeze
>> and starting QA. If you are working for 3.1,
>> please consider the timeline and share your schedule
>> with the Apache Spark community. For the other stuff,
>> we can put it into 3.2 release scheduled in June 2021.
>>
>> Last but not least, I want to emphasize (7) once again.
>> We need to remove the forked unofficial Hive eventually.
>> Please let us know your reasons if you need to build
>> from Apache Spark 3.1 source code for Hive 1.2.
>>
>> https://github.com/apache/spark/pull/29936
>>
>> As I wrote in the above PR description, for old releases,
>> Apache Spark 2.4(LTS) and 3.0 (~2021.12) will provide
>> Hive 1.2-based distribution.
>>
>> Bests,
>> Dongjoon.
>>
>


Re: Spark UI

2020-07-19 Thread Xiao Li
https://spark.apache.org/docs/3.0.0/web-ui.html is the official doc
for Spark UI.

Xiao

On Sun, Jul 19, 2020 at 1:38 PM venkatadevarapu 
wrote:

> Hi,
>
> I'm looking for a tutorial/video/material which explains the content of
> the various tabs in the Spark web UI.
> Can someone direct me to the relevant info?
>
> Thanks
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
>
>

-- 



Re: How to disable pushdown predicate in spark 2.x query

2020-06-22 Thread Xiao Li
Just turn off the JDBC option pushDownPredicate, which was introduced in
Spark 2.4. https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html

Xiao
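
As a rough Scala sketch added for illustration (not part of the original reply):
turning that option off on the reader looks like the following, where the URL,
table, and credentials are placeholders and a SparkSession named spark is
assumed to be in scope.

  // Hedged sketch (Spark 2.4+): disable JDBC predicate pushdown so filters
  // are applied by Spark after the rows are fetched from the database.
  val url = "jdbc:oracle:thin:@//dbhost:1521/service"   // placeholder URL
  val jdbcDF = spark.read
    .format("jdbc")
    .option("url", url)
    .option("dbtable", "schema.my_table")               // placeholder table
    .option("user", "some_user")                        // placeholder credentials
    .option("password", "some_pwd")
    .option("pushDownPredicate", "false")               // option introduced in Spark 2.4
    .load()
    .where("some_column IS NOT NULL")                   // now evaluated by Spark, not the DB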

On Mon, Jun 22, 2020 at 11:36 AM Mohit Durgapal 
wrote:

> Hi All,
>
> I am trying to read a table of a relational database using spark 2.x.
>
> I am using code like the following:
>
> sparkContext.read().jdbc(url, table ,
> connectionProperties).select('SELECT_COLUMN').where(whereClause);
>
>
> Now, what's happening is that the SQL query Spark is actually running
> against the relational DB is:
>
> select column,(where_clause_columns) from table WHERE SELECT_COLUMN IS NOT
> NULL;
>
> And I guess it is doing filtering based on the where clause only after
> fetching all the data from DB where SELECT_COLUMN IS NOT NULL.
>
> I searched about it and found out this is because of the pushdown predicate.
> Is there a way to load data into a DataFrame using a specific query instead
> of this?
>
> I found a solution where, if we provide the actual query instead of the
> table name in the following code, it should run that query exactly:
>
> table = "select SELECT_COLUMN from table  "+ whereClause;
> sparkContext.read().jdbc(url, table ,
> connectionProperties).select('SELECT_COLUMN').where(whereClause);
>
>
> Does the above seem like a good solution?
>
>
> Regards,
> Mohit
>


-- 



Re: Different execution results with wholestage codegen on and off

2020-05-27 Thread Xiao Li
Thanks for reporting it. Please open a JIRA with a test case.

Cheers,

Xiao
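
As a hedged aside added for illustration (not part of the original reply): a
self-contained test case for such a JIRA usually runs the same query with
whole-stage codegen toggled via the standard spark.sql.codegen.wholeStage
config; the DataFrame below is a made-up placeholder.

  // Hedged sketch: compare results with whole-stage codegen on and off.
  val df = spark.range(0, 10).selectExpr("id", "id % 3 AS k")
  spark.conf.set("spark.sql.codegen.wholeStage", "true")
  val withCodegen = df.groupBy("k").count().collect().toSet
  spark.conf.set("spark.sql.codegen.wholeStage", "false")
  val withoutCodegen = df.groupBy("k").count().collect().toSet
  assert(withCodegen == withoutCodegen)   // a mismatch here is the reported bug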

On Wed, May 27, 2020 at 1:42 PM Pasha Finkelshteyn <
pavel.finkelsht...@gmail.com> wrote:

> Hi folks,
>
> I'm implementing Kotlin bindings for Spark and faced a strange problem. In
> one corner case Spark works differently when whole-stage codegen is on or
> off.
>
> Does it look like a bug or expected behavior?
> --
> Regards,
> Pasha
>
> Big Data Tools @ JetBrains
>


-- 



Re: Why were changes of SPARK-9241 removed?

2020-03-12 Thread Xiao Li
I do not think we intentionally dropped it. Could you open a ticket in
Spark JIRA with your query?

Cheers,

Xiao
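
As a hedged aside added for illustration (not part of the original reply): a
minimal repro attached to such a ticket usually pairs a multi-column count
distinct query with its optimized plan; the data below is synthetic and purely
illustrative.

  // Hedged sketch: a multi-column COUNT(DISTINCT ...) whose plan can be
  // inspected to see how the distinct-aggregate rewrite behaves.
  val t = spark.range(0, 1000000).selectExpr("id % 100 AS a", "id % 10 AS b")
  t.createOrReplaceTempView("t")
  spark.sql("SELECT COUNT(DISTINCT a), COUNT(DISTINCT b) FROM t").explain(true)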

On Thu, Mar 12, 2020 at 8:24 PM 马阳阳  wrote:

> Hi,
> I wonder why the changes made in "[SPARK-9241][SQL] Supporting multiple
> DISTINCT columns (2) - Rewriting Rule" are not present in Spark (version
> 2.4) now. This makes execution of count distinct in Spark much slower than
> in Spark 1.6 and Hive (Spark 2.4.4: more than 18 minutes; Hive: about 80s;
> Spark 1.6: about 3 minutes).
>
>
> --
> Sent from Postbox 
>


-- 



Re: Spark 2.4.4 having worse performance than 2.4.2 when running the same code [pyspark][sql]

2020-01-15 Thread Xiao Li
If you can confirm that this is caused by Apache Spark, feel free to open a
JIRA. I do not expect your queries to hit such a major performance
regression in any release. Also, please try the 3.0 preview releases.

Thanks,

Xiao

Kalin Stoyanov wrote on Wed, Jan 15, 2020 at 10:53 AM:

> Hi Xiao,
>
> Thanks, I didn't know that. This
> https://aws.amazon.com/about-aws/whats-new/2019/11/announcing-emr-runtime-for-apache-spark/
> implies that their fork is not used in emr 5.27. I tried that and it has
> the same issue. But then again in their article they were comparing emr
> 5.27 vs 5.16 so I can't be sure... Maybe I'll try getting the latest
> version of Spark locally and make the comparison that way.
>
> Regards,
> Kalin
>
> On Wed, Jan 15, 2020 at 7:58 PM Xiao Li  wrote:
>
>> EMR has its own fork of Spark, called the EMR runtime. It is not
>> Apache Spark. You might need to talk with them instead of posting questions
>> in the Apache Spark community.
>>
>> Cheers,
>>
>> Xiao
>>
>> Kalin Stoyanov wrote on Wed, Jan 15, 2020 at 9:53 AM:
>>
>>> Hi all,
>>>
>>> First of all let me say that I am pretty new to Spark so this could be
>>> entirely my fault somehow...
>>> I noticed this when I was running a job on an amazon emr cluster with
>>> Spark 2.4.4, and it got done slower than when I had run it locally (on
>>> Spark 2.4.1). I checked out the event logs, and the one from the newer
>>> version had more stages.
>>> Then I decided to do a comparison in the same environment so I created
>>> the two versions of the same cluster with the only difference being the emr
>>> release, and hence the spark version(?) - first one was emr-5.24.1 with
>>> Spark 2.4.2, and the second one - emr-5.28.0 with Spark 2.4.4. Sure enough,
>>> the same thing happened with the newer version having more stages and
>>> taking almost twice as long to finish.
>>> So I am pretty much at a loss here - could it be that it is not because
>>> of spark itself, but because of some difference introduced in the emr
>>> releases? At the moment I can't think of any other alternative besides it
>>> being a bug...
>>>
>>> Here are the two event logs:
>>>
>>> https://drive.google.com/drive/folders/12pNc5uqhHtCoeCO3nHS3eQ3X7cFzUAQL?usp=sharing
>>> and my code is here:
>>> https://github.com/kgskgs/stars-spark3d
>>>
>>> I ran it like so on the clusters (after putting it on s3):
>>> spark-submit --deploy-mode cluster --py-files
>>> s3://kgs-s3/scripts/utils.py,s3://kgs-s3/scripts/interactions.py,s3://kgs-s3/scripts/schemas.py
>>> --name sim100_dt100_spark242 s3://kgs-s3/scripts/main.py 100 100
>>> --outputDir s3://kgs-s3/output/ --inputDir s3://kgs-s3/input/
>>>
>>> So yeah I was considering submitting a bug report, but in the guide it
>>> said it's better to ask here first, so any ideas on what's going on? Maybe
>>> I am missing something?
>>>
>>> Regards,
>>> Kalin
>>>
>>


Re: Spark 2.4.4 having worse performance than 2.4.2 when running the same code [pyspark][sql]

2020-01-15 Thread Xiao Li
EMR has its own fork of Spark, called the EMR runtime. It is not
Apache Spark. You might need to talk with them instead of posting questions
in the Apache Spark community.

Cheers,

Xiao

Kalin Stoyanov wrote on Wed, Jan 15, 2020 at 9:53 AM:

> Hi all,
>
> First of all let me say that I am pretty new to Spark so this could be
> entirely my fault somehow...
> I noticed this when I was running a job on an amazon emr cluster with
> Spark 2.4.4, and it got done slower than when I had run it locally (on
> Spark 2.4.1). I checked out the event logs, and the one from the newer
> version had more stages.
> Then I decided to do a comparison in the same environment so I created the
> two versions of the same cluster with the only difference being the emr
> release, and hence the spark version(?) - first one was emr-5.24.1 with
> Spark 2.4.2, and the second one - emr-5.28.0 with Spark 2.4.4. Sure enough,
> the same thing happened with the newer version having more stages and
> taking almost twice as long to finish.
> So I am pretty much at a loss here - could it be that it is not because of
> spark itself, but because of some difference introduced in the emr
> releases? At the moment I can't think of any other alternative besides it
> being a bug...
>
> Here are the two event logs:
>
> https://drive.google.com/drive/folders/12pNc5uqhHtCoeCO3nHS3eQ3X7cFzUAQL?usp=sharing
> and my code is here:
> https://github.com/kgskgs/stars-spark3d
>
> I ran it like so on the clusters (after putting it on s3):
> spark-submit --deploy-mode cluster --py-files
> s3://kgs-s3/scripts/utils.py,s3://kgs-s3/scripts/interactions.py,s3://kgs-s3/scripts/schemas.py
> --name sim100_dt100_spark242 s3://kgs-s3/scripts/main.py 100 100
> --outputDir s3://kgs-s3/output/ --inputDir s3://kgs-s3/input/
>
> So yeah I was considering submitting a bug report, but in the guide it
> said it's better to ask here first, so any ideas on what's going on? Maybe
> I am missing something?
>
> Regards,
> Kalin
>


Re: Why Apache Spark doesn't use Calcite?

2020-01-15 Thread Xiao Li
In the upcoming Spark 3.0, we introduced a new framework for Adaptive Query
Execution in Catalyst. This can adjust the plans based on the runtime
statistics. This is missing in Calcite based on my understanding.

Catalyst is also very easy to enhance. We also use the dynamic programming
approach in our cost-based join reordering. If needed, in the future, we
also can improve the existing CBO and make it more general. The paper of
Spark SQL was published 5 years ago. A lot of great contributions were made
in the past 5 years.

Cheers,

Xiao
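
As a hedged illustration added here (not part of the original reply), the
adaptive and cost-based features mentioned above are switched on per session;
the flag names below are the standard Spark SQL configs.

  // Hedged sketch: enable Adaptive Query Execution (Spark 3.0, off by default).
  spark.conf.set("spark.sql.adaptive.enabled", "true")
  // Cost-based optimizer and its dynamic-programming join reordering.
  spark.conf.set("spark.sql.cbo.enabled", "true")
  spark.conf.set("spark.sql.cbo.joinReorder.enabled", "true")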

Debajyoti Roy wrote on Wed, Jan 15, 2020 at 9:23 AM:

> Thanks all, and Matei.
>
> TL;DR of the conclusion for my particular case:
> Qualitatively, while Catalyst[1] tries to mitigate learning curve and
> maintenance burden, it lacks the dynamic programming approach used by
> Calcite[2] and risks falling into local minima.
> Quantitatively, there is no reproducible benchmark, that fairly compares
> Optimizer frameworks, apples to apples (excluding execution).
>
> References:
> [1] -
> https://amplab.cs.berkeley.edu/wp-content/uploads/2015/03/SparkSQLSigmod2015.pdf
> [2] - https://arxiv.org/pdf/1802.10233.pdf
>
> On Mon, Jan 13, 2020 at 5:37 PM Matei Zaharia 
> wrote:
>
>> I’m pretty sure that Catalyst was built before Calcite, or at least in
>> parallel. Calcite 1.0 was only released in 2015. From a technical
>> standpoint, building Catalyst in Scala also made it more concise and easier
>> to extend than an optimizer written in Java (you can find various
>> presentations about how Catalyst works).
>>
>> Matei
>>
>> > On Jan 13, 2020, at 8:41 AM, Michael Mior  wrote:
>> >
>> > It's fairly common for adapters (Calcite's abstraction of a data
>> > source) to push down predicates. However, the API certainly looks a
>> > lot different than Catalyst's.
>> > --
>> > Michael Mior
>> > mm...@apache.org
>> >
>> > On Mon, Jan 13, 2020 at 09:45, Jason Nerothin
>> > wrote:
>> >>
>> >> The implementation they chose supports push down predicates, Datasets
>> and other features that are not available in Calcite:
>> >>
>> >> https://databricks.com/glossary/catalyst-optimizer
>> >>
>> >> On Mon, Jan 13, 2020 at 8:24 AM newroyker  wrote:
>> >>>
>> >>> Was there a qualitative or quantitative benchmark done before a design
>> >>> decision was made not to use Calcite?
>> >>>
>> >>> Are there limitations (for heuristic based, cost based, * aware
>> optimizer)
>> >>> in Calcite, and frameworks built on top of Calcite? In the context of
>> big
>> >>> data / TCPH benchmarks.
>> >>>
>> >>> I was unable to dig up anything concrete from user group / Jira.
>> Appreciate
>> >>> if any Catalyst veteran here can give me pointers. Trying to defend
>> >>> Spark/Catalyst.
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>> >>>
>> >>>
>> >>
>> >>
>> >> --
>> >> Thanks,
>> >> Jason
>> >
>> >
>>
>>


Re: Fail to use SparkR of 3.0 preview 2

2020-01-07 Thread Xiao Li
We can use R version 3.6.1 if we have concerns about the quality of 3.6.2.

On Thu, Dec 26, 2019 at 8:14 PM Hyukjin Kwon  wrote:

> I was randomly googling out of curiosity, and it seems that's indeed the
> problem (
> https://r.789695.n4.nabble.com/Error-in-rbind-info-getNamespaceInfo-env-quot-S3methods-quot-td4755490.html
> ).
> Yes, it seems we should make sure we build SparkR with an older version.
> Since support for R prior to version 3.4 is deprecated as of Spark
> 3.0.0, we could use either R 3.4 or match Jenkins's (R 3.1 IIRC) for the
> Spark 3.0 release.
>
> Redirecting to a dev list and Yuming as well for visibility.
>
> On Fri, Dec 27, 2019 at 12:02 PM, Jeff Zhang wrote:
>
>> Yes, I guess so. But R 3.6.2 was just released this month; I think we
>> should use an older version to build SparkR.
>>
>> Felix Cheung wrote on Fri, Dec 27, 2019 at 10:43 AM:
>>
>>> Maybe it's the reverse - the package is built to run on the latest R but
>>> is not compatible with slightly older versions (3.5.2 was Dec 2018).
>>>
>>> --
>>> *From:* Jeff Zhang 
>>> *Sent:* Thursday, December 26, 2019 5:36:50 PM
>>> *To:* Felix Cheung 
>>> *Cc:* user.spark 
>>> *Subject:* Re: Fail to use SparkR of 3.0 preview 2
>>>
>>> I use R 3.5.2
>>>
>>> Felix Cheung wrote on Fri, Dec 27, 2019 at 4:32 AM:
>>>
>>> It looks like a change in the method signature in R base packages.
>>>
>>> Which version of R are you running on?
>>>
>>> --
>>> *From:* Jeff Zhang 
>>> *Sent:* Thursday, December 26, 2019 12:46:12 AM
>>> *To:* user.spark 
>>> *Subject:* Fail to use SparkR of 3.0 preview 2
>>>
>>> I tried SparkR of spark 3.0 preview 2, but hit the following issue.
>>>
>>> Error in rbind(info, getNamespaceInfo(env, "S3methods")) :
>>>   number of columns of matrices must match (see arg 2)
>>> Error: package or namespace load failed for ‘SparkR’ in rbind(info,
>>> getNamespaceInfo(env, "S3methods")):
>>>  number of columns of matrices must match (see arg 2)
>>> During startup - Warning messages:
>>> 1: package ‘SparkR’ was built under R version 3.6.2
>>> 2: package ‘SparkR’ in options("defaultPackages") was not found
>>>
>>> Does anyone know what might be wrong ? Thanks
>>>
>>>
>>>
>>> --
>>> Best Regards
>>>
>>> Jeff Zhang
>>>
>>>
>>>
>>> --
>>> Best Regards
>>>
>>> Jeff Zhang
>>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>

-- 



Re: [ANNOUNCE] Announcing Apache Spark 3.0.0-preview2

2019-12-24 Thread Xiao Li
Thank you all. Happy Holidays!

Xiao

On Tue, Dec 24, 2019 at 12:53 PM Yuming Wang  wrote:

> Hi all,
>
> To enable wide-scale community testing of the upcoming Spark 3.0 release,
> the Apache Spark community has posted a new preview release of Spark 3.0.
> This preview is *not a stable release in terms of either API or
> functionality*, but it is meant to give the community early access to try
> the code that will become Spark 3.0. If you would like to test the release,
> please download it, and send feedback using either the mailing lists
>  or JIRA
> 
> .
>
> There are a lot of exciting new features added to Spark 3.0, including
> Dynamic Partition Pruning, Adaptive Query Execution, Accelerator-aware
> Scheduling, Data Source API with Catalog Supports, Vectorization in SparkR,
> support of Hadoop 3/JDK 11/Scala 2.12, and many more. For a full list of
> major features and changes in Spark 3.0.0-preview2, please check the thread(
> http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-3-0-preview-release-feature-list-and-major-changes-td28050.html
>  and
> http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-3-0-preview-release-2-td28491.html
> ).
>
> We'd like to thank our contributors and users for their contributions and
> early feedback to this release. This release would not have been possible
> without you.
>
> To download Spark 3.0.0-preview2, head over to the download page:
> https://archive.apache.org/dist/spark/spark-3.0.0-preview2
>
> Happy Holidays.
>
> Yuming
>


-- 



Happy Diwali everyone!!!

2019-10-27 Thread Xiao Li
Happy Diwali everyone!!!

Xiao


Re: JDK11 Support in Apache Spark

2019-08-24 Thread Xiao Li
Thank you for your contributions! This is a great feature for Spark 3.0! We
finally achieved it!

Xiao

On Sat, Aug 24, 2019 at 12:18 PM Felix Cheung 
wrote:

> That’s great!
>
> --
> *From:* ☼ R Nair 
> *Sent:* Saturday, August 24, 2019 10:57:31 AM
> *To:* Dongjoon Hyun 
> *Cc:* d...@spark.apache.org ; user @spark/'user
> @spark'/spark users/user@spark 
> *Subject:* Re: JDK11 Support in Apache Spark
>
> Finally!!! Congrats
>
> On Sat, Aug 24, 2019, 11:11 AM Dongjoon Hyun 
> wrote:
>
>> Hi, All.
>>
>> Thanks to your many many contributions,
>> Apache Spark master branch starts to pass on JDK11 as of today.
>> (with `hadoop-3.2` profile: Apache Hadoop 3.2 and Hive 2.3.6)
>>
>>
>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2-jdk-11/326/
>> (JDK11 is used for building and testing.)
>>
>> We already verified all UTs (including PySpark/SparkR) before.
>>
>> Please feel free to use JDK11 in order to build/test/run `master` branch
>> and
>> share your experience including any issues. It will help Apache Spark
>> 3.0.0 release.
>>
>> For the follow-ups, please follow
>> https://issues.apache.org/jira/browse/SPARK-24417 .
>> The next step is `how to support JDK8/JDK11 together in a single
>> artifact`.
>>
>> Bests,
>> Dongjoon.
>>
>

-- 



Re: Filter cannot be pushed via a Join

2019-06-18 Thread Xiao Li
Hi, William,

Thanks for reporting it. Could you open a JIRA?

Cheers,

Xiao

William Wong wrote on Tue, Jun 18, 2019 at 8:57 AM:

> BTW, I noticed a workaround is creating a custom rule to remove 'empty
> local relation' from a union table. However, I am not 100% sure if it is
> the right approach.
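
As a hedged sketch of what such a custom rule might look like (this is not the
author's actual implementation; the rule name is made up, the pattern assumes
Spark 2.4-era plan classes, and registration goes through the experimental
extension point):

  import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, LogicalPlan, Union}
  import org.apache.spark.sql.catalyst.rules.Rule

  // Drop empty LocalRelation children from a Union so the union's constraints
  // are not emptied by a pruned-away branch.
  object PruneEmptyUnionChildren extends Rule[LogicalPlan] {
    override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
      case u: Union =>
        val nonEmpty = u.children.filterNot {
          case l: LocalRelation => l.data.isEmpty
          case _                => false
        }
        if (nonEmpty.isEmpty) u                     // every branch empty: leave untouched
        else if (nonEmpty.size == 1) nonEmpty.head  // a single branch left: drop the Union
        else if (nonEmpty.size < u.children.size) Union(nonEmpty)
        else u                                      // nothing removed
    }
  }

  // Register the extra rule on the session before running the query.
  spark.experimental.extraOptimizations ++= Seq(PruneEmptyUnionChildren)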
>
> On Tue, Jun 18, 2019 at 11:53 PM William Wong 
> wrote:
>
>> Dear all,
>>
>> I am not sure whether this is expected or not, and whether I should report
>> it as a bug. Basically, the constraints of a union table can become
>> empty if any subtable is turned into an empty local relation. The side
>> effect is that filters cannot be inferred correctly (by
>> InferFiltersFromConstraints).
>>
>> We may reproduce the issue with the following setup:
>> 1) Prepare two tables:
>> * spark.sql("CREATE TABLE IF NOT EXISTS table1(id string, val string)
>> USING PARQUET");
>> * spark.sql("CREATE TABLE IF NOT EXISTS table2(id string, val string)
>> USING PARQUET");
>>
>> 2) Create a union view on table1.
>> * spark.sql("""
>>  | CREATE VIEW partitioned_table_1 AS
>>  | SELECT * FROM table1 WHERE id = 'a'
>>  | UNION ALL
>>  | SELECT * FROM table1 WHERE id = 'b'
>>  | UNION ALL
>>  | SELECT * FROM table1 WHERE id = 'c'
>>  | UNION ALL
>>  | SELECT * FROM table1 WHERE id NOT IN ('a','b','c')
>>  | """.stripMargin)
>>
>> 3) View the optimized plan of this SQL. The filter 't2.id = 'a'' cannot
>> be inferred. We can see that the constraints of the left table are empty.
>>
>> scala> spark.sql("SELECT * FROM partitioned_table_1 t1, table2 t2 WHERE
>> t1.id = t2.id AND t1.id = 'a'").queryExecution.optimizedPlan
>> res39: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
>> Join Inner, (id#0 = id#4)
>> :- Union
>> :  :- Filter (isnotnull(id#0) && (id#0 = a))
>> :  :  +- Relation[id#0,val#1] parquet
>> :  :- LocalRelation , [id#0, val#1]
>> :  :- LocalRelation , [id#0, val#1]
>> :  +- Filter ((isnotnull(id#0) && NOT id#0 IN (a,b,c)) && (id#0 = a))
>> : +- Relation[id#0,val#1] parquet
>> +- Filter isnotnull(id#4)
>>+- Relation[id#4,val#5] parquet
>>
>> scala> spark.sql("SELECT * FROM partitioned_table_1 t1, table2 t2 WHERE
>> t1.id = t2.id AND t1.id =
>> 'a'").queryExecution.optimizedPlan.children(0).constraints
>> res40: org.apache.spark.sql.catalyst.expressions.ExpressionSet = Set()
>>
>> 4) Modified the query to avoid empty local relation. The filter 't2.id
>> in ('a','b','c','d')' is then inferred properly. The constraints of the
>> left table are not empty as well.
>>
>> scala> spark.sql("SELECT * FROM partitioned_table_1 t1, table2 t2 WHERE
>> t1.id = t2.id AND t1.id IN
>> ('a','b','c','d')").queryExecution.optimizedPlan
>> res42: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
>> Join Inner, (id#0 = id#4)
>> :- Union
>> :  :- Filter ((isnotnull(id#0) && (id#0 = a)) && id#0 IN (a,b,c,d))
>> :  :  +- Relation[id#0,val#1] parquet
>> :  :- Filter ((isnotnull(id#0) && (id#0 = b)) && id#0 IN (a,b,c,d))
>> :  :  +- Relation[id#0,val#1] parquet
>> :  :- Filter ((isnotnull(id#0) && (id#0 = c)) && id#0 IN (a,b,c,d))
>> :  :  +- Relation[id#0,val#1] parquet
>> :  +- Filter ((NOT id#0 IN (a,b,c) && id#0 IN (a,b,c,d)) &&
>> isnotnull(id#0))
>> : +- Relation[id#0,val#1] parquet
>> +- Filter ((id#4 IN (a,b,c,d) && ((isnotnull(id#4) && (((id#4 = a) ||
>> (id#4 = b)) || (id#4 = c))) || NOT id#4 IN (a,b,c))) && isnotnull(id#4))
>>+- Relation[id#4,val#5] parquet
>>
>> scala> spark.sql("SELECT * FROM partitioned_table_1 t1, table2 t2 WHERE
>> t1.id = t2.id AND t1.id IN
>> ('a','b','c','d')").queryExecution.optimizedPlan.children(0).constraints
>> res44: org.apache.spark.sql.catalyst.expressions.ExpressionSet =
>> Set(isnotnull(id#0), id#0 IN (a,b,c,d), id#0 = a) || (id#0 = b)) ||
>> (id#0 = c)) || NOT id#0 IN (a,b,c)))
>>
>>
>> Thanks and regards,
>> William
>>
>>
>> On Sat, Jun 15, 2019 at 1:13 AM William Wong 
>> wrote:
>>
>>> Hi all,
>>>
>>> I'd appreciate it if any expert could help with this strange behavior.
>>>
>>> Interestingly, when I implemented a custom rule to remove empty
>>> LocalRelation children under Union and ran the same query, the filter 'id =
>>> 'a'' was inferred for table2 and pushed through the Join.
>>>
>>> scala> spark2.sql("SELECT * FROM partitioned_table_1 t1, table2 t2 WHERE
>>> t1.id = t2.id AND t1.id = 'a'").explain
>>> == Physical Plan ==
>>> *(4) BroadcastHashJoin [id#0], [id#4], Inner, BuildRight
>>> :- Union
>>> :  :- *(1) Project [id#0, val#1]
>>> :  :  +- *(1) Filter (isnotnull(id#0) && (id#0 = a))
>>> :  : +- *(1) FileScan parquet default.table1[id#0,val#1] Batched:
>>> true, Format: Parquet, Location:
>>> InMemoryFileIndex[file:/Users/williamwong/spark-warehouse/table1],
>>> PartitionFilters: [], PushedFilters: [IsNotNull(id), EqualTo(id,a)],
>>> ReadSchema: struct
>>> :  +- *(2) Project [id#0, val#1]
>>> : +- *(2) Filter ((isnotnull(id#0) && NOT id#0 IN (a,b,c)) && (id#0
>>> = a))
>>> :+- *(2) FileScan parquet default.table1[id#0,val#1] Batched:

[ANNOUNCE] Announcing Apache Spark 2.4.3

2019-05-09 Thread Xiao Li
We are happy to announce the availability of Spark 2.4.3!

Spark 2.4.3 is a maintenance release containing stability fixes. This
release is based on the branch-2.4 maintenance branch of Spark. We strongly
recommend all 2.4 users to upgrade to this stable release.

Note that 2.4.3 switched the default Scala version from Scala 2.12 to Scala
2.11, which is the default for all the previous 2.x releases except 2.4.2.
That means the pre-built convenience binaries are compiled for Scala 2.11.
Spark is still cross-published for 2.11 and 2.12 in Maven Central, and can
be built for 2.12 from source.

To download Spark 2.4.3, head over to the download page:
http://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-2-4-3.html

We would like to acknowledge all community members for contributing to this
release. This release would not have been possible without you.

Xiao Li


Re: [ANNOUNCE] Announcing Apache Spark 2.4.0

2018-11-08 Thread Xiao Li
Try to clear your browsing data or use a different web browser.

Enjoy it,

Xiao

On Thu, Nov 8, 2018 at 4:15 PM Reynold Xin  wrote:

> Do you have a cached copy? I see it here
>
> http://spark.apache.org/downloads.html
>
>
>
> On Thu, Nov 8, 2018 at 4:12 PM Li Gao  wrote:
>
>> this is wonderful !
>> I noticed the official spark download site does not have 2.4 download
>> links yet.
>>
>> On Thu, Nov 8, 2018, 4:11 PM Swapnil Shinde > wrote:
>>
>>> Great news.. thank you very much!
>>>
>>> On Thu, Nov 8, 2018, 5:19 PM Stavros Kontopoulos <
>>> stavros.kontopou...@lightbend.com wrote:
>>>
 Awesome!

 On Thu, Nov 8, 2018 at 9:36 PM, Jules Damji 
 wrote:

> Indeed!
>
> Sent from my iPhone
> Pardon the dumb thumb typos :)
>
> On Nov 8, 2018, at 11:31 AM, Dongjoon Hyun 
> wrote:
>
> Finally, thank you all. Especially, thanks to the release manager,
> Wenchen!
>
> Bests,
> Dongjoon.
>
>
> On Thu, Nov 8, 2018 at 11:24 AM Wenchen Fan 
> wrote:
>
>> + user list
>>
>> On Fri, Nov 9, 2018 at 2:20 AM Wenchen Fan 
>> wrote:
>>
>>> resend
>>>
>>> On Thu, Nov 8, 2018 at 11:02 PM Wenchen Fan 
>>> wrote:
>>>


 -- Forwarded message -
 From: Wenchen Fan 
 Date: Thu, Nov 8, 2018 at 10:55 PM
 Subject: [ANNOUNCE] Announcing Apache Spark 2.4.0
 To: Spark dev list 


 Hi all,

 Apache Spark 2.4.0 is the fifth release in the 2.x line. This
 release adds Barrier Execution Mode for better integration with deep
 learning frameworks, introduces 30+ built-in and higher-order 
 functions to
 deal with complex data type easier, improves the K8s integration, along
 with experimental Scala 2.12 support. Other major updates include the
 built-in Avro data source, Image data source, flexible streaming sinks,
 elimination of the 2GB block size limitation during transfer, Pandas 
 UDF
 improvements. In addition, this release continues to focus on 
 usability,
 stability, and polish while resolving around 1100 tickets.

 We'd like to thank our contributors and users for their
 contributions and early feedback to this release. This release would 
 not
 have been possible without you.

 To download Spark 2.4.0, head over to the download page:
 http://spark.apache.org/downloads.html

 To view the release notes:
 https://spark.apache.org/releases/spark-release-2-4-0.html

 Thanks,
 Wenchen

 PS: If you see any issues with the release notes, webpage or
 published artifacts, please contact me directly off-list.

>>>





-- 



Happy Diwali everyone!!!

2018-11-07 Thread Xiao Li
Happy Diwali everyone!!!

Xiao Li


Re: How to use 'insert overwrite [local] directory' correctly?

2018-08-27 Thread Xiao Li
Open a JIRA?

Bang Xiao wrote on Mon, Aug 27, 2018 at 2:46 AM:

> Solved the problem by creating the directory on HDFS before executing the
> SQL, but I met a new error when I use:
>
> INSERT OVERWRITE LOCAL DIRECTORY '/search/odin/test' row format delimited
> FIELDS TERMINATED BY '\t' select vrid, query, url, loc_city from
> custom.common_wap_vr where logdate >= '2018073000' and logdate <=
> '2018073023' and vrid = '11000801' group by vrid,query, loc_city,url;
>
> spark command is : spark-sql --master yarn --deploy-mode client -f test.sql
>
> here is the stack track:
> 18/08/27 17:16:21 ERROR util.Utils: Aborting task
> org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException:
> Mkdirs failed to create
>
> file:/user/hive/datadir-tmp_hive_2018-08-27_17-14-45_908_2829491226961893146-1/-ext-1/_temporary/0/_temporary/attempt_20180827171619_0002_m_00_0
> (exists=false,
>
> cwd=file:/search/hadoop09/yarn_local/usercache/ultraman/appcache/application_1535079600137_133521/container_e09_1535079600137_133521_01_51)
> at
> org.apache.hadoop.hive.ql.io
> .HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:249)
> at
>
> org.apache.spark.sql.hive.execution.HiveOutputWriter.(HiveFileFormat.scala:123)
> at
>
> org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:103)
> at
>
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:367)
> at
>
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:378)
> at
>
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:269)
> at
>
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:267)
> at
>
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
> at
>
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
> at
>
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
> at
>
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
> at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:109)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
> at
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: Mkdirs failed to create
>
> file:/user/hive/datadir-tmp_hive_2018-08-27_17-14-45_908_2829491226961893146-1/-ext-1/_temporary/0/_temporary/attempt_20180827171619_0002_m_00_0
> (exists=false,
>
> cwd=file:/search/hadoop09/yarn_local/usercache/ultraman/appcache/application_1535079600137_133521/container_e09_1535079600137_133521_01_51)
> at
> org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:447)
> at
> org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:433)
> at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:925)
> at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:818)
> at
> org.apache.hadoop.hive.ql.io
> .HiveIgnoreKeyTextOutputFormat.getHiveRecordWriter(HiveIgnoreKeyTextOutputFormat.java:80)
> at
> org.apache.hadoop.hive.ql.io
> .HiveFileFormatUtils.getRecordWriter(HiveFileFormatUtils.java:261)
> at
> org.apache.hadoop.hive.ql.io
> .HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:246)
> ... 16 more
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
>
>


Re: ORC native in Spark 2.3, with zlib, gives java.nio.BufferUnderflowException during read

2018-03-27 Thread Xiao Li
Hi, Eirik,

Yes, please open a JIRA.

Thanks,

Xiao
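
As a hedged aside added here (not part of the original reply): while such a
JIRA is being investigated, the reader implementation can usually be switched
back per session; the config names below are the standard Spark 2.3 ORC
settings, and the path is a placeholder.

  // Hedged sketch: fall back to the hive ORC reader for the affected data.
  spark.conf.set("spark.sql.orc.impl", "hive")
  spark.conf.set("spark.sql.orc.enableVectorizedReader", "false")
  val df = spark.read.orc("hdfs:///path/to/zlib-orc")   // placeholder path
  df.show(5)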

2018-03-23 8:03 GMT-07:00 Eirik Thorsnes :

> Hi all,
>
> I'm trying the new ORC native in Spark 2.3
> (org.apache.spark.sql.execution.datasources.orc).
>
> I've compiled Spark 2.3 from the git branch-2.3 as of March 20th.
> I also get the same error for the Spark 2.2 from Hortonworks HDP 2.6.4.
>
> *NOTE*: the error only occurs with zlib compression, and I see that with
> Snappy I get an extra log-line saying "OrcCodecPool: Got brand-new codec
> SNAPPY". Perhaps zlib codec is never loaded/triggered in the new code?
>
> I can write using the new native code path without errors, but when *reading*
> zlib-compressed ORC, either the newly written ORC files *or* older
> ORC files written with Spark 2.2/1.6, I get the following exception.
>
> === cut =
> 2018-03-23 10:36:08,249 INFO FileScanRDD: Reading File path:
> hdfs://.../year=1999/part-r-0-2573bfff-1f18-47d9-b0fb-
> 37dc216b8a99.orc,
> range: 0-134217728, partition values: [1999]
> 2018-03-23 10:36:08,326 INFO ReaderImpl: Reading ORC rows from
> hdfs://.../year=1999/part-r-0-2573bfff-1f18-47d9-b0fb-37dc216b8a99.orc
> with {include: [true, true, true, true, true, true, true, true, true],
> offset: 0, length: 134217728}
> 2018-03-23 10:36:08,326 INFO RecordReaderImpl: Reader schema not
> provided -- using file schema
> struct v10:smallint,lcc:smallint,mcc:smallint,hcc:smallint>
>
> 2018-03-23 10:36:08,824 ERROR Executor: Exception in task 0.0 in stage
> 1.0 (TID 1)
> java.nio.BufferUnderflowException
> at java.nio.Buffer.nextGetIndex(Buffer.java:500)
> at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:249)
> at
> org.apache.orc.impl.InStream$CompressedStream.read(InStream.java:248)
> at
> org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(
> RunLengthIntegerReaderV2.java:58)
> at
> org.apache.orc.impl.RunLengthIntegerReaderV2.next(
> RunLengthIntegerReaderV2.java:323)
> at
> org.apache.orc.impl.TreeReaderFactory$TimestampTreeReader.
> nextVector(TreeReaderFactory.java:976)
> at
> org.apache.orc.impl.TreeReaderFactory$StructTreeReader.nextBatch(
> TreeReaderFactory.java:1815)
> at
> org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1184)
> at
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.
> nextBatch(OrcColumnarBatchReader.scala:186)
> at
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.
> nextKeyValue(OrcColumnarBatchReader.scala:114)
> at
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(
> RecordReaderIterator.scala:39)
> at
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(
> FileScanRDD.scala:105)
> at
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.
> nextIterator(FileScanRDD.scala:177)
> at
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(
> FileScanRDD.scala:105)
> at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$
> GeneratedIterator.scan_nextBatch$(Unknown
> Source)
> at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$
> GeneratedIterator.processNext(Unknown
> Source)
> at
> org.apache.spark.sql.execution.BufferedRowIterator.
> hasNext(BufferedRowIterator.java:43)
> at
> org.apache.spark.sql.execution.WholeStageCodegenExec$$
> anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$
> 2.apply(SparkPlan.scala:234)
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$
> 2.apply(SparkPlan.scala:228)
> at
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$
> 1$$anonfun$apply$25.apply(RDD.scala:827)
> at
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$
> 1$$anonfun$apply$25.apply(RDD.scala:827)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:108)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> === cut =
>
> I have the following set in spark-defaults.conf:
>
> spark.sql.hive.convertMetastoreOrc true
> spark.sql.orc.char.enabled true
> spark.sql.orc.enabled true
> spark.sql.orc.filterPushdown true
> spark.sql.orc.impl native
> spark.sql.orc.enableVectorizedReader true

Re: There is no UDF0 interface?

2018-02-04 Thread Xiao Li
The upcoming 2.3 will have it.
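
As a hedged illustration added here (not part of the original reply): in the
meantime, a zero-argument UDF can be registered from Scala with a no-arg
function literal; the UDF name below is made up.

  // Hedged sketch: a zero-argument UDF registered without the Java UDF0 interface.
  spark.udf.register("now_ms", () => System.currentTimeMillis())
  spark.sql("SELECT now_ms()").show()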

On Sun, Feb 4, 2018 at 12:24 PM kant kodali  wrote:

> Hi All,
>
> I see the current UDF APIs can take one or more arguments, but I don't see
> any UDF0 in Spark 2.2.0. Am I correct?
>
> Thanks!
>


Re: spark 2.0 and spark 2.2

2018-01-22 Thread Xiao Li
Generally, the behavior changes in Spark SQL will be documented in
https://spark.apache.org/docs/latest/sql-programming-guide.html#migration-guide

For the ongoing Spark 2.3 release, all the changes in Spark
SQL/DataFrame/Dataset that cause behavior changes are documented in that
section.

Thanks,

Xiao

2018-01-22 7:07 GMT-08:00 Mihai Iacob :

> Does Spark 2.2 have good backwards compatibility? Is there something that
> works in Spark 2.0 but won't work in 2.2?
>
>
> Regards,
>
> *Mihai Iacob*
> DSX Local  - Security, IBM Analytics
>


Re: Spark 2.0 and Oracle 12.1 error

2017-07-21 Thread Xiao Li
Could you share the schema of your Oracle table and open a JIRA?

Thanks!

Xiao
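
As a hedged aside added here (not part of the original thread): Oracle's JDBC
type code -101 usually corresponds to TIMESTAMP WITH TIME ZONE, which the Spark
JDBC source could not map here, so listing only the needed columns (the
workaround described in the quoted message below) avoids fetching it. A rough
sketch, with all names and connection details as placeholders:

  import java.util.Properties

  // Hedged sketch: select explicit columns instead of SELECT * so the
  // unsupported column type is never fetched.
  val connectionProperties = new Properties()
  connectionProperties.put("user", "some_user")        // placeholder
  connectionProperties.put("password", "some_pwd")     // placeholder
  val url = "jdbc:oracle:thin:@//dbhost:1521/service"  // placeholder
  val dbTable = "(select id, json_payload from MySampleTable)"  // explicit columns only
  val jdbcDF = spark.read.jdbc(url, dbTable, connectionProperties)
  jdbcDF.printSchema()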


2017-07-21 9:40 GMT-07:00 Cassa L <lcas...@gmail.com>:

> I am using 2.2.0. I resolved the problem by removing SELECT * and adding
> column names to the SELECT statement. That works. I'm wondering why SELECT
> * will not work.
>
> Regards,
> Leena
>
> On Fri, Jul 21, 2017 at 8:21 AM, Xiao Li <gatorsm...@gmail.com> wrote:
>
>> Could you try 2.2? We fixed multiple Oracle related issues in the latest
>> release.
>>
>> Thanks
>>
>> Xiao
>>
>>
>> On Wed, 19 Jul 2017 at 11:10 PM Cassa L <lcas...@gmail.com> wrote:
>>
>>> Hi,
>>> I am trying to use Spark to read from Oracle (12.1) table using Spark
>>> 2.0. My table has JSON data.  I am getting below exception in my code. Any
>>> clue?
>>>
>>> >>>>>
>>> java.sql.SQLException: Unsupported type -101
>>>
>>> at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.
>>> org$apache$spark$sql$execution$datasources$jdbc$
>>> JdbcUtils$$getCatalystType(JdbcUtils.scala:233)
>>> at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$
>>> anonfun$8.apply(JdbcUtils.scala:290)
>>> at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$
>>> anonfun$8.apply(JdbcUtils.scala:290)
>>> at scala.Option.getOrElse(Option.scala:121)
>>> at
>>>
>>> ==
>>> My code is very simple.
>>>
>>> SparkSession spark = SparkSession
>>> .builder()
>>> .appName("Oracle Example")
>>> .master("local[4]")
>>> .getOrCreate();
>>>
>>> final Properties connectionProperties = new Properties();
>>> connectionProperties.put("user", *"some_user"*));
>>> connectionProperties.put("password", "some_pwd"));
>>>
>>> final String dbTable =
>>> "(select *  from  MySampleTable)";
>>>
>>> Dataset<Row> jdbcDF = spark.read().jdbc(URL, dbTable, connectionProperties);
>>>
>>>
>
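
For readers hitting the same "Unsupported type -101" error, here is a sketch of the workaround Leena describes: enumerate the columns in the pushed-down query and cast any column whose Oracle type the JDBC dialect cannot map (type code -101 typically corresponds to Oracle's TIMESTAMP WITH TIME ZONE). The column names are illustrative, the url/connectionProperties placeholders follow the snippet above, and the sketch is written in Scala:

// Enumerate the columns instead of "select *", casting the offending column on the Oracle side.
val dbTable =
  """(select ID, PAYLOAD,
    |        CAST(CREATED_AT AS TIMESTAMP) AS CREATED_AT
    |   from MySampleTable) t""".stripMargin

val jdbcDF = spark.read.jdbc(url, dbTable, connectionProperties)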


Re: Spark 2.0 and Oracle 12.1 error

2017-07-21 Thread Xiao Li
Could you try 2.2? We fixed multiple Oracle related issues in the latest
release.

Thanks

Xiao


On Wed, 19 Jul 2017 at 11:10 PM Cassa L  wrote:

> Hi,
> I am trying to use Spark to read from Oracle (12.1) table using Spark 2.0.
> My table has JSON data.  I am getting below exception in my code. Any clue?
>
> >
> java.sql.SQLException: Unsupported type -101
>
> at
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$getCatalystType(JdbcUtils.scala:233)
> at
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$8.apply(JdbcUtils.scala:290)
> at
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$8.apply(JdbcUtils.scala:290)
> at scala.Option.getOrElse(Option.scala:121)
> at
>
> ==
> My code is very simple.
>
> SparkSession spark = SparkSession
> .builder()
> .appName("Oracle Example")
> .master("local[4]")
> .getOrCreate();
>
> final Properties connectionProperties = new Properties();
> connectionProperties.put("user", *"some_user"*));
> connectionProperties.put("password", "some_pwd"));
>
> final String dbTable =
> "(select *  from  MySampleTable)";
>
> Dataset<Row> jdbcDF = spark.read().jdbc(URL, dbTable, connectionProperties);
>
>


Re: Remove .HiveStaging files

2017-02-16 Thread Xiao Li
Maybe you can check this PR?

https://github.com/apache/spark/pull/16399

Thanks,

Xiao


2017-02-15 15:05 GMT-08:00 KhajaAsmath Mohammed :

> Hi,
>
> I am using Spark temporary tables to write data back to Hive. I have seen
> weird behavior with .hive-staging files being left behind after job completion. Does
> anyone know how to delete them, or prevent them from being created, when writing data into Hive?
>
> Thanks,
> Asmath
>
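
For reference, one mitigation often suggested for leftover staging directories (separate from whatever the PR above implements) is to point Hive's staging directory at a location that is easy to clean up. This is an assumption about the setup, and whether it takes effect depends on the Spark and Hive versions in use; hive.exec.stagingdir itself is a standard Hive setting, passed here through Spark's hadoop-conf prefix in spark-defaults.conf:

spark.hadoop.hive.exec.stagingdir  /tmp/hive/.staging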


Re: Cannot read Hive Views in Spark SQL

2017-02-06 Thread Xiao Li
Which Spark version are you using?

2017-02-06 12:25 GMT-05:00 vaquar khan :

> Did you try  MSCK REPAIR TABLE  ?
>
> Regards,
> Vaquar Khan
>
> On Feb 6, 2017 11:21 AM, "KhajaAsmath Mohammed" 
> wrote:
>
>> I don't think so; I was able to insert overwrite other created tables in
>> Hive using Spark SQL. The only problem I am facing is that Spark is not able
>> to recognize the Hive view name. Very strange, but I am not sure where I am
>> going wrong here.
>>
>> On Mon, Feb 6, 2017 at 11:03 AM, Jon Gregg  wrote:
>>
>>> Confirming that Spark can read newly created views - I just created a
>>> test view in HDFS and I was able to query it in Spark 1.5 immediately after
>>> without a refresh.  Possibly an issue with your Spark-Hive connection?
>>>
>>> Jon
>>>
>>> On Sun, Feb 5, 2017 at 9:31 PM, KhajaAsmath Mohammed <
>>> mdkhajaasm...@gmail.com> wrote:
>>>
 Hi Khan,

 It didn't work in my case; I used the code below. The view is already present in
 Hive, but I can't read it in Spark SQL. It throws an exception that the table is not
 found.

 sqlCtx.refreshTable("schema.hive_view")


 Thanks,

 Asmath


 On Sun, Feb 5, 2017 at 7:56 PM, vaquar khan 
 wrote:

> Hi Ashmath,
>
> Try  refresh table
>
> // spark is an existing SparkSession
> spark.catalog.refreshTable("my_table")
>
>
> http://spark.apache.org/docs/latest/sql-programming-guide.ht
> ml#metadata-refreshing
>
>
>
> Regards,
> Vaquar khan
>
>
>
> On Sun, Feb 5, 2017 at 7:19 PM, KhajaAsmath Mohammed <
> mdkhajaasm...@gmail.com> wrote:
>
>> Hi,
>>
>> I have a hive view which is basically set of select statements on
>> some tables. I want to read the hive view and use hive builtin functions
>> available in spark sql.
>>
>> I am not able to read that hive view in spark sql but can retreive
>> data in hive shell.
>>
>> can't spark access hive views?
>>
>> Thanks,
>> Asmath
>>
>
>
>
> --
> Regards,
> Vaquar Khan
> +1 -224-436-0783 <(224)%20436-0783>
>
> IT Architect / Lead Consultant
> Greater Chicago
>


>>>
>>


Re: Spark 2.0 issue

2016-09-29 Thread Xiao Li
Hi, Ashish,

Will take a look at this soon.

Thanks for reporting this,

Xiao

2016-09-29 14:26 GMT-07:00 Ashish Shrowty :
> If I try to inner-join two dataframes which originated from the same initial
> dataframe that was loaded using spark.sql() call, it results in an error -
>
> // reading from Hive .. the data is stored in Parquet format in Amazon
> S3
> val d1 = spark.sql("select * from ")
> val df1 =
> d1.groupBy("key1","key2").agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2").agg(avg("itemcount").as("avgqty"))
> df1.join(df2, Seq("key1","key2")) gives error -
>  org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
>
> If the same Dataframe is initialized via spark.read.parquet(), the above
> code works. This same code above also worked with Spark 1.6.2. I created a
> JIRA too ..  SPARK-17709 
>
> Any help appreciated!
>
> Thanks,
> Ashish
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-2-0-issue-tp27818.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
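
For readers who hit the same AnalysisException before SPARK-17709 is resolved, one workaround that usually sidesteps the ambiguous "using" columns is to alias the two aggregated frames and join on explicitly qualified columns instead of Seq("key1", "key2"). A sketch following the snippet above (the table name is a placeholder, and spark.implicits._ is assumed for the $ syntax):

import org.apache.spark.sql.functions.avg
import spark.implicits._

val d1  = spark.sql("select * from some_table")   // placeholder table name
val df1 = d1.groupBy("key1", "key2").agg(avg("totalprice").as("avgtotalprice")).as("a")
val df2 = d1.groupBy("key1", "key2").agg(avg("itemcount").as("avgqty")).as("b")

// Qualify the join keys via the aliases so the analyzer can tell the two sides apart.
val joined = df1.join(df2, $"a.key1" === $"b.key1" && $"a.key2" === $"b.key2")
  .select($"a.key1", $"a.key2", $"a.avgtotalprice", $"b.avgqty")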



Re: What are using Spark for

2016-08-01 Thread Xiao Li
Hi, Rohit,

The Spark summit has many interesting use cases. Hopefully, it can
answer your question.

https://spark-summit.org/2015/schedule/
https://spark-summit.org/2016/schedule/

Thanks,

Xiao

2016-08-01 22:48 GMT-07:00 Rohit L :
> Hi Everyone,
>
>   I want to know the real-world use cases for which Spark is used, so can
> you please share what you are using Apache Spark for in your projects?
>
> --
> Rohit

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Difference between DataFrame.write.jdbc and DataFrame.write.format("jdbc")

2016-07-06 Thread Xiao Li
Hi, Dragisa,

Just submitted a PR for implementing the save API.
https://github.com/apache/spark/pull/14077

Let me know if you have any question,

Xiao
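
For reference, a minimal Scala sketch of the two call styles being compared in this thread; the format("jdbc") path needs an explicit save() at the end, and (as discussed here) whether it succeeds depends on the Spark version. The URL, table name, and credentials are illustrative:

import java.util.Properties
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("jdbc-write-sketch").master("local[*]").getOrCreate()
import spark.implicits._

val dataFrame = Seq((1, "a"), (2, "b")).toDF("id", "name")
val dbUrl = "jdbc:postgresql://localhost:5432/testdb"

val props = new Properties()
props.put("user", "some_user")
props.put("password", "some_pwd")

// Style 1: the dedicated jdbc() writer.
dataFrame.write.mode(SaveMode.Overwrite).jdbc(dbUrl, "my_table", props)

// Style 2: the generic format("jdbc") writer; note the trailing save().
dataFrame.write
  .mode(SaveMode.Overwrite)
  .format("jdbc")
  .option("url", dbUrl)
  .option("dbtable", "my_table")
  .option("user", "some_user")
  .option("password", "some_pwd")
  .save()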

2016-07-06 10:41 GMT-07:00 Rabin Banerjee :

> Hi Buddy,
>
>    I used both, but DataFrame.write.jdbc is the older API and only works if you
> provide a table name; it won't work if you provide custom queries. Whereas
> DataFrame.write.format("jdbc") is more generic and works not only with a table
> name but also with custom queries. Hence I recommend using
> DataFrame.write.format("jdbc").
>
> Cheers !
> Rabin
>
>
>
> On Wed, Jul 6, 2016 at 10:35 PM, Dragisa Krsmanovic <
> dragi...@ticketfly.com> wrote:
>
>> I was expecting to get the same results with both:
>>
>> dataFrame.write.mode(SaveMode.Overwrite).jdbc(dbUrl, "my_table", props)
>>
>> and
>>
>> dataFrame.write.mode(SaveMode.Overwrite).format("jdbc").options(opts).option("dbtable",
>> "my_table")
>>
>>
>> In the first example, it behaves as expected. It creates a new table and
>> populates it with the rows from DataFrame.
>>
>> In the second case, I get exception:
>> org.apache.spark.sql.execution.datasources.jdbc.DefaultSource does not
>> allow create table as select.
>>
>> Looking at the Spark source, it looks like there is a completely separate
>> implementation for format("jdbc") and for jdbc(...).
>>
>> I find that confusing. Unfortunately documentation is rather sparse and
>> one finds this discrepancy only through trial and error.
>>
>> Is there a plan to deprecate one of the forms ? Or to allow same
>> functionality for both ?
>>
>> I tried both 1.6 and 2.0-preview
>> --
>>
>> Dragiša Krsmanović | Platform Engineer | Ticketfly
>>
>> dragi...@ticketfly.com
>>
>> @ticketfly  | ticketfly.com/blog |
>> facebook.com/ticketfly
>>
>
>


Re: Difference between DataFrame.write.jdbc and DataFrame.write.format("jdbc")

2016-07-06 Thread Xiao Li
Hi, Dragisa,

Your second way is incomplete, right? To get the error you showed, you need
to put save() there.

Yeah, we can implement the trait CreatableRelationProvider for JDBC. Then,
you will not see that error.

Will submit a PR for that.

Thanks,

Xiao


2016-07-06 10:05 GMT-07:00 Dragisa Krsmanovic :

> I was expecting to get the same results with both:
>
> dataFrame.write.mode(SaveMode.Overwrite).jdbc(dbUrl, "my_table", props)
>
> and
>
> dataFrame.write.mode(SaveMode.Overwrite).format("jdbc").options(opts).option("dbtable",
> "my_table")
>
>
> In the first example, it behaves as expected. It creates a new table and
> populates it with the rows from DataFrame.
>
> In the second case, I get exception:
> org.apache.spark.sql.execution.datasources.jdbc.DefaultSource does not
> allow create table as select.
>
> Looking at the Spark source, it looks like there is a completely separate
> implementation for format("jdbc") and for jdbc(...).
>
> I find that confusing. Unfortunately documentation is rather sparse and
> one finds this discrepancy only through trial and error.
>
> Is there a plan to deprecate one of the forms ? Or to allow same
> functionality for both ?
>
> I tried both 1.6 and 2.0-preview
> --
>
> Dragiša Krsmanović | Platform Engineer | Ticketfly
>
> dragi...@ticketfly.com
>
> @ticketfly  | ticketfly.com/blog |
> facebook.com/ticketfly
>


Re: Copying all Hive tables from Prod to UAT

2016-04-08 Thread Xiao Li
You also need to ensure no workload is running on both sides.

2016-04-08 15:54 GMT-07:00 Ali Gouta :

> For Hive, you may use Sqoop to achieve this. In my opinion, you may also
> run a Spark job to do it.
> On 9 Apr 2016 00:25, "Ashok Kumar"  wrote:
>
> Hi,
>
> Anyone has suggestions how to create and copy Hive and Spark tables from
> Production to UAT.
>
> One way would be to copy table data to external files and then move the
> external files to a local target directory and populate the tables in
> target Hive with data.
>
> Is there an easier way of doing so?
>
> thanks
>
>
>
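
As a sketch of the "Spark job" approach mentioned above (a Spark 2.x-style API is assumed; the database and path names are illustrative), one option is to export each production table to files that can then be copied and re-registered on the UAT side:

import spark.implicits._

// List the tables in the production database and dump each one as Parquet.
val tables = spark.sql("SHOW TABLES IN prod_db").select("tableName").as[String].collect()

tables.foreach { t =>
  spark.table(s"prod_db.$t")
    .write.mode("overwrite")
    .parquet(s"hdfs:///staging/prod_export/$t")   // copy this directory to UAT, then create tables over it
}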


Re: Is Hive CREATE DATABASE IF NOT EXISTS atomic

2016-04-07 Thread Xiao Li
Hi,

Assuming you are using 1.6 or before, this is a native Hive command.

Basically, the execution of Database creation is completed by Hive.

Thanks,

Xiao Li

2016-04-07 15:23 GMT-07:00 antoniosi <antonio...@gmail.com>:

> Hi,
>
> I am using hiveContext.sql("create database if not exists ") to
> create a hive db. Is this statement atomic?
>
> Thanks.
>
> Antonio.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Is-Hive-CREATE-DATABASE-IF-NOT-EXISTS-atomic-tp26706.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Spark SQL Optimization

2016-03-21 Thread Xiao Li
Hi, Maybe you can open a JIRA and upload your plan as Michael suggested.
This is an interesting feature. Thanks!

Xiao Li

2016-03-21 10:36 GMT-07:00 Michael Armbrust <mich...@databricks.com>:

> It's helpful if you can include the output of EXPLAIN EXTENDED or
> df.explain(true) whenever asking about query performance.
>
> On Mon, Mar 21, 2016 at 6:27 AM, gtinside <gtins...@gmail.com> wrote:
>
>> Hi ,
>>
>> I am trying to execute a simple query with join on 3 tables. When I look
>> at
>> the execution plan , it varies with position of table in the "from"
>> clause.
>> Execution plan looks more optimized when the position of table with
>> predicates is specified before any other table.
>>
>>
>> Original query :
>>
>> select distinct pge.portfolio_code
>> from table1 pge join table2 p
>> on p.perm_group = pge.anc_port_group
>> join table3 uge
>> on p.user_group=uge.anc_user_group
>> where uge.user_name = 'user' and p.perm_type = 'TEST'
>>
>> Optimized query (table with predicates is moved ahead):
>>
>> select distinct pge.portfolio_code
>> from table1 uge, table2 p, table3 pge
>> where uge.user_name = 'user' and p.perm_type = 'TEST'
>> and p.perm_group = pge.anc_port_group
>> and p.user_group=uge.anc_user_group
>>
>>
>> Execution plan is more optimized for the optimized query and hence the
>> query
>> executes faster. All the tables are being sourced from parquet files
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Optimization-tp26548.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>
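
For reference, a sketch of how to capture the plans Michael asks for, using the first query from this thread (sqlContext as in Spark 1.x; explain(true) prints the parsed, analyzed, optimized, and physical plans):

val q1 = sqlContext.sql(
  """select distinct pge.portfolio_code
    |from table1 pge join table2 p on p.perm_group = pge.anc_port_group
    |join table3 uge on p.user_group = uge.anc_user_group
    |where uge.user_name = 'user' and p.perm_type = 'TEST'""".stripMargin)

q1.explain(true)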


Re: how to interview spark developers

2016-02-23 Thread Xiao Li
This is interesting! I believe the interviewees should AT LEAST subscribe
this mailing list, if they are spark developers.

Then, they will know your questions before the interview. : )

2016-02-23 22:07 GMT-08:00 charles li :

> Hi there, we are going to recruit several Spark developers. Can someone
> give some ideas on interviewing candidates, say, Spark-related problems?
>
>
> great thanks.
>
> --
> *--*
> a spark lover, a quant, a developer and a good man.
>
> http://github.com/litaotao
>


Re: how to introduce spark to your colleague if he has no background about *** spark related

2016-01-31 Thread Xiao Li
My 2 cents. Concepts are always boring to people with zero background.
Use examples to show how easy and powerful Spark is! Use cases are also
useful for them. Download the slides from the Spark Summits; I believe you can
find a lot of interesting ideas there!

Tomorrow, I am facing a similar situation, but the audience is three RDBMS
engine experts. I will go over the Spark SQL paper from SIGMOD 2015 with them
and show them the source code.

Good luck!

Xiao Li

2016-01-31 22:35 GMT-08:00 Jörn Franke <jornfra...@gmail.com>:

> It depends of course on the background of the people but how about some
> examples ("word count") how it works in the background.
>
> On 01 Feb 2016, at 07:31, charles li <charles.up...@gmail.com> wrote:
>
>
> *Apache Spark™* is a fast and general engine for large-scale data
> processing.
>
> it's a good one-line profile of Spark, but it's really too short for lots of people
> if they have little background in this field.
>
> ok, frankly, I'll give a tech-talk about spark later this week, and now
> I'm writing a slide about that, but I'm stuck at the first slide.
>
>
> I'm going to cover a few questions about Spark in the first part of my
> talk. Most of my colleagues have no background in Spark or Hadoop, so I
> want to talk about:
>
> 1. the background of birth of spark
> 2. pros and cons of spark, or the situations that spark is going to
> handle, or why we use spark
> 3. the basic principles of spark,
> 4. the basic conceptions of spark
>
> Has anyone run into this kind of problem, introducing Spark to someone who has no
> background in your field? If so, I hope you can tell me how you handled it
> at the time, or give some ideas about the 4 sections mentioned
> above.
>
>
> great thanks.
>
>
> --
> *--*
> a spark lover, a quant, a developer and a good man.
>
> http://github.com/litaotao
>
>
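
As a concrete starting point for such a talk, here is the classic word-count example Jörn mentions, as a minimal Scala sketch that can be run in spark-shell (the input path is illustrative):

// Count word occurrences across a text file in a few lines.
val counts = sc.textFile("hdfs:///data/input.txt")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.take(10).foreach(println)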


Re: Spark SQL IN Clause

2015-12-04 Thread Xiao Li
https://github.com/apache/spark/pull/9055

This JIRA explains how to convert IN to Joins.

Thanks,

Xiao Li



2015-12-04 11:27 GMT-08:00 Michael Armbrust <mich...@databricks.com>:

> The best way to run this today is probably to manually convert the query
> into a join.  I.e. create a dataframe that has all the numbers in it, and
> join/outer join it with the other table.  This way you avoid parsing a
> gigantic string.
>
> On Fri, Dec 4, 2015 at 10:36 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> Have you seen this JIRA ?
>>
>> [SPARK-8077] [SQL] Optimization for TreeNodes with large numbers of
>> children
>>
>> From the numbers Michael published, 1 million numbers would still need
>> 250 seconds to parse.
>>
>> On Fri, Dec 4, 2015 at 10:14 AM, Madabhattula Rajesh Kumar <
>> mrajaf...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> How to use/best practices "IN" clause in Spark SQL.
>>>
>>> Use case: read the table based on a list of numbers.
>>> For example, 1 million of them.
>>>
>>> Regards,
>>> Rajesh
>>>
>>
>>
>
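
For reference, a sketch of the join-based rewrite Michael describes: put the numbers into a DataFrame and join, instead of building a giant IN (...) string. The names idList and bigTable are placeholders for the list of ~1 million numbers and the table being filtered (Spark 1.5/1.6-era API):

import sqlContext.implicits._

val wantedIds = sc.parallelize(idList).toDF("id")   // idList: Seq[Long]

// An inner join keeps only the rows whose id appears in the list.
val result = bigTable.join(wantedIds, bigTable("id") === wantedIds("id"))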


Re: Low Latency SQL query

2015-12-01 Thread Xiao Li
http://cacm.acm.org/magazines/2011/6/108651-10-rules-for-scalable-performance-in-simple-operation-datastores/fulltext

Try to read this article. It might help you understand your problem.

Thanks,

Xiao Li

2015-12-01 16:36 GMT-08:00 Mark Hamstra <m...@clearstorydata.com>:

> I'd ask another question first: If your SQL query can be executed in a
> performant fashion against a conventional (RDBMS?) database, why are you
> trying to use Spark?  How you answer that question will be the key to
> deciding among the engineering design tradeoffs to effectively use Spark or
> some other solution.
>
> On Tue, Dec 1, 2015 at 4:23 PM, Andrés Ivaldi <iaiva...@gmail.com> wrote:
>
>> Ok, so latency problem is being generated because I'm using SQL as
>> source? how about csv, hive, or another source?
>>
>> On Tue, Dec 1, 2015 at 9:18 PM, Mark Hamstra <m...@clearstorydata.com>
>> wrote:
>>
>>> It is not designed for interactive queries.
>>>
>>>
>>> You might want to ask the designers of Spark, Spark SQL, and
>>> particularly some things built on top of Spark (such as BlinkDB) about
>>> their intent with regard to interactive queries.  Interactive queries are
>>> not the only designed use of Spark, but it is going too far to claim that
>>> Spark is not designed at all to handle interactive queries.
>>>
>>> That being said, I think that you are correct to question the wisdom of
>>> expecting lowest-latency query response from Spark using SQL (sic,
>>> presumably a RDBMS is intended) as the datastore.
>>>
>>> On Tue, Dec 1, 2015 at 4:05 PM, Jörn Franke <jornfra...@gmail.com>
>>> wrote:
>>>
>>>> Hmm it will never be faster than SQL if you use SQL as an underlying
>>>> storage. Spark is (currently) an in-memory batch engine for iterative
>>>> machine learning workloads. It is not designed for interactive queries.
>>>> Currently hive is going into the direction of interactive queries.
>>>> Alternatives are Hbase on Phoenix or Impala.
>>>>
>>>> On 01 Dec 2015, at 21:58, Andrés Ivaldi <iaiva...@gmail.com> wrote:
>>>>
>>>> Yes,
>>>> The use case would be,
>>>> Have Spark in a service (I didn't investigate this yet); through API
>>>> calls to this service we perform some aggregations over data in SQL. We are
>>>> already doing this with an internal development.
>>>>
>>>> Nothing complicated, for instance, a table with Product, Product
>>>> Family, cost, price, etc. Columns like Dimension and Measures,
>>>>
>>>> I want to use Spark to query that table and perform a kind of rollup, with
>>>> cost as the measure and Product, Product Family as the dimensions.
>>>>
>>>> Only 3 columns, and it takes about 20s to perform that query and the
>>>> aggregation, while the query run directly against the database with a GROUP BY
>>>> on those columns takes about 1s.
>>>>
>>>> regards
>>>>
>>>>
>>>>
>>>> On Tue, Dec 1, 2015 at 5:38 PM, Jörn Franke <jornfra...@gmail.com>
>>>> wrote:
>>>>
>>>>> can you elaborate more on the use case?
>>>>>
>>>>> > On 01 Dec 2015, at 20:51, Andrés Ivaldi <iaiva...@gmail.com> wrote:
>>>>> >
>>>>> > Hi,
>>>>> >
>>>>> > I'd like to use spark to perform some transformations over data
>>>>> stored inSQL, but I need low Latency, I'm doing some test and I run into
>>>>> spark context creation and data query over SQL takes too long time.
>>>>> >
>>>>> > Any idea for speed up the process?
>>>>> >
>>>>> > regards.
>>>>> >
>>>>> > --
>>>>> > Ing. Ivaldi Andres
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Ing. Ivaldi Andres
>>>>
>>>>
>>>
>>
>>
>> --
>> Ing. Ivaldi Andres
>>
>
>


Re: newbie : why are thousands of empty files being created on HDFS?

2015-11-23 Thread Xiao Li
In your case, maybe you can try to call the function coalesce?

Good luck,

Xiao Li

2015-11-23 12:15 GMT-08:00 Andy Davidson <a...@santacruzintegration.com>:

> Hi Sabarish
>
> I am but a simple padawan :-) I do not understand your answer. Why would
> Spark be creating so many empty partitions? My real problem is my
> application is very slow. I happened to notice thousands of empty files
> being created. I thought this is a hint to why my app is slow.
>
> My program calls sample(0.01).filter(not null).saveAsTextFile(). This
> takes about 35 min to scan 500,000 JSON strings and write 5,000 to disk.
> The total data written is 38M.
>
> The data is read from HDFS. My understanding is Spark can not know in
> advance how HDFS partitioned the data. Spark knows I have a master and 3
> slaves machines. It knows how many works/executors are assigned to my Job.
> I would expect Spark to be smart enough not to create more partitions than
> I have worker machines?
>
> Also given I am not using any key/value operations like Join() or doing
> multiple scans I would assume my app would not benefit from partitioning.
>
>
> Kind regards
>
> Andy
>
>
> From: Sabarish Sasidharan <sabarish.sasidha...@manthan.com>
> Date: Saturday, November 21, 2015 at 7:20 PM
> To: Andrew Davidson <a...@santacruzintegration.com>
> Cc: "user @spark" <user@spark.apache.org>
> Subject: Re: newbie : why are thousands of empty files being created on
> HDFS?
>
> Those are empty partitions. I don't see the number of partitions specified
> in code. That then implies the default parallelism config is being used and
> is set to a very high number, the sum of empty + non empty files.
>
> Regards
> Sab
> On 21-Nov-2015 11:59 pm, "Andy Davidson" <a...@santacruzintegration.com>
> wrote:
>
>> I started working on a very simple ETL pipeline for a POC. It reads in a
>> data set of tweets stored as JSON strings in HDFS and randomly selects
>> 1% of the observations and writes them to HDFS. It seems to run very
>> slowly. E.g., writing 4720 observations takes 1:06:46.577795. I
>> also noticed that RDD saveAsTextFile is creating thousands of empty
>> files.
>>
>> I assume creating all these empty files must be slowing down the system. Any
>> idea why this is happening? Do I have write a script to periodical remove
>> empty files?
>>
>>
>> Kind regards
>>
>> Andy
>>
>> tweetStrings = sc.textFile(inputDataURL)
>>
>> def removeEmptyLines(line) :
>> if line:
>> return True
>> else :
>> emptyLineCount.add(1);
>> return False
>>
>> emptyLineCount = sc.accumulator(0)
>> sample = (tweetStrings.filter(removeEmptyLines)
>>   .sample(withReplacement=False, fraction=0.01, seed=345678))
>>
>> startTime = datetime.datetime.now()
>> sample.saveAsTextFile(saveDataURL)
>>
>> endTime = datetime.datetime.now()
>> print("elapsed time:%s" % (datetime.datetime.now() - startTime))
>>
>> elapsed time:1:06:46.577795
>>
>>
>>
>> *Total number of empty files*
>>
>> $ hadoop fs -du {saveDataURL} | grep '^0' | wc -l
>>
>> 223515
>>
>> *Total number of files with data*
>>
>> $ hadoop fs -du {saveDataURL} | grep -v '^0' | wc -l
>>
>> 4642
>>
>>
>> I randomly picked a part file; its size is 9251.
>>
>>
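
A sketch of Xiao's suggestion applied to the pipeline above: after the sample and filter, collapse the many (mostly empty) partitions before writing. The original snippet is PySpark; the sketch below is Scala, and the paths, fraction, and partition count are illustrative:

val tweetStrings = sc.textFile("hdfs:///data/tweets")

val sample = tweetStrings
  .filter(_.nonEmpty)
  .sample(withReplacement = false, fraction = 0.01, seed = 345678)

// Coalesce to a small number of partitions (for example, a few per executor)
// so the job does not write thousands of tiny or empty part files.
sample.coalesce(8).saveAsTextFile("hdfs:///data/tweets_sample")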


Re: Relation between RDDs, DataFrames and Project Tungsten

2015-11-23 Thread Xiao Li
Let me share my understanding.

If we view Spark as analytics OS, RDD APIs are like OS system calls. These
low-level system calls can be called in the program languages like C.
DataFrame and Dataset APIs are like higher-level programming languages.
They hide the low level complexity and the compiler (i.e., Catalyst) will
optimize your programs. For most users, the SQL, DataFrame and Dataset APIs
are good enough to satisfy most requirements.

I guess GUI interfaces for Spark will be available in the near future.
Spark users (e.g., CEOs) might not need to know what RDDs are at all. They
can analyze their data by clicking a few buttons, instead of writing the
programs. : )

Wish Spark will be the most popular analytics OS in the world! : )

Have a good holiday everyone!

Xiao Li



2015-11-23 17:56 GMT-08:00 Jakob Odersky <joder...@gmail.com>:

> Thanks Michael, that helped me a lot!
>
> On 23 November 2015 at 17:47, Michael Armbrust <mich...@databricks.com>
> wrote:
>
>> Here is how I view the relationship between the various components of
>> Spark:
>>
>>  - *RDDs - *a low level API for expressing DAGs that will be executed in
>> parallel by Spark workers
>>  - *Catalyst -* an internal library for expressing trees that we use to
>> build relational algebra and expression evaluation.  There's also an
>> optimizer and query planner than turns these into logical concepts into RDD
>> actions.
>>  - *Tungsten -* an internal optimized execution engine that can compile
>> catalyst expressions into efficient java bytecode that operates directly on
>> serialized binary data.  It also has nice low level data structures /
>> algorithms like hash tables and sorting that operate directly on serialized
>> data.  These are used by the physical nodes that are produced by the query
>> planner (and run inside of RDD operation on workers).
>>  - *DataFrames - *a user facing API that is similar to SQL/LINQ for
>> constructing dataflows that are backed by catalyst logical plans
>>  - *Datasets* - a user facing API that is similar to the RDD API for
>> constructing dataflows that are backed by catalyst logical plans
>>
>> So everything is still operating on RDDs but I anticipate most users will
>> eventually migrate to the higher level APIs for convenience and automatic
>> optimization
>>
>> On Mon, Nov 23, 2015 at 4:18 PM, Jakob Odersky <joder...@gmail.com>
>> wrote:
>>
>>> Hi everyone,
>>>
>>> I'm doing some reading-up on all the newer features of Spark such as
>>> DataFrames, DataSets and Project Tungsten. This got me a bit confused on
>>> the relation between all these concepts.
>>>
>>> When starting to learn Spark, I read a book and the original paper on
>>> RDDs, this lead me to basically think "Spark == RDDs".
>>> Now, looking into DataFrames, I read that they are basically
>>> (distributed) collections with an associated schema, thus enabling
>>> declarative queries and optimization (through Catalyst). I am uncertain how
>>> DataFrames relate to RDDs: are DataFrames transformed to operations on RDDs
>>> once they have been optimized? Or are they completely different concepts?
>>> In case of the latter, do DataFrames still use the Spark scheduler and get
>>> broken down into a DAG of stages and tasks?
>>>
>>> Regarding project Tungsten, where does it fit in? To my understanding it
>>> is used to efficiently cache data in memory and may also be used to
>>> generate query code for specialized hardware. This sounds as though it
>>> would work on Spark's worker nodes, however it would also only work with
>>> schema-associated data (aka DataFrames), thus leading me to the conclusion
>>> that RDDs and DataFrames do not share a common backend which in turn
>>> contradicts my conception of "Spark == RDDs".
>>>
>>> Maybe I missed the obvious as these questions seem pretty basic, however
>>> I was unable to find clear answers in Spark documentation or related papers
>>> and talks. I would greatly appreciate any clarifications.
>>>
>>> thanks,
>>> --Jakob
>>>
>>
>>
>
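
As a small, concrete illustration of the relationship described above (DataFrames compile to Catalyst plans that ultimately execute as RDD operations), both layers can be inspected from the shell. A sketch using the Spark 1.x API:

val df = sqlContext.range(0, 10).filter("id % 2 = 0")

println(df.queryExecution)    // parsed, analyzed, optimized and physical plans (Catalyst)
println(df.rdd.toDebugString) // the underlying RDD lineage that actually runs on the workers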


Re: How 'select name,age from TBL_STUDENT where age = 37' is optimized when caching it

2015-11-16 Thread Xiao Li
Your dataframe is cached. Thus, your plan is stored as an InMemoryRelation.

You can read the logics in CacheManager.scala.

Good luck,

Xiao Li

2015-11-16 6:35 GMT-08:00 Todd <bit1...@163.com>:

> Hi,
>
> When I cache the dataframe and run the query,
>
> val df = sqlContext.sql("select  name,age from TBL_STUDENT where age =
> 37")
> df.cache()
> df.show
> println(df.queryExecution)
>
>  I got the following execution plan,from the optimized logical plan,I can
> see the whole analyzed logical plan is totally replaced with the
> InMemoryRelation logical plan. But when I look into the Optimizer, I didn't
> see any rule that relates to the InMemoryRelation.
>
> Could you please explain how the optimization works?
>
>
>
>
> == Parsed Logical Plan ==
> ', argString:< [unresolvedalias(UnresolvedAttribute:
> 'name),unresolvedalias(UnresolvedAttribute: 'age)]>
>   ', argString:< (UnresolvedAttribute: 'age = 37)>
> ', argString:< [TBL_STUDENT], None>
>
> == Analyzed Logical Plan ==
> name: string, age: int
> , argString:<
> [AttributeReference:name#1,AttributeReference:age#3]>
>   , argString:< (AttributeReference:age#3 = 37)>
> , argString:< TBL_STUDENT>
>   , argString:<
> [AttributeReference:id#0,AttributeReference:name#1,AttributeReference:classId#2,AttributeReference:age#3],
> MapPartitionsRDD[4] at main at NativeMethodAccessorImpl.java:-2>
>
> == Optimized Logical Plan ==
> , argString:<
> [AttributeReference:name#1,AttributeReference:age#3], true, 1,
> StorageLevel(true, true, false, true, 1), (, argString:<
> [AttributeReference:name#1,AttributeReference:age#3]>), None>
>
> == Physical Plan ==
> , argString:<
> [AttributeReference:name#1,AttributeReference:age#3], (,
> argString:< [AttributeReference:name#1,AttributeReference:age#3], true,
> 1, StorageLevel(true, true, false, true, 1), (,
> argString:< [AttributeReference:name#1,AttributeReference:age#3]>), None>)>
>
>
>
>
>
>
>
>
>
>
>


Re: [spark1.5.1] HiveQl.parse throws org.apache.spark.sql.AnalysisException: null

2015-10-23 Thread Xiao Li
Hi, Sebastian,

To use private APIs, you have to be very familiar with the code path;
otherwise, it is very easy to hit an exception or a bug.

My suggestion is to use IntelliJ to step through the
function hiveContext.sql until you hit the parseSql API. Then you will
know whether you have to call other APIs before calling this one.

Note, lazy evaluation is a little bit annoying when you traverse the code
base.

Good luck,

Xiao Li


2015-10-21 3:06 GMT-07:00 Sebastian Nadorp <sebastian.nad...@nugg.ad>:

> What we're trying to achieve is a fast way of testing the validity of our
> SQL queries within unit tests without going through the time-consuming task
> of setting up a Hive test context.
> If there is any way to speed this step up, any help would be appreciated.
>
> Thanks,
> Sebastian
>
> *Sebastian Nadorp*
> Software Developer
>
> nugg.ad AG - Predictive Behavioral Targeting
> Rotherstraße 16 - 10245 Berlin
>
> sebastian.nad...@nugg.ad
>
> www.nugg.ad * http://blog.nugg.ad/ * www.twitter.com/nuggad *
> www.facebook.com/nuggad
>
> *Registergericht/District court*: Charlottenburg HRB 102226 B
> *Vorsitzender des Aufsichtsrates/Chairman of the supervisory board: *Dr.
> Detlev Ruland
> *Vorstand/Executive board:* Martin Hubert
>
> *nugg.ad <http://nugg.ad/> is a company of Deutsche Post DHL.*
>
>
> On Tue, Oct 20, 2015 at 9:20 PM, Xiao Li <gatorsm...@gmail.com> wrote:
>
>> Just curious why you are using parseSql APIs?
>>
>> It works well if you use the external APIs. For example, in your case:
>>
>> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>> hiveContext.sql("CREATE EXTERNAL TABLE IF NOT EXISTS `t`(`id` STRING,
>> `foo` INT) PARTITIONED BY (year INT, month INT, day INT) STORED AS PARQUET
>> Location 'temp'")
>>
>> Good luck,
>>
>> Xiao Li
>>
>>
>> 2015-10-20 10:23 GMT-07:00 Michael Armbrust <mich...@databricks.com>:
>>
>>> Thats not really intended to be a public API as there is some internal
>>> setup that needs to be done for Hive to work.  Have you created a
>>> HiveContext in the same thread?  Is there more to that stacktrace?
>>>
>>> On Tue, Oct 20, 2015 at 2:25 AM, Ayoub <benali.ayoub.i...@gmail.com>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> when upgrading to spark 1.5.1 from 1.4.1 the following code crashed on
>>>> runtime. It is mainly used to parse HiveQL queries and check that they
>>>> are
>>>> valid.
>>>>
>>>> package org.apache.spark.sql.hive
>>>>
>>>> val sql = "CREATE EXTERNAL TABLE IF NOT EXISTS `t`(`id` STRING, `foo`
>>>> INT)
>>>> PARTITIONED BY (year INT, month INT, day INT) STORED AS PARQUET Location
>>>> 'temp'"
>>>>
>>>> HiveQl.parseSql(sql)
>>>>
>>>> org.apache.spark.sql.AnalysisException: null;
>>>> at
>>>> org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:303)
>>>> at
>>>>
>>>> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41)
>>>> at
>>>>
>>>> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40)
>>>> at
>>>> scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
>>>> at
>>>> scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
>>>> at
>>>>
>>>> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
>>>> at
>>>>
>>>> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
>>>> at
>>>> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>>>> at
>>>>
>>>> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
>>>> at
>>>>
>>>> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
>>>> at
>>>> scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
>>>> at
>>>>
>>>> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>>>> at
>>>>
>>>> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.

Re: driver ClassNotFoundException when MySQL JDBC exceptions are thrown on executor

2015-10-22 Thread Xiao Li
A few months ago, I used the DB2 JDBC drivers. I hit a couple of issues
when using --driver-class-path. In the end, I used the following command to
bypass most of the issues:

./bin/spark-submit --jars
/Users/smile/db2driver/db2jcc.jar,/Users/smile/db2driver/db2jcc_license_cisuz.jar
--master local[*] --class com.sparkEngine.
/Users/smile/spark-1.3.1-bin-hadoop2.3/projects/SparkApps-master/spark-load-from-db/target/-1.0.jar

Hopefully, it works for you.

Xiao Li


2015-10-22 4:56 GMT-07:00 Akhil Das <ak...@sigmoidanalytics.com>:

> Did you try passing the mysql connector jar through --driver-class-path
>
> Thanks
> Best Regards
>
> On Sat, Oct 17, 2015 at 6:33 AM, Hurshal Patel <hpatel...@gmail.com>
> wrote:
>
>> Hi all,
>>
>> I've been struggling with a particularly puzzling issue after upgrading
>> to Spark 1.5.1 from Spark 1.4.1.
>>
>> When I use the MySQL JDBC connector and an exception (e.g.
>> com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException) is thrown on
>> the executor, I get a ClassNotFoundException on the driver, which results
>> in this error (logs are abbreviated):
>>
>> 15/10/16 17:20:59 INFO SparkContext: Starting job: collect at
>> repro.scala:73
>> ...
>> 15/10/16 17:20:59 INFO Executor: Running task 0.0 in stage 3.0 (TID 3)
>> 15/10/16 17:20:59 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID
>> 3)
>> com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException
>> at repro.Repro$$anonfun$main$3.apply$mcZI$sp(repro.scala:69)
>> ...
>> 15/10/16 17:20:59 WARN ThrowableSerializationWrapper: Task exception
>> could not be deserialized
>> java.lang.ClassNotFoundException:
>> com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>> ...
>> 15/10/16 17:20:59 ERROR TaskResultGetter: Could not deserialize
>> TaskEndReason: ClassNotFound with classloader
>> org.apache.spark.util.MutableURLClassLoader@7f08a6b1
>> 15/10/16 17:20:59 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 3,
>> localhost): UnknownReason
>> 15/10/16 17:20:59 ERROR TaskSetManager: Task 0 in stage 3.0 failed 1
>> times; aborting job
>> 15/10/16 17:20:59 INFO TaskSchedulerImpl: Removed TaskSet 3.0, whose
>> tasks have all completed, from pool
>> 15/10/16 17:20:59 INFO TaskSchedulerImpl: Cancelling stage 3
>> 15/10/16 17:20:59 INFO DAGScheduler: ResultStage 3 (collect at
>> repro.scala:73) failed in 0.012 s
>> 15/10/16 17:20:59 INFO DAGScheduler: Job 3 failed: collect at
>> repro.scala:73, took 0.018694 s
>>
>>  In Spark 1.4.1, I get the following (logs are abbreviated):
>> 15/10/16 17:42:41 INFO SparkContext: Starting job: collect at
>> repro.scala:53
>> ...
>> 15/10/16 17:42:41 INFO Executor: Running task 0.0 in stage 2.0 (TID 2)
>> 15/10/16 17:42:41 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID
>> 2)
>> com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException
>> at repro.Repro$$anonfun$main$2.apply$mcZI$sp(repro.scala:49)
>> ...
>> 15/10/16 17:42:41 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2,
>> localhost): com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException
>> at repro.Repro$$anonfun$main$2.apply$mcZI$sp(repro.scala:49)
>> ...
>>
>> 15/10/16 17:42:41 ERROR TaskSetManager: Task 0 in stage 2.0 failed 1
>> times; aborting job
>> 15/10/16 17:42:41 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose
>> tasks have all completed, from pool
>> 15/10/16 17:42:41 INFO TaskSchedulerImpl: Cancelling stage 2
>> 15/10/16 17:42:41 INFO DAGScheduler: ResultStage 2 (collect at
>> repro.scala:53) failed in 0.016 s
>> 15/10/16 17:42:41 INFO DAGScheduler: Job 2 failed: collect at
>> repro.scala:53, took 0.024584 s
>>
>>
>> I have seriously screwed up somewhere or this is a change in behavior
>> that I have not been able to find in the documentation. For those that are
>> interested, a full repro and logs follow.
>>
>> Hurshal
>>
>> ---
>>
>> I am running this on Spark 1.5.1+Hadoop 2.6. I have tried this in various
>> combinations of
>>  * local/standalone mode
>>  * putting mysql on the classpath with --jars/building a fat jar with
>> mysql in it/manually running sc.addJar on the mysql jar
>>  * --deploy-mode client/--deploy-mode cluster
>> but nothing seems to change.
>>
>>
>>
>> Here is an example invocation, and the accompanying source code:
>>
>> $ ./bin/spark-submit --master local --deploy-mode client --class
>> repro.Repro /home/nix/repro/target/scala-2.10/repro-assembly-0.0.1.j

Re: How to distinguish columns when joining DataFrames with shared parent?

2015-10-22 Thread Xiao Li
Actually, I found a design issue in self joins. When we have multi-layer
projections above an alias, the information about the relation between the alias
and the actual columns is lost. Thus, when resolving the alias in self joins,
the rules treat the alias (e.g., in a Projection) as a normal column. This
only happens when using DataFrames. When using SQL, the duplicate names
after a self join will block another self join.

We need a mechanism to trace back the original/actual column for each
alias, like what RDBMS optimizers do. The most efficient way is to
store the alias information directly in the node to indicate whether it comes
from an alias; otherwise, we need to traverse the underlying tree for each
column just to confirm whether or not it comes from an alias.

Good luck,

Xiao Li

2015-10-21 16:33 GMT-07:00 Isabelle Phan <nlip...@gmail.com>:

> Ok, got it.
> Thanks a lot Michael for the detailed reply!
> On Oct 21, 2015 1:54 PM, "Michael Armbrust" <mich...@databricks.com>
> wrote:
>
>> Yeah, I was suggesting that you avoid using  
>> org.apache.spark.sql.DataFrame.apply(colName:
>> String) when you are working with selfjoins as it eagerly binds to a
>> specific column in a what that breaks when we do the rewrite of one side of
>> the query.  Using the apply method constructs a resolved column eagerly
>> (which looses the alias information).
>>
>> On Wed, Oct 21, 2015 at 1:49 PM, Isabelle Phan <nlip...@gmail.com> wrote:
>>
>>> Thanks Michael and Ali for the reply!
>>>
>>> I'll make sure to use unresolved columns when working with self joins
>>> then.
>>>
>>> As pointed by Ali, isn't there still an issue with the aliasing? It
>>> works when using org.apache.spark.sql.functions.col(colName: String)
>>> method, but not when using org.apache.spark.sql.DataFrame.apply(colName:
>>> String):
>>>
>>> scala> j.select(col("lv.value")).show
>>> +-----+
>>> |value|
>>> +-----+
>>> |   10|
>>> |   20|
>>> +-----+
>>>
>>>
>>> scala> j.select(largeValues("lv.value")).show
>>> +-----+
>>> |value|
>>> +-----+
>>> |    1|
>>> |    5|
>>> +-----+
>>>
>>> Or does this behavior have the same root cause as detailed in Michael's
>>> email?
>>>
>>>
>>> -Isabelle
>>>
>>>
>>>
>>>
>>> On Wed, Oct 21, 2015 at 11:46 AM, Michael Armbrust <
>>> mich...@databricks.com> wrote:
>>>
>>>> Unfortunately, the mechanisms that we use to differentiate columns
>>>> automatically don't work particularly well in the presence of self joins.
>>>> However, you can get it work if you use the $"column" syntax
>>>> consistently:
>>>>
>>>> val df = Seq((1, 1), (1, 10), (2, 3), (3, 20), (3, 5), (4, 10)).toDF("key", "value")
>>>> val smallValues = df.filter('value < 10).as("sv")
>>>> val largeValues = df.filter('value >= 10).as("lv")
>>>>
>>>> smallValues
>>>>   .join(largeValues, $"sv.key" === $"lv.key")
>>>>   .select($"sv.key".as("key"), $"sv.value".as("small_value"), $"lv.value".as("large_value"))
>>>>   .withColumn("diff", $"small_value" - $"large_value")
>>>>   .show()
>>>>
>>>> +---+-----------+-----------+----+
>>>> |key|small_value|large_value|diff|
>>>> +---+-----------+-----------+----+
>>>> |  1|          1|         10|  -9|
>>>> |  3|          5|         20| -15|
>>>> +---+-----------+-----------+----+
>>>>
>>>>
>>>> The problem with the other cases is that calling
>>>> smallValues("columnName") or largeValues("columnName") is eagerly
>>>> resolving the attribute to the same column (since the data is actually
>>>> coming from the same place).  By the time we realize that you are joining
>>>> the data with itself (at which point we rewrite one side of the join to use
>>>> different expression ids) its too late.  At the core the problem is that in
>>>> Scala we have no easy way to differentiate largeValues("columnName")
>>>> from smallValues("columnName").  This is because the data is coming
>>>> from the same DataFrame and we don't actually know which variable name you
>>>> are using.  There are things we can change here, but its pretty hard to
>>>> change the semantics without breaking other use cases.
>>>>
>>>> So, this isn't a straight forward "bug", but its definitely a usability
>>>> issue.  For now, my advice would be: only use unresolved columns (i.e.
>>>> $"[alias.]column" or col("[alias.]column")) when working with self
>>>> joins.
>>>>
>>>> Michael
>>>>
>>>
>>>
>>


Re: Spark groupby and agg inconsistent and missing data

2015-10-22 Thread Xiao Li
Hi, Saif,

Could you post your code here? It might help others reproduce the errors
and give you a correct answer.

Thanks,

Xiao Li

2015-10-22 8:27 GMT-07:00 <saif.a.ell...@wellsfargo.com>:

> Hello everyone,
>
> I am doing some analytics experiments under a 4 server stand-alone cluster
> in a spark shell, mostly involving a huge database with groupBy and
> aggregations.
>
> I am picking 6 groupBy columns and returning various aggregated results in
> a dataframe. GroupBy fields are of two types, most of them are StringType
> and the rest are LongType.
>
> The data source is a DataFrame over split JSON files. Once the data is
> persisted, the result is consistent. But if I unload the memory and reload
> the data, the groupBy action returns different results, with missing
> data.
>
> Could I be missing something? this is rather serious for my analytics, and
> not sure how to properly diagnose this situation.
>
> Thanks,
> Saif
>
>


Re: Multiple joins in Spark

2015-10-20 Thread Xiao Li
Are you using hiveContext?

First, build your Spark using the following command:
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver
-DskipTests clean package

Then, try this sample program

import org.apache.spark.SparkContext

object SimpleApp {
  case class Individual(name: String, surname: String, birthDate: String)

  def main(args: Array[String]) {
    val sc = new SparkContext("local", "join DFs")
    //val sqlContext = new SQLContext(sc)
    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

    val rdd = sc.parallelize(Seq(
      Individual("a", "c", "10/10/1972"),
      Individual("b", "d", "10/11/1970")))

    val df = hiveContext.createDataFrame(rdd)

    df.registerTempTable("tab")

    val dfHive = hiveContext.sql("select * from tab")

    dfHive.show()
  }
}


2015-10-20 6:24 GMT-07:00 Shyam Parimal Katti <spk...@nyu.edu>:

> When I do the steps above and run a query like this:
>
> sqlContext.sql("select * from ...")
>
> I get exception:
>
> org.apache.spark.sql.AnalysisException: Non-local session path expected to
> be non-null;
>at org.apache.spark.sql.hive.HiveQL$.createPlan(HiveQl.scala:260)
>.
>
> I cannot paste the entire stack since it's on a company laptop and I am
> not allowed to copy paste things! Though if absolutely needed to help, I
> can figure out some way to provide it.
>
> On Sat, Oct 17, 2015 at 1:13 AM, Xiao Li <gatorsm...@gmail.com> wrote:
>
>> Hi, Shyam,
>>
>> The method registerTempTable is to register a DataFrame as a temporary
>> table in the Catalog using the given table name.
>>
>> In the Catalog, Spark maintains a concurrent hashmap, which contains the
>> pair of the table names and the logical plan.
>>
>> For example, when we submit the following query,
>>
>> SELECT * FROM inMemoryDF
>>
>> The concurrent hashmap contains one map from name to the Logical Plan:
>>
>> "inMemoryDF" -> "LogicalRDD [c1#0,c2#1,c3#2,c4#3], MapPartitionsRDD[1] at
>> createDataFrame at SimpleApp.scala:42
>>
>> Therefore, using SQL will not hurt your performance. The actual physical
>> plan to execute your SQL query is generated by the result of Catalyst
>> optimizer.
>>
>> Good luck,
>>
>> Xiao Li
>>
>>
>>
>> 2015-10-16 20:53 GMT-07:00 Shyam Parimal Katti <spk...@nyu.edu>:
>>
>>> Thanks Xiao! A question about the internals: would you know what happens
>>> when registerTempTable() is called? I.e., does it create an RDD internally,
>>> or some internal representation that lets it join with an RDD?
>>>
>>> Again, thanks for the answer.
>>> On Oct 16, 2015 8:15 PM, "Xiao Li" <gatorsm...@gmail.com> wrote:
>>>
>>>> Hi, Shyam,
>>>>
>>>> You still can use SQL to do the same thing in Spark:
>>>>
>>>> For example,
>>>>
>>>> val df1 = sqlContext.createDataFrame(rdd)
>>>> val df2 = sqlContext.createDataFrame(rdd2)
>>>> val df3 = sqlContext.createDataFrame(rdd3)
>>>> df1.registerTempTable("tab1")
>>>> df2.registerTempTable("tab2")
>>>> df3.registerTempTable("tab3")
>>>> val exampleSQL = sqlContext.sql("select * from tab1, tab2, tab3
>>>> where tab1.name = tab2.name and tab2.id = tab3.id")
>>>>
>>>> Good luck,
>>>>
>>>> Xiao Li
>>>>
>>>> 2015-10-16 17:01 GMT-07:00 Shyam Parimal Katti <spk...@nyu.edu>:
>>>>
>>>>> Hello All,
>>>>>
>>>>> I have a following SQL query like this:
>>>>>
>>>>> select a.a_id, b.b_id, c.c_id from table_a a join table_b b on a.a_id
>>>>> = b.a_id join table_c c on b.b_id = c.b_id
>>>>>
>>>>> In scala i have done this so far:
>>>>>
>>>>> table_a_rdd = sc.textFile(...)
>>>>> table_b_rdd = sc.textFile(...)
>>>>> table_c_rdd = sc.textFile(...)
>>>>>
>>>>> val table_a_rowRDD = table_a_rdd.map(_.split("\\x07")).map(line =>
>>>>> (line(0), line))
>>>>> val table_b_rowRDD = table_b_rdd.map(_.split("\\x07")).map(line =>
>>>>> (line(0), line))
>>>>> val table_c_rowRDD = table_c_rdd.map(_.split("\\x07")).map(line =>
>>>>> (line(0), line))
>>>>>
>>>>> Each line has the first value at its primary key.
>>>>>
>>>>> While I can join 2 RDDs using table_a_rowRDD.join(table_b_rowRDD) to
>>>>> join, is it possible to join multiple RDDs in a single expression? like
>>>>> table_a_rowRDD.join(table_b_rowRDD).join(table_c_rowRDD) ? Also, how can I
>>>>> specify the column on which I can join multiple RDDs?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>
>


Re: [spark1.5.1] HiveQl.parse throws org.apache.spark.sql.AnalysisException: null

2015-10-20 Thread Xiao Li
Just curious why you are using parseSql APIs?

It works well if you use the external APIs. For example, in your case:

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.sql("CREATE EXTERNAL TABLE IF NOT EXISTS `t`(`id` STRING, `foo`
INT) PARTITIONED BY (year INT, month INT, day INT) STORED AS PARQUET
Location 'temp'")

Good luck,

Xiao Li


2015-10-20 10:23 GMT-07:00 Michael Armbrust <mich...@databricks.com>:

> Thats not really intended to be a public API as there is some internal
> setup that needs to be done for Hive to work.  Have you created a
> HiveContext in the same thread?  Is there more to that stacktrace?
>
> On Tue, Oct 20, 2015 at 2:25 AM, Ayoub <benali.ayoub.i...@gmail.com>
> wrote:
>
>> Hello,
>>
>> when upgrading to spark 1.5.1 from 1.4.1 the following code crashed on
>> runtime. It is mainly used to parse HiveQL queries and check that they are
>> valid.
>>
>> package org.apache.spark.sql.hive
>>
>> val sql = "CREATE EXTERNAL TABLE IF NOT EXISTS `t`(`id` STRING, `foo` INT)
>> PARTITIONED BY (year INT, month INT, day INT) STORED AS PARQUET Location
>> 'temp'"
>>
>> HiveQl.parseSql(sql)
>>
>> org.apache.spark.sql.AnalysisException: null;
>> at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:303)
>> at
>>
>> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41)
>> at
>>
>> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40)
>> at
>> scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
>> at
>> scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
>> at
>>
>> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
>> at
>>
>> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
>> at
>> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>> at
>>
>> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
>> at
>>
>> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
>> at
>> scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
>> at
>>
>> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>> at
>>
>> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>> at
>> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>> at
>>
>> scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
>> at
>>
>> scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
>> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
>> at
>> scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
>> at
>>
>> scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
>> at
>>
>> org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:34)
>> at org.apache.spark.sql.hive.HiveQl$.parseSql(HiveQl.scala:277)
>> at
>> org.apache.spark.sql.hive.SQLChecker$$anonfun$1.apply(SQLChecker.scala:9)
>> at
>> org.apache.spark.sql.hive.SQLChecker$$anonfun$1.apply(SQLChecker.scala:9)
>>
>> Should that be done differently on spark 1.5.1 ?
>>
>> Thanks,
>> Ayoub
>>
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/spark1-5-1-HiveQl-parse-throws-org-apache-spark-sql-AnalysisException-null-tp25138.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: Spark SQL: Preserving Dataframe Schema

2015-10-20 Thread Xiao Li
Let me share my 2 cents.

First, this is not covered in the official documentation. Maybe we should
document it? http://spark.apache.org/docs/latest/sql-programming-guide.html

Second, nullability is a significant concept for database people. It is
part of the schema. Extra code is needed to evaluate whether a value is null
for every nullable data type. Thus, it might cause a problem if you need
to use Spark to transfer data between Parquet and an RDBMS. My suggestion
is to introduce another external parameter.

Thanks,

Xiao Li
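
For readers who need the original nullability back after such a round trip, one possible sketch (not a supported switch, just re-applying the schema that was captured before the write; this assumes the column order still matches, so a select in the original order may be needed first):

// originalSchema: the StructType taken from the DataFrame before it was written out.
val readBack = sqlContext.read.parquet("/tmp/t.parquet")            // illustrative path
val restored = sqlContext.createDataFrame(readBack.rdd, originalSchema)

restored.printSchema()   // nullable flags now follow originalSchema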


2015-10-20 10:20 GMT-07:00 Michael Armbrust <mich...@databricks.com>:

> For compatibility reasons, we always write data out as nullable in
> parquet.  Given that that bit is only an optimization that we don't
> actually make much use of, I'm curious why you are worried that its
> changing to true?
>
> On Tue, Oct 20, 2015 at 8:24 AM, Jerry Lam <chiling...@gmail.com> wrote:
>
>> Hi Spark users and developers,
>>
>> I have a dataframe with the following schema (Spark 1.5.1):
>>
>> StructType(StructField(type,StringType,true),
>> StructField(timestamp,LongType,false))
>>
>> After I save the dataframe in parquet and read it back, I get the
>> following schema:
>>
>> StructType(StructField(timestamp,LongType,true),
>> StructField(type,StringType,true))
>>
>> As you can see the schema does not match. The nullable field is set to
>> true for timestamp upon reading the dataframe back. Is there a way to
>> preserve the schema so that what we write to will be what we read back?
>>
>> Best Regards,
>>
>> Jerry
>>
>
>


Re: Spark SQL: Preserving Dataframe Schema

2015-10-20 Thread Xiao Li
Sure. Will try to do a pull request this week.

Schema evolution is always painful for database people. IMO, NULL is a bad
design in the original System R. It introduces a lot of problems during
system migration and data integration.

Let me sketch a possible scenario: an RDBMS is used as an ODS, and Spark is used as
an external online data analysis engine. The results could be stored in
Parquet files and inserted back into the RDBMS at every interval. In this case, we
face a few options:

- Change the data types of the columns in the RDBMS tables to support possible
nullable values; the logic of the RDBMS applications that consume these
results must then also support NULL. When the applications are third-party,
changing them becomes harder.

- As you suggested, before loading the data from the Parquet files, we
need to add an extra step that does data cleaning, value
transformation, or exception reporting whenever a NULL is found.

If we had such an external parameter, then when writing the data schema to an external
data store, Spark would do its best to keep the original schema without any
change (e.g., keep the initial definition of nullability). If some data
type/schema conversions are not avoidable, it would issue warnings or errors
to the users. Does that make sense?

Thanks,

Xiao Li

2015-10-20 12:38 GMT-07:00 Michael Armbrust <mich...@databricks.com>:

> First, this is not documented in the official document. Maybe we should do
>> it? http://spark.apache.org/docs/latest/sql-programming-guide.html
>>
>
> Pull requests welcome.
>
>
>> Second, nullability is a significant concept in the database people. It
>> is part of schema. Extra codes are needed for evaluating if a value is null
>> for all the nullable data types. Thus, it might cause a problem if you need
>> to use Spark to transfer the data between parquet and RDBMS. My suggestion
>> is to introduce another external parameter?
>>
>
> Sure, but a traditional RDBMS has the opportunity to do validation before
> loading data in.  Thats not really an option when you are reading random
> files from S3.  This is why Hive and many other systems in this space treat
> all columns as nullable.
>
> What would the semantics of this proposed external parameter be?
>


Re: Dynamic partition pruning

2015-10-16 Thread Xiao Li
Hi, Younes,

Maybe you can open a JIRA?

Thanks,

Xiao Li

2015-10-16 12:43 GMT-07:00 Younes Naguib <younes.nag...@tritondigital.com>:

> Thanks,
>
> Do you have a Jira I can follow for this?
>
>
>
> y
>
>
>
> *From:* Michael Armbrust [mailto:mich...@databricks.com]
> *Sent:* October-16-15 2:18 PM
> *To:* Younes Naguib
> *Cc:* user@spark.apache.org
> *Subject:* Re: Dynamic partition pruning
>
>
>
> We don't support dynamic partition pruning yet.
>
>
>
> On Fri, Oct 16, 2015 at 10:20 AM, Younes Naguib <
> younes.nag...@tritondigital.com> wrote:
>
> Hi all
>
>
>
> I’m running sqls on spark 1.5.1 and using tables based on parquets.
>
> My tables are not pruned when joined on partition columns.
>
> Ex:
>
> Select  from tab where partcol=1 will prune on value 1
>
> Select  from tab join dim on (dim.partcol=tab.partcol) where
> dim.partcol=1 will scan all partitions.
>
>
>
> Any ideas or workarounds?
>
>
>
>
>
> *Thanks,*
>
> *Younes*
>
>
>
>
>


Re: Problem of RDD in calculation

2015-10-16 Thread Xiao Li
Hi, Frank,

After registering these DataFrames as temp tables (via the registerTempTable API),
you can do it using SQL. I believe this should be much easier.
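
A minimal sketch of what that SQL could look like, assuming the two DataFrames
are registered under the hypothetical names df1 and df2 and using the column
names from your message:

df1.registerTempTable("df1")
df2.registerTempTable("df2")

// average, max and min of dataUsage and duration per userID, limited to DF2's users
val stats = sqlContext.sql("""
  SELECT d1.userID,
         AVG(d1.dataUsage) AS avg_usage, MAX(d1.dataUsage) AS max_usage, MIN(d1.dataUsage) AS min_usage,
         AVG(d1.duration)  AS avg_dur,   MAX(d1.duration)  AS max_dur,   MIN(d1.duration)  AS min_dur
  FROM df1 d1
  JOIN df2 d2 ON d1.userID = d2.userID
  GROUP BY d1.userID""")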

Good luck,

Xiao Li

2015-10-16 12:10 GMT-07:00 ChengBo <cheng...@huawei.com>:

> Hi all,
>
>
>
> I am new in Spark, and I have a question in dealing with RDD.
>
> I’ve converted RDD to DataFrame. So there are two DF: DF1 and DF2
>
> DF1 contains: userID, time, dataUsage, duration
>
> DF2 contains: userID
>
>
>
> Each userID has multiple rows in DF1.
>
> DF2 has distinct userID, and I would like to compute the average, max and
> min values of both dataUsage and duration for each userID in DF1,
> and store the results in a new dataframe.
> And store the results in a new dataframe.
>
> How can I do that?
>
> Thanks a lot.
>
>
>
> Best
>
> Frank
>


Re: Multiple joins in Spark

2015-10-16 Thread Xiao Li
Hi, Shyam,

You still can use SQL to do the same thing in Spark:

For example,

val df1 = sqlContext.createDataFrame(rdd)
val df2 = sqlContext.createDataFrame(rdd2)
val df3 = sqlContext.createDataFrame(rdd3)
df1.registerTempTable("tab1")
df2.registerTempTable("tab2")
df3.registerTempTable("tab3")
val exampleSQL = sqlContext.sql(
  "select * from tab1, tab2, tab3 where tab1.name = tab2.name and tab2.id = tab3.id")

Good luck,

Xiao Li

2015-10-16 17:01 GMT-07:00 Shyam Parimal Katti <spk...@nyu.edu>:

> Hello All,
>
> I have a following SQL query like this:
>
> select a.a_id, b.b_id, c.c_id from table_a a join table_b b on a.a_id =
> b.a_id join table_c c on b.b_id = c.b_id
>
> In scala i have done this so far:
>
> table_a_rdd = sc.textFile(...)
> table_b_rdd = sc.textFile(...)
> table_c_rdd = sc.textFile(...)
>
> val table_a_rowRDD = table_a_rdd.map(_.split("\\x07")).map(line =>
> (line(0), line))
> val table_b_rowRDD = table_b_rdd.map(_.split("\\x07")).map(line =>
> (line(0), line))
> val table_c_rowRDD = table_c_rdd.map(_.split("\\x07")).map(line =>
> (line(0), line))
>
> Each line has the first value as its primary key.
>
> While I can join 2 RDDs using table_a_rowRDD.join(table_b_rowRDD) to join,
> is it possible to join multiple RDDs in a single expression? like
> table_a_rowRDD.join(table_b_rowRDD).join(table_c_rowRDD) ? Also, how can I
> specify the column on which I can join multiple RDDs?
>
>
>
>
>


Re: Problem of RDD in calculation

2015-10-16 Thread Xiao Li
For most programmers, DataFrames are preferred thanks to their flexibility,
but the SQL syntax is a great option for users who feel more comfortable
with SQL. :)
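
To illustrate that flexibility (see Ali's note below about building expressions
programmatically), here is a sketch that generates avg/max/min for every metric
column instead of writing each aggregate out by hand; the column names are the
ones from Frank's DF1:

import org.apache.spark.sql.functions.{avg, max, min}

val metrics = Seq("dataUsage", "duration")
val aggExprs = metrics.flatMap(c =>
  Seq(avg(c).as(s"avg_$c"), max(c).as(s"max_$c"), min(c).as(s"min_$c")))

// the join restricts DF1 to DF2's users; agg takes one Column plus varargs
val result = df1.join(df2, "userID").groupBy("userID").agg(aggExprs.head, aggExprs.tail: _*)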

2015-10-16 18:22 GMT-07:00 Ali Tajeldin EDU :

> Since DF2 only has the userID, I'm assuming you are using DF2 to filter
> for desired userIDs.
> You can just use the join() and groupBy operations on DataFrame to do what
> you desire.  For example:
>
> scala> val df1=app.createDF("id:String; v:Integer", "X,1;X,2;Y,3;Y,4;Z,10")
> df1: org.apache.spark.sql.DataFrame = [id: string, v: int]
>
> scala> df1.show
> +---+---+
> | id|  v|
> +---+---+
> |  X|  1|
> |  X|  2|
> |  Y|  3|
> |  Y|  4|
> |  Z| 10|
> +---+---+
>
> scala> val df2=app.createDF("id:String", "X;Y")
> df2: org.apache.spark.sql.DataFrame = [id: string]
>
> scala> df2.show
> +---+
> | id|
> +---+
> |  X|
> |  Y|
> +---+
>
> scala> df1.join(df2, "id").groupBy("id").agg(avg("v") as "avg_v", min("v")
> as "min_v").show
> +---+-----+-----+
> | id|avg_v|min_v|
> +---+-----+-----+
> |  X|  1.5|    1|
> |  Y|  3.5|    3|
> +---+-----+-----+
>
>
> Notes:
> * The above uses createDF method in SmvApp from SMV package, but the rest
> of the code is just standard Spark DataFrame ops.
> * One advantage of doing this using DataFrame rather than SQL is that you
> can build the expressions programmatically (e.g. imagine doing this for 100
> columns instead of 2).
>
> ---
> Ali
>
>
> On Oct 16, 2015, at 12:10 PM, ChengBo  wrote:
>
> Hi all,
>
> I am new in Spark, and I have a question in dealing with RDD.
> I’ve converted RDD to DataFrame. So there are two DF: DF1 and DF2
> DF1 contains: userID, time, dataUsage, duration
> DF2 contains: userID
>
> Each userID has multiple rows in DF1.
> DF2 has distinct userID, and I would like to compute the average, max and
> min values of both dataUsage and duration for each userID in DF1,
> and store the results in a new dataframe.
> How can I do that?
> Thanks a lot.
>
> Best
> Frank
>
>
>


Re: Multiple joins in Spark

2015-10-16 Thread Xiao Li
Hi, Shyam,

The method registerTempTable registers a DataFrame as a temporary table in the
Catalog under the given table name.

In the Catalog, Spark maintains a concurrent hashmap that maps each registered
table name to its logical plan.

For example, when we submit the following query,

SELECT * FROM inMemoryDF

The concurrent hashmap contains one mapping from the name to the logical plan:

"inMemoryDF" -> "LogicalRDD [c1#0,c2#1,c3#2,c4#3], MapPartitionsRDD[1] at
createDataFrame at SimpleApp.scala:42"

Therefore, using SQL will not hurt your performance. The actual physical
plan used to execute your SQL query is generated by the Catalyst optimizer.
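
A quick way to confirm this is to compare the physical plans. A sketch, assuming
df is any DataFrame you have registered:

df.registerTempTable("inMemoryDF")

// both statements print the same physical plan once Catalyst has optimized the query
sqlContext.sql("SELECT * FROM inMemoryDF").explain()
df.explain()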

Good luck,

Xiao Li



2015-10-16 20:53 GMT-07:00 Shyam Parimal Katti <spk...@nyu.edu>:

> Thanks Xiao! Question about the internals, would you know what happens
> when createTempTable() is called? I.e., does it create an RDD internally
> or some internal representation that lets it join with an RDD?
>
> Again, thanks for the answer.
> On Oct 16, 2015 8:15 PM, "Xiao Li" <gatorsm...@gmail.com> wrote:
>
>> Hi, Shyam,
>>
>> You still can use SQL to do the same thing in Spark:
>>
>> For example,
>>
>> val df1 = sqlContext.createDataFrame(rdd)
>> val df2 = sqlContext.createDataFrame(rdd2)
>> val df3 = sqlContext.createDataFrame(rdd3)
>> df1.registerTempTable("tab1")
>> df2.registerTempTable("tab2")
>> df3.registerTempTable("tab3")
>> val exampleSQL = sqlContext.sql(
>>   "select * from tab1, tab2, tab3 where tab1.name = tab2.name and tab2.id = tab3.id")
>>
>> Good luck,
>>
>> Xiao Li
>>
>> 2015-10-16 17:01 GMT-07:00 Shyam Parimal Katti <spk...@nyu.edu>:
>>
>>> Hello All,
>>>
>>> I have a following SQL query like this:
>>>
>>> select a.a_id, b.b_id, c.c_id from table_a a join table_b b on a.a_id =
>>> b.a_id join table_c c on b.b_id = c.b_id
>>>
>>> In scala i have done this so far:
>>>
>>> table_a_rdd = sc.textFile(...)
>>> table_b_rdd = sc.textFile(...)
>>> table_c_rdd = sc.textFile(...)
>>>
>>> val table_a_rowRDD = table_a_rdd.map(_.split("\\x07")).map(line =>
>>> (line(0), line))
>>> val table_b_rowRDD = table_b_rdd.map(_.split("\\x07")).map(line =>
>>> (line(0), line))
>>> val table_c_rowRDD = table_c_rdd.map(_.split("\\x07")).map(line =>
>>> (line(0), line))
>>>
>>> Each line has the first value as its primary key.
>>>
>>> While I can join 2 RDDs using table_a_rowRDD.join(table_b_rowRDD) to
>>> join, is it possible to join multiple RDDs in a single expression? like
>>> table_a_rowRDD.join(table_b_rowRDD).join(table_c_rowRDD) ? Also, how can I
>>> specify the column on which I can join multiple RDDs?
>>>
>>>
>>>
>>>
>>>
>>


Re: How to speed up reading from file?

2015-10-16 Thread Xiao Li
Hi, Saif,

The optimal number of rows per partition depends on many factors: for
example, your row size, your file system configuration, your replication
configuration, and the performance of your underlying hardware. The best way
is to do performance testing and tune your configuration. Generally, if each
batch contains just a few MB, performance is poor compared with a bigger
batch.
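
For such experiments, the partition count (and therefore the file and batch
sizes) can be controlled explicitly before writing. A sketch against Spark 1.5;
the target count and path are assumptions to tune for your data:

// 64 is only a starting point; tune it against total volume and row size
df.repartition(64)
  .write.mode("overwrite")
  .parquet("/data/intermediate/events")

// read back later in a distributed way
val reread = sqlContext.read.parquet("/data/intermediate/events")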

Check the following paper regarding the performance of Spark and MR,
http://www.vldb.org/pvldb/vol8/p2110-shi.pdf. It might help you understand
your use case. For example, caching can be used in your system.

Good luck,

Xiao Li

2015-10-16 14:08 GMT-07:00 <saif.a.ell...@wellsfargo.com>:

> Hello,
>
> Is there an optimal number of partitions per number of rows, when writing
> to disk, so we can re-read later from the source in a distributed way?
> Any thoughts?
>
> Thanks
> Saif
>
>


Re: SparkSQL: First query execution is always slower than subsequent queries

2015-10-12 Thread Xiao Li
Hi, Lloyd,

Are both runs cold, or both warm? Memory/cache hits and misses could be a big
factor if your application is IO-intensive. You need to monitor your system to
understand what your bottleneck is.
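
If cache effects are the suspect, one simple experiment is to warm the relevant
table before timing the queries. A sketch, assuming the data is registered as a
temp table named "events":

sqlContext.cacheTable("events")                          // mark the table for in-memory caching
sqlContext.sql("SELECT COUNT(*) FROM events").collect()  // force materialization of the cache

// queries issued after this point should no longer pay the initial IO cost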

Good luck,

Xiao Li


Re: Best practices to call small spark jobs as part of REST api

2015-10-12 Thread Xiao Li
The design depends mainly on your use cases. You have to think about the
requirements and rank them.

For example, if your application cares about response time and can tolerate
stale data, using a NoSQL database as a middleware is a good option.
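
A minimal sketch of that pattern: the Spark job periodically refreshes the
precomputed results, and the REST layer only reads from that location (the
result DataFrame, path, and output format here are assumptions):

// batch job side: overwrite the serving copy on each run
resultDF.write.mode("overwrite").json("/serving/latest_results")

// the REST server then reads /serving/latest_results (or a NoSQL table populated
// the same way) without touching Spark on the request path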

Good Luck,

Xiao Li


2015-10-11 21:00 GMT-07:00 Nuthan Kumar <mnut...@gmail.com>:

> If the data is also on-demand, Spark as a back end is also a good option.
>
>
>
> Sent from Outlook Mail <http://go.microsoft.com/fwlink/?LinkId=550987>
> for Windows 10 phone
>
>
>
>
>
>
> *From: *Akhil Das
> *Sent: *Sunday, October 11, 2015 1:32 AM
> *To: *unk1102
> *Cc: *user@spark.apache.org
> *Subject: *Re: Best practices to call small spark jobs as part of REST api
>
>
>
>
>
> One approach would be to make your spark job push the computed results
> (json) to a database and your REST server can pull it from there and power
> the UI.
>
>
> Thanks
>
> Best Regards
>
>
>
> On Wed, Sep 30, 2015 at 12:26 AM, unk1102 <umesh.ka...@gmail.com> wrote:
>
> Hi, I would like to know any best practices for calling Spark jobs from a REST
> API. My Spark jobs return results as JSON, and that JSON can be used by a UI
> application.
>
> Should we even have direct HDFS/Spark backend layer in UI for on demand
> queries? Please guide. Thanks much.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Best-practices-to-call-small-spark-jobs-as-part-of-REST-api-tp24872.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
>
>
>
>
>


Re: Spark handling parallel requests

2015-10-12 Thread Xiao Li
Hi, Tarek,

It is hard to answer your question. Are these requests similar? Could you
cache your results or intermediate results in your application? Or does that
mean your throughput requirement is very high? Could you throttle the number
of concurrent requests? ...

As Akhil said, Kafka might help in your case. Otherwise, you need to read
the design documents or even the source code of Kafka and Spark Streaming.
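
For reference, a minimal direct-stream sketch against Spark 1.5 (it needs the
spark-streaming-kafka artifact on the classpath; the broker list, topic name,
and batch interval are assumptions):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(sc, Seconds(5))
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")

// each incoming request becomes a Kafka message; Spark handles them per micro-batch
val requests = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("requests"))

requests.map(_._2).foreachRDD { rdd =>
  // process one micro-batch of requests with ordinary Spark operations here
}

ssc.start()
ssc.awaitTermination()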

 Best wishes,

Xiao Li


2015-10-11 23:19 GMT-07:00 Akhil Das <ak...@sigmoidanalytics.com>:

> Instead of pushing your requests to the socket, why don't you push them to
> a Kafka or any other message queue and use spark streaming to process them?
>
> Thanks
> Best Regards
>
> On Mon, Oct 5, 2015 at 6:46 PM, <tarek.abouzei...@yahoo.com.invalid>
> wrote:
>
>> Hi ,
>> I am using Scala and have written a socket program to catch multiple requests
>> at the same time, then call a function which uses Spark to handle each request.
>> I have a multi-threaded server to handle the multiple requests and pass each to
>> Spark, but there's a bottleneck: Spark doesn't initialize a sub-task for the new
>> request. Is it even possible to do parallel processing using a single Spark job?
>> Best Regards,
>>
>> --  Best Regards, -- Tarek Abouzeid
>>
>
>


Re: Kafka and Spark combination

2015-10-09 Thread Xiao Li
Please see the following discussion:

http://search-hadoop.com/m/YGbbS0SqClMW5T1

Thanks,

Xiao Li

2015-10-09 6:17 GMT-07:00 Nikhil Gs <gsnikhil1432...@gmail.com>:

> Has anyone worked with Kafka in a scenario where the streaming data from
> the Kafka consumer is picked up by Spark (Java) functionality and directly
> placed in HBase?
>
> Regards,
> Gs.
>


Re: Best storage format for intermediate process

2015-10-09 Thread Xiao Li
Hi, Saif,

This depends on your use cases. For example, do you want to do a table scan
every time? Or fetch a specific row? Or run a temporal query? Do you have
security concerns when choosing your target-side data store?

Offloading a huge table is also very expensive and time consuming. If the
source side is a mainframe, it could also consume a lot of MIPS. Thus, the
best approach is to save the data in persistent storage without any
transformation, and then transform and store it based on your query types.
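
A sketch of that two-step approach in Spark 1.5 (the source DataFrame, paths,
and partition column are assumptions):

// step 1: land the offloaded table once, untransformed, as typed and compressed Parquet
jdbcDF.write.mode("overwrite").parquet("/staging/big_table")

// step 2: derive query-shaped copies from the staged data, e.g. partitioned by date
val staged = sqlContext.read.parquet("/staging/big_table")
staged.write.partitionBy("load_date").parquet("/marts/big_table_by_date")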

Thanks,

Xiao Li


2015-10-09 11:25 GMT-07:00 <saif.a.ell...@wellsfargo.com>:

> Hi all,
>
> I am in the process of learning big data.
> Right now, I am bringing huge databases through JDBC to Spark (a 250
> million rows table can take around 3 hours), and then re-saving it into
> JSON, which is fast, simple, distributed, fail-safe and stores data types,
> although without any compression.
>
> Reading from distributed JSON takes for this amount of data, around 2-3
> minutes and works good enough for me. But, do you suggest or prefer any
> other format for intermediate storage, for fast and proper types reading?
> Not only for intermediate between a network database, but also for
> intermediate dataframe transformations to have data ready for processing.
>
> I have tried CSV, but computational type inference does not usually fit my
> needs and takes a long time. Haven’t tried parquet since they fixed it for
> 1.5, but that is also another option.
> What do you also think of HBase, Hive or any other type?
>
> Looking for insights!
> Saif
>
>


Re: Datastore or DB for spark

2015-10-09 Thread Xiao Li
FYI, in my local environment, Spark is connected to DB2 on z/OS but that
requires a special JDBC driver.
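
For reference, a sketch of such a JDBC read; the URL, driver class, and table
name are placeholders, and the vendor driver jar must be on the classpath:

val db2DF = sqlContext.read.format("jdbc")
  .option("url", "jdbc:db2://host:446/LOCATION")   // placeholder connection string
  .option("driver", "com.ibm.db2.jcc.DB2Driver")   // vendor JDBC driver class
  .option("dbtable", "SCHEMA.SOME_TABLE")
  .load()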

Xiao Li


2015-10-09 8:38 GMT-07:00 Rahul Jeevanandam <rahu...@incture.com>:

> Hi Jörn Franke
>
> I was sure that a relational database wouldn't be a good option for Spark.
> But what about distributed databases like Hbase, Cassandra, etc?
>
> On Fri, Oct 9, 2015 at 7:21 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>
>> I am not aware of any empirical evidence, but I think Hadoop (HDFS) as a
>> datastore for Spark is quite common. With relational databases you usually
>> do not have so much data and you do not benefit from data locality.
>>
>> Le ven. 9 oct. 2015 à 15:16, Rahul Jeevanandam <rahu...@incture.com> a
>> écrit :
>>
>>> I want to know what everyone is using. Which datastore is popular in the
>>> Spark community?
>>>
>>> On Fri, Oct 9, 2015 at 6:16 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>
>>>> There are connectors for hbase, Cassandra, etc.
>>>>
>>>> Which data store do you use now ?
>>>>
>>>> Cheers
>>>>
>>>> On Oct 9, 2015, at 3:10 AM, Rahul Jeevanandam <rahu...@incture.com>
>>>> wrote:
>>>>
>>>> Hi Guys,
>>>>
>>>>  I wanted to know what are the databases that you associate with Spark?
>>>>
>>>> --
>>>> Regards,
>>>>
>>>> *Rahul J*
>>>>
>>>>
>>>
>>>
>>> --
>>> Regards,
>>>
>>> *Rahul J*
>>>
>>
>
>
> --
> Regards,
> *Rahul J*
> Associate Architect – Technology
> Incture <http://www.incture.com/>
>